[ 
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131635#comment-13131635
 ] 

Markus Jelsma commented on NUTCH-1098:
--------------------------------------

I am happy with the encoding of space to %20, but I am still not sure that 
decoding is without problems, especially decoding %2F to / (slash). The two 
are not equivalent, because a literal slash delimits path segments. I went 
through part of our CrawlDB searching for %2F occurrences and got a list. I 
tried several URLs, alternating between %2F and /. At least a few URLs 
return different content.

This is correct:
http://www.detelefoongids.nl/bg-l/18236834-B%2Fmak+Bedrijfsmakelaars/vermelding/

But this is incorrect:
http://www.detelefoongids.nl/bg-l/18236834-B/mak+Bedrijfsmakelaars/vermelding/

This also applies to decoding %23 to # (hash). Once this character is 
decoded, valid URLs appear to have a URL fragment, which is then normalized 
away. See these examples:

http://www.noordhollandsdagblad.nl/nieuws/stadstreek/denhelder/article9516101.ece/%23Nieuwedweep%3A-Ambassadeur-van-Den-Helder-zijn-we-allemaal
http://www.motorstek.nl/gebruiker/fido%20%2377

In the above examples the encoding of the hash is intended and correct. The 
content contains a hash, but the URL does not contain a fragment. This one 
would: 
http://www.motorstek.nl/gebruiker/fido%20%2377#part-of-page
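
The effect on the hash examples above can be reproduced with java.net.URI. 
This is only a minimal sketch and not the code from the attached patches: 
the class FragmentDemo and the helper naiveDecodeHash are hypothetical names 
that stand in for any normalizer that decodes %23 unconditionally.

```java
import java.net.URI;

class FragmentDemo {
    /** Returns the raw fragment java.net.URI sees in the URL, or null if there is none. */
    static String fragmentOf(String url) {
        return URI.create(url).getRawFragment();
    }

    /** Deliberately naive (hypothetical) decoder: turns every %23 into a literal '#'. */
    static String naiveDecodeHash(String url) {
        return url.replace("%23", "#");
    }

    public static void main(String[] args) {
        String original = "http://www.motorstek.nl/gebruiker/fido%20%2377";
        String decoded = naiveDecodeHash(original);

        // The escaped hash is part of the path, so there is no fragment.
        System.out.println(fragmentOf(original));              // null
        // After decoding, "77" is parsed as a fragment ...
        System.out.println(fragmentOf(decoded));               // 77
        // ... and the path that remains is a different resource.
        System.out.println(URI.create(decoded).getRawPath());  // /gebruiker/fido%20
    }
}
```

A normalizer that strips fragments would then reduce the decoded URL to 
http://www.motorstek.nl/gebruiker/fido%20, which is not the page that was 
linked.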

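Another example of bad decoding, where the escaped sequence %26%23235%3B 
would decode to &#235; and so introduce a literal ampersand and hash: 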
http://st-annaland.citysite.nl/nieuws/6276724/2816_Rosenthal+ontvangt+oppositieleider+Syri%26%23235%3B.html.

I propose not to decode characters that have special meaning in URIs and to 
be very careful in tests. As this is a crucial issue and the implementation 
is not sound yet, I propose to push this to 1.5, as I'm not comfortable 
introducing a potentially big issue just before 1.4 is released.
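
One conservative way to read this proposal is to decode only escapes whose 
value is an RFC 3986 "unreserved" character (letters, digits, '-', '.', '_', 
'~') and to leave every other escape untouched. The sketch below assumes 
that rule. It is not the Nutch implementation, SelectiveDecoder and its 
methods are hypothetical names, and example.com is a placeholder host.

```java
class SelectiveDecoder {
    /** RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~". */
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    /** Value of an ASCII hex digit, or -1 if the character is not one. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes %XX only when XX is an unreserved character; all else is copied as-is. */
    static String decodeUnreserved(String url) {
        StringBuilder out = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            char ch = url.charAt(i);
            if (ch == '%' && i + 2 < url.length()) {
                int hi = hexValue(url.charAt(i + 1));
                int lo = hexValue(url.charAt(i + 2));
                if (hi >= 0 && lo >= 0 && isUnreserved(hi * 16 + lo)) {
                    out.append((char) (hi * 16 + lo)); // safe to decode
                    i += 3;
                    continue;
                }
            }
            out.append(ch); // reserved escape, malformed escape, or plain character
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // %7E (~) and %41 (A) are unreserved and decoded; %2F (/) is kept.
        System.out.println(decodeUnreserved("http://example.com/%7Euser/%41%2Fb"));
        // prints http://example.com/~user/A%2Fb

        // %20 (space) and %23 (#) are not unreserved, so the URL is unchanged.
        System.out.println(decodeUnreserved("http://www.motorstek.nl/gebruiker/fido%20%2377"));
        // prints http://www.motorstek.nl/gebruiker/fido%20%2377
    }
}
```

With this rule the %33 case from the issue description still collapses to 
3, which covers the duplicate-avoidance goal, while %2F, %23 and %26 keep 
their encoded form.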
                
> better url-normalizer basic
> ---------------------------
>
>                 Key: NUTCH-1098
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1098
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.3
>         Environment: Any
>            Reporter: Radim Kolar
>            Assignee: Markus Jelsma
>              Labels: encoding, url
>             Fix For: 1.4
>
>         Attachments: nutch.diff, patch-urlnormalizer.diff
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do 
> not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding 
> duplicates

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira