[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131635#comment-13131635
]
Markus Jelsma commented on NUTCH-1098:
--------------------------------------
I am happy with encoding spaces as %20, but I am still not sure that decoding
is free of problems, especially decoding %2F to / (slash). The two have
different meanings, because a literal slash delimits path segments. I went
through part of our CrawlDB searching for %2F occurrences and compiled a list.
I tried several URLs, alternating between %2F and /. At least a few of those
URLs return different content.
This is correct:
http://www.detelefoongids.nl/bg-l/18236834-B%2Fmak+Bedrijfsmakelaars/vermelding/
But this is incorrect:
http://www.detelefoongids.nl/bg-l/18236834-B/mak+Bedrijfsmakelaars/vermelding/
The same applies to decoding %23 to # (hash). Once this character is decoded,
valid URLs are treated as having a URL fragment, which is then normalized away.
See these examples:
http://www.noordhollandsdagblad.nl/nieuws/stadstreek/denhelder/article9516101.ece/%23Nieuwedweep%3A-Ambassadeur-van-Den-Helder-zijn-we-allemaal
http://www.motorstek.nl/gebruiker/fido%20%2377
In the above examples the encoding of the hash is intended and correct. The
path contains an encoded hash, but the URL has no fragment, whereas this one
does:
http://www.motorstek.nl/gebruiker/fido%20%2377#part-of-page
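The effect of decoding %23 can be checked with the JDK's own java.net.URI
parser. The sketch below is only an illustration, not Nutch code: the class
name FragmentDemo and its helper methods are made up for this example, and the
"decoded" URL is what a normalizer would hypothetically produce if it turned
%23 into a literal #.

```java
import java.net.URI;

final class FragmentDemo {
    /** Returns the fragment java.net.URI sees in the URL, or null if there is none. */
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    /** Returns the path exactly as written, with percent-escapes left intact. */
    static String rawPathOf(String url) {
        return URI.create(url).getRawPath();
    }

    public static void main(String[] args) {
        String encoded = "http://www.motorstek.nl/gebruiker/fido%20%2377";
        // Hypothetical output of a normalizer that decodes %23 into a literal '#'.
        String decoded = "http://www.motorstek.nl/gebruiker/fido%20#77";

        System.out.println(fragmentOf(encoded)); // null: the hash belongs to the path
        System.out.println(rawPathOf(encoded));  // /gebruiker/fido%20%2377
        System.out.println(fragmentOf(decoded)); // 77: the tail has become a fragment
        System.out.println(rawPathOf(decoded));  // /gebruiker/fido%20
    }
}
```

Once the fragment is stripped, the decoded form points at /gebruiker/fido%20,
which is a different resource from the one the original URL names.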
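Another example of bad decoding is:
http://st-annaland.citysite.nl/nieuws/6276724/2816_Rosenthal+ontvangt+oppositieleider+Syri%26%23235%3B.html.
Here %26%23235%3B is the percent-encoded form of the HTML entity &#235; so
decoding it puts a literal & and # into the URL.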
I propose not decoding characters that have special meaning in URIs, and being
very careful in the tests. As this is a crucial issue and the implementation
is not sound yet, I propose pushing this to 1.5. I am not comfortable
introducing a potentially big issue just before 1.4 is released.
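To make the proposal concrete, here is a minimal sketch of one way it could be
done. It is not the attached patch and not the existing basic normalizer; the
class name SelectiveUnescape and the method normalize are hypothetical. It
assumes the "safe to decode" set is the unreserved set from RFC 3986 section
2.3 (letters, digits, "-", ".", "_", "~"). Every other escape, including %2F,
%23, %26 and %3B, is kept and only has its hex digits upper-cased. Literal
spaces are encoded as %20.

```java
final class SelectiveUnescape {
    /** RFC 3986 section 2.3: ALPHA / DIGIT / "-" / "." / "_" / "~". */
    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    /** Value of one ASCII hex digit, or -1 if the character is not one. */
    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes only escapes of unreserved characters; encodes literal spaces. */
    static String normalize(String url) {
        StringBuilder out = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            char c = url.charAt(i);
            // A well-formed escape is '%' followed by two hex digits.
            int hi = (c == '%' && i + 2 < url.length()) ? hex(url.charAt(i + 1)) : -1;
            int lo = (hi >= 0) ? hex(url.charAt(i + 2)) : -1;
            if (lo >= 0) {
                int value = hi * 16 + lo;
                if (isUnreserved(value)) {
                    out.append((char) value);      // safe: meaning cannot change
                } else {
                    out.append('%')                // keep the escape, upper-case it
                       .append(Character.toUpperCase(url.charAt(i + 1)))
                       .append(Character.toUpperCase(url.charAt(i + 2)));
                }
                i += 3;
            } else if (c == ' ') {
                out.append("%20");
                i++;
            } else {
                out.append(c);                     // includes malformed '%' sequences
                i++;
            }
        }
        return out.toString();
    }
}
```

With this sketch all of the URLs listed above come out unchanged, while
something like /%7Euser/%41 is reduced to /~user/A. Malformed escapes are
passed through untouched, which is an arbitrary choice made for the example.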
> better url-normalizer basic
> ---------------------------
>
> Key: NUTCH-1098
> URL: https://issues.apache.org/jira/browse/NUTCH-1098
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.3
> Environment: Any
> Reporter: Radim Kolar
> Assignee: Markus Jelsma
> Labels: encoding, url
> Fix For: 1.4
>
> Attachments: nutch.diff, patch-urlnormalizer.diff
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features
> Encode space in URL into %20 to unbreak httpclient and possibly others who do
> not expect space inside URL
> Ability to decode %33 encoding in URL. This is important for avoiding
> duplicates