[ 
https://issues.apache.org/jira/browse/NUTCH-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237964#comment-14237964
 ] 

Markus Jelsma commented on NUTCH-1894:
--------------------------------------

Hi - I would recommend keeping this enabled by default. We have analyzed millions 
of hosts, and a good deal of them erroneously produce URLs with multiple 
sequential slashes that carry no semantic difference; they are simply 
duplicates. Thus far we have seen one website that actually uses them, for a 
terrible reason.

The web is garbage, and this is a sensible default to deal with one of the 
problems. You can always disable it :)
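
To make the trade-off concrete, here is a minimal sketch of what collapsing 
duplicate slashes does to a URL. It is not Nutch's actual normalizer code; the 
class name SlashCollapser, the method collapse, the regular expression, and the 
example.com URLs are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
public class SlashCollapser {
    // Two or more consecutive slashes that are not immediately preceded
    // by ':' (so the "//" in "http://" is left alone). Assumed pattern.
    private static final Pattern DUPLICATE_SLASHES =
        Pattern.compile("(?<!:)/{2,}");

    static String collapse(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        // Accidental duplicates: both forms name the same page.
        System.out.println(collapse("http://example.com/a//b///c.html"));
        // prints http://example.com/a/b/c.html

        // Meaningful empty segment: collapsing changes the URL's meaning.
        System.out.println(collapse("http://example.com/products//42"));
        // prints http://example.com/products/42
    }
}
```

The first call shows the common case, where collapsing removes duplicate URLs. 
The second shows the rare case raised in this issue, where the empty segment 
was intentional and the collapsed URL may no longer resolve.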

> Revert "Normalize duplicate slashes in URL's"
> ---------------------------------------------
>
>                 Key: NUTCH-1894
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1894
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Jigal van Hemert
>            Priority: Minor
>
> Duplicate slashes are allowed in URLs according to the RFCs, and quite a few 
> websites use them in a meaningful way. These websites have specific 
> information at certain positions in the URL, and an empty segment indicates 
> that there is no data for that information.
> Removing duplicate slashes makes the URL invalid in those cases, and the 
> target page can't be fetched for indexing.
> See issue NUTCH-1011



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
