[ 
https://issues.apache.org/jira/browse/NUTCH-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254204#comment-14254204
 ] 

Sebastian Nagel commented on NUTCH-1894:
----------------------------------------

I would also opt to keep this rule. Duplicates are the more common and serious 
problem for a crawler.

> Revert "Normalize duplicate slashes in URL's"
> ---------------------------------------------
>
>                 Key: NUTCH-1894
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1894
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Jigal van Hemert
>            Priority: Minor
>
> Duplicate slashes are allowed in URLs according to the RFCs, and quite a few 
> websites use them in a meaningful way. These websites encode specific 
> information at fixed positions in the URL path, and an empty segment indicates 
> that there is no data for that position.
> Removing duplicate slashes makes the URL invalid in those cases, and the 
> target page can't be fetched for indexing.
> See issue NUTCH-1011
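
To make the trade-off concrete, here is a minimal Python sketch of what
collapsing duplicate slashes does to a URL path. It is not Nutch's actual
normalizer rule; the helper names and the example.com URLs are assumptions
made up for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_duplicate_slashes(url):
    """Collapse runs of '/' in the path only (hypothetical helper, not Nutch code)."""
    parts = urlsplit(url)
    # Only the path is touched; the '//' after the scheme is not part of it.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def path_segments(url):
    """Return the path segments of a URL, keeping empty ones."""
    return urlsplit(url).path.split("/")[1:]


# Hypothetical site that treats an empty segment as "no value at this position".
positional = "http://example.com/catalog//shoes/42"
print(path_segments(positional))
print(path_segments(collapse_duplicate_slashes(positional)))

# Hypothetical site where the doubled slash is accidental and both forms
# serve the same page: collapsing maps them to one URL.
print(collapse_duplicate_slashes("http://example.com/a//b")
      == collapse_duplicate_slashes("http://example.com/a/b"))
```

In the first case the empty segment disappears and the remaining segments
shift by one position, which is the breakage described in this issue. In the
second case the two spellings become a single URL, which is the duplicate
problem the rule is meant to prevent.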



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
