[ https://issues.apache.org/jira/browse/NUTCH-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14254204#comment-14254204 ]
Sebastian Nagel commented on NUTCH-1894:
----------------------------------------
I would also opt to keep this rule. Duplicates are the more common and serious
problem for a crawler.
> Revert "Normalize duplicate slashes in URL's"
> ---------------------------------------------
>
> Key: NUTCH-1894
> URL: https://issues.apache.org/jira/browse/NUTCH-1894
> Project: Nutch
> Issue Type: Improvement
> Reporter: Jigal van Hemert
> Priority: Minor
>
> Duplicate slashes are allowed in URLs according to the RFCs and quite a few
> websites use them in a meaningful way. These websites have specific
> information at certain positions in the URL and an empty segment indicates
> that there is no data for that information.
> Removing duplicate slashes makes the URL invalid in those cases and the
> target page can't be fetched for indexing.
> See issue NUTCH-1011
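
To make the trade-off in the description concrete, here is a minimal sketch of
what collapsing duplicate slashes does to a URL whose empty path segment is
meaningful. It is not Nutch's actual normalizer rule; the function name
collapse_duplicate_slashes and the example.com URL are hypothetical and chosen
only for illustration.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_duplicate_slashes(url: str) -> str:
    """Collapse runs of '/' in the path component only (illustrative rule)."""
    parts = urlsplit(url)
    # Only the path is rewritten; the '//' after the scheme is untouched
    # because urlsplit keeps it out of the path component.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=path))


# Hypothetical site layout: /catalog/<category>/<year>/<item>,
# where an empty <category> segment means "no category".
original = "http://example.com/catalog//2014/item"
normalized = collapse_duplicate_slashes(original)

original_segments = urlsplit(original).path.split("/")
normalized_segments = urlsplit(normalized).path.split("/")

print(normalized)
print(original_segments)
print(normalized_segments)
```

After normalization the path has one segment fewer, so a server that reads the
segments by position would see "2014" where it expects the category. The same
rule also maps accidental variants such as "/a//b" and "/a/b" to a single URL,
which is the duplicate-reduction benefit argued for in the comment above.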