Jigal van Hemert created NUTCH-1894:
---------------------------------------
Summary: Revert "Normalize duplicate slashes in URL's"
Key: NUTCH-1894
URL: https://issues.apache.org/jira/browse/NUTCH-1894
Project: Nutch
Issue Type: Improvement
Reporter: Jigal van Hemert
Priority: Minor
Duplicate slashes are allowed in URLs according to the RFCs, and quite a few
websites use them in a meaningful way. These websites put specific information
at fixed positions in the URL path, and an empty segment indicates that there
is no data for that position.
Removing duplicate slashes shifts the remaining segments to the wrong positions
in those cases, so the URL no longer points to the target page and the page
can't be fetched for indexing.
See issue NUTCH-1011
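
To illustrate the problem, here is a minimal Python sketch. It is not Nutch's
actual normalizer code; the function names and the example.com URL are made up
for illustration, and the regex is only an assumption about what "normalize
duplicate slashes" amounts to.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_duplicate_slashes(url):
    """Hypothetical normalizer: collapse runs of '/' in the path only."""
    parts = urlsplit(url)
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def path_segments(url):
    """Return the path segments, keeping empty ones (drop the leading '')."""
    return urlsplit(url).path.split("/")[1:]


# Made-up URL where the third position is intentionally left empty.
original = "http://example.com/catalog/books//2014/index.html"
normalized = collapse_duplicate_slashes(original)

print(path_segments(original))
print(path_segments(normalized))
```

In the original URL "2014" is the fourth segment; after normalization it
becomes the third, so a site that reads values by position would see different
data or return an error.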
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)