Jigal van Hemert created NUTCH-1894:
---------------------------------------

             Summary: Revert "Normalize duplicate slashes in URL's"
                 Key: NUTCH-1894
                 URL: https://issues.apache.org/jira/browse/NUTCH-1894
             Project: Nutch
          Issue Type: Improvement
            Reporter: Jigal van Hemert
            Priority: Minor


Duplicate slashes are allowed in URLs according to the RFCs, and quite a few 
websites use them in a meaningful way. These websites put specific information 
at fixed positions in the URL path, and an empty segment indicates that there 
is no data for that position.
Removing duplicate slashes makes the URL invalid in those cases, so the target 
page can't be fetched for indexing.
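
To illustrate the problem, here is a minimal sketch. It is not Nutch's actual
normalizer code: the class name, the collapseSlashes helper and the example
path (category / brand / colour, with no brand given) are all hypothetical.
It only shows how collapsing duplicate slashes shifts positional segments.

```java
import java.util.Arrays;
import java.util.List;

public class EmptySegmentDemo {

    // Hypothetical normalization step: collapse every run of two or more
    // slashes in a URL path into a single slash.
    static String collapseSlashes(String path) {
        return path.replaceAll("/{2,}", "/");
    }

    // Split a path into positional segments, keeping empty ones.
    // The limit of -1 tells split() not to discard empty strings.
    static List<String> segments(String path) {
        return Arrays.asList(path.substring(1).split("/", -1));
    }

    public static void main(String[] args) {
        // Assumed layout: /catalog/<category>/<brand>/<colour>, brand left empty.
        String original = "/catalog/shoes//red";
        String collapsed = collapseSlashes(original);

        // The original has four segments, the third one empty.
        System.out.println(original + " -> " + segments(original));
        // After collapsing there are only three, so "red" moves from the
        // colour position to the brand position.
        System.out.println(collapsed + " -> " + segments(collapsed));
    }
}
```

A site that reads the segments by position would treat the collapsed path as a
different request, which is why the fetch fails.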

See issue NUTCH-1011



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
