Sebastian Nagel created NUTCH-1878:
--------------------------------------

             Summary: urlnormalizer-regex to keep third slash in file:///path/index.html
                 Key: NUTCH-1878
                 URL: https://issues.apache.org/jira/browse/NUTCH-1878
             Project: Nutch
          Issue Type: Sub-task
          Components: protocol
    Affects Versions: 2.2.1, 1.9
            Reporter: Sebastian Nagel
             Fix For: 2.3, 1.10


The rule
{code}
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
{code}
in {{regex-normalize.xml}} removes the third slash in 
{{file:///path/index.html}}. The lookbehind only protects the slash that 
directly follows the colon, so the second and third slashes are matched and 
collapsed into one. The resulting URL {{file://path/index.html}} fails to 
fetch because {{path}} is now interpreted as the host part of the URL, in the 
position that {{localhost}} occupies in {{file://localhost/path/index.html}}. 
See [Wikipedia|http://en.wikipedia.org/wiki/File_URI_scheme], [RFC 
1738|http://tools.ietf.org/html/rfc1738] (1994), and [RFC 
3986|http://tools.ietf.org/html/rfc3986] (2005).
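
The effect can be reproduced outside Nutch with a minimal sketch. It uses Python's {{re}} module as a stand-in for the Java regex engine that urlnormalizer-regex runs on; for this particular pattern the two behave the same. The helper name {{collapse_slashes}} is made up for illustration and is not part of Nutch.

```python
import re
from urllib.parse import urlsplit

# Same pattern as in regex-normalize.xml, with the XML entity &lt; unescaped.
DUPLICATE_SLASHES = re.compile(r"(?<!:)/{2,}")


def collapse_slashes(url: str) -> str:
    """Hypothetical helper: apply the 'removes duplicate slashes' rule."""
    return DUPLICATE_SLASHES.sub("/", url)


# For http URLs the rule works as intended: the '//' after the scheme
# survives, later runs of slashes are collapsed.
print(collapse_slashes("http://example.com//a///b"))
# -> http://example.com/a/b

# For file URLs with an empty host, the 2nd and 3rd slash are collapsed.
broken = collapse_slashes("file:///path/index.html")
print(broken)
# -> file://path/index.html

# The first path segment has now moved into the host position.
print(urlsplit("file:///path/index.html").netloc == "")  # -> True
print(urlsplit(broken).netloc)                            # -> path
print(urlsplit(broken).path)                              # -> /index.html
```

The last three lines show why the fetch fails: after normalization the parser sees {{path}} as the host and {{/index.html}} as the path.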

(split as sub-task from NUTCH-1483)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
