Sebastian Nagel created NUTCH-1878:
--------------------------------------
Summary: urlnormalizer-regex to keep third slash in file:///path/index.html
Key: NUTCH-1878
URL: https://issues.apache.org/jira/browse/NUTCH-1878
Project: Nutch
Issue Type: Sub-task
Components: protocol
Affects Versions: 2.2.1, 1.9
Reporter: Sebastian Nagel
Fix For: 2.3, 1.10
The rule
{code}
<!-- removes duplicate slashes -->
<regex>
<pattern>(?&lt;!:)/{2,}</pattern>
<substitution>/</substitution>
</regex>
{code}
in {{regex-normalize.xml}} removes the third slash in
{{file:///path/index.html}}. The resulting URL {{file://path/index.html}} fails
to fetch because {{path}} is now interpreted as the host part of the URL, i.e.
it takes the position that {{localhost}} occupies in
{{file://localhost/path/index.html}}, cf.
[Wikipedia|http://en.wikipedia.org/wiki/File_URI_scheme],
[RFC 1738|http://tools.ietf.org/html/rfc1738] (1994), and
[RFC 3986|http://tools.ietf.org/html/rfc3986] (2005).
(split as sub-task from NUTCH-1483)
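The effect of the rule can be reproduced outside Nutch with a few lines of Python. This is only a sketch: Nutch applies the rule through Java's regex engine, and Python's {{re}} module is assumed here to behave the same way for this particular pattern. The helper {{normalize}} and the pattern {{CANDIDATE}} are hypothetical names introduced for illustration; {{CANDIDATE}} is one possible way to spare the third slash after {{file:}}, not necessarily the fix that will be committed.

```python
import re

# The pattern as it appears in the rule quoted above: collapse runs of
# two or more slashes unless the run starts right after a colon.
CURRENT = r"(?<!:)/{2,}"

# Hypothetical alternative (an assumption, not the committed fix):
# additionally refuse to start a match right after "file:/", so the
# second and third slash of "file:///" are left alone.
CANDIDATE = r"(?<!:)(?<!file:/)/{2,}"


def normalize(url: str, pattern: str = CURRENT) -> str:
    """Replace every match of `pattern` in `url` with a single slash."""
    return re.sub(pattern, "/", url)


if __name__ == "__main__":
    for url in ("file:///path/index.html", "http://example.com//a///b"):
        print(url, "->", normalize(url), "|", normalize(url, CANDIDATE))
```

With the current pattern the match starts at the second slash of {{file:///}} (the first one is protected by the lookbehind on {{:}}), so the second and third slash are merged into one. The candidate pattern blocks a match at that position while still collapsing duplicate slashes later in the path.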