Sebastian Nagel created NUTCH-1485:
--------------------------------------
Summary: TableUtil reverseURL to keep userinfo part
Key: NUTCH-1485
URL: https://issues.apache.org/jira/browse/NUTCH-1485
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.1
Reporter: Sebastian Nagel
Priority: Minor
The reversed URL key does not contain the userinfo part of a URL (user name
and password, e.g. {{ftp://user:[email protected]/file.txt}}), cf. [RFC
3986|http://tools.ietf.org/html/rfc3986] and
[http://en.wikipedia.org/wiki/URI_scheme]. Keeping the userinfo would make it
easy to crawl a fixed list of protected content. However, URLs with userinfo
can be tricky, e.g.
[http://cnn.com&[email protected]/mostpopular], so it is fine
for the default to be to remove the userinfo. That removal should be done in
the default URL normalizers, though, not when the URL key is reversed.
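
To illustrate the proposal, here is a minimal sketch of a reversed key that can optionally keep the userinfo. It is not the actual TableUtil code: the class {{ReversedKeySketch}}, the {{keepUserInfo}} flag, and the key layout (reversed host, scheme, optional port, then {{@userinfo}}) are assumptions made for this example only, as are the sample host and credentials.

```java
import java.net.URI;

// Hypothetical sketch for illustration; not the Nutch TableUtil implementation.
class ReversedKeySketch {

    // Reverses the dot-separated host labels: "ftp.example.org" -> "org.example.ftp".
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder buf = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            buf.append(labels[i]);
            if (i > 0) {
                buf.append('.');
            }
        }
        return buf.toString();
    }

    // Builds a reversed key. The position of the userinfo inside the key is an
    // assumption: it is placed after host, scheme and port so that keys of the
    // same host still sort next to each other.
    static String reverseUrl(String url, boolean keepUserInfo) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        StringBuilder buf = new StringBuilder(reverseHost(host));
        buf.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            buf.append(':').append(uri.getPort());
        }
        String userInfo = uri.getRawUserInfo();
        if (keepUserInfo && userInfo != null) {
            buf.append('@').append(userInfo);
        }
        if (uri.getRawPath() != null) {
            buf.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            buf.append('?').append(uri.getRawQuery());
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String url = "ftp://alice:secret@ftp.example.org/file.txt";
        System.out.println(reverseUrl(url, true));
        System.out.println(reverseUrl(url, false));
    }
}
```

With {{keepUserInfo}} set to false the sketch drops the userinfo, which mirrors the behaviour described above. Under this proposal the decision to drop it would instead sit in a URL normalizer that runs before the key is built.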