[ https://issues.apache.org/jira/browse/NUTCH-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1485:
----------------------------------------

    Fix Version/s: 2.2
    
> TableUtil reverseURL to keep userinfo part
> ------------------------------------------
>
>                 Key: NUTCH-1485
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1485
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.1
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.2
>
>
> The reversed URL key does not contain the userinfo part of a URL (user name 
> and password, e.g. {{ftp://user:passw...@ftp.xyz/file.txt}}; cf. [RFC 
> 3986|http://tools.ietf.org/html/rfc3986] and 
> [http://en.wikipedia.org/wiki/URI_scheme]). Keeping the userinfo would make it 
> easy to crawl a fixed list of protected content. However, URLs with userinfo 
> can be misleading, e.g. 
> [http://cnn.com&story=breaking_news@199.239.136.200/mostpopular], where 
> everything before the "@" is userinfo and the actual host is 199.239.136.200. 
> So it is fine if the default is to remove the userinfo, but that removal 
> should be done in the default URL normalizers rather than when building the 
> reversed key.
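
To illustrate the request, here is a minimal sketch of a reversed key that can optionally keep the userinfo. It is not the actual TableUtil code: the class name ReverseUrlSketch, the keepUserInfo flag, the "@userinfo" placement inside the key, and the example password "secret" are all assumptions made for illustration. Only java.net.URI from the JDK is used.

```java
import java.net.URI;

// Hypothetical sketch, not Nutch's TableUtil. The key layout
// "reversed.host:scheme[:port][@userinfo]/path[?query]" is an assumption.
final class ReverseUrlSketch {

    // Reverses the dot-separated host labels, e.g. "ftp.xyz" -> "xyz.ftp".
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]);
            if (i > 0) {
                sb.append('.');
            }
        }
        return sb.toString();
    }

    // Builds the reversed key; userinfo is kept only when keepUserInfo is true.
    static String reverseUrl(String url, boolean keepUserInfo) {
        URI uri = URI.create(url); // throws IllegalArgumentException if malformed
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        StringBuilder key = new StringBuilder(reverseHost(host));
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        String userInfo = uri.getRawUserInfo();
        if (keepUserInfo && userInfo != null) {
            key.append('@').append(userInfo); // assumed placement of userinfo
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        String url = "ftp://user:secret@ftp.xyz/file.txt"; // made-up credentials
        System.out.println(reverseUrl(url, false)); // xyz.ftp:ftp/file.txt
        System.out.println(reverseUrl(url, true));  // xyz.ftp:ftp@user:secret/file.txt
    }
}
```

With keepUserInfo set to false, the sketch drops the userinfo as the current key does. A real change would also need the inverse operation (unreversing the key) to restore the userinfo, which is not shown here.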

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
