[ 
https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579438#action_12579438
 ] 

Andrzej Bialecki  commented on NUTCH-620:
-----------------------------------------

It would be interesting to see the source HTML that causes these links to 
appear. I think your point is valid: Nutch should collapse such adjacent 
slashes. Could you provide a patch to BasicURLNormalizer that implements this 
rule?

> BasicURLNormalizer should collapse runs of slashes with a single slash
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-620
>                 URL: https://issues.apache.org/jira/browse/NUTCH-620
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: JDK 1.6 update 5, Tomcat 6, Windows Server 2003, 
>            Reporter: Mark DeSpain
>            Priority: Minor
>             Fix For: 1.0.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The BasicURLNormalizer should collapse runs of slash characters '/' into a 
> single slash.  
> For example, each of the following URLs should be normalized to 
> http://lucene.apache.org/nutch/about.html:
> * http://lucene.apache.org/nutch//about.html 
> * http://lucene.apache.org//nutch/about.html 
> * http://lucene.apache.org/////nutch////about.html (an exaggerated example)
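
The rule requested above could be sketched roughly as follows. This is only an illustration, not the actual BasicURLNormalizer code or a patch against it: the class name SlashCollapser and the method collapseSlashes are hypothetical, and the sketch assumes that only the path component should be touched, leaving the "//" after the scheme, the query string, and the fragment as they are.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: collapses runs of '/' in the
// path component of a URL while leaving the other components untouched.
final class SlashCollapser {

    static String collapseSlashes(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            // Assumption: a URL that cannot be parsed is returned unchanged.
            return url;
        }

        // Opaque URIs (for example mailto:) have no path; leave them alone.
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            return url;
        }

        // Replace every run of two or more slashes with a single slash.
        String collapsed = path.replaceAll("/{2,}", "/");

        // Reassemble from the raw (still percent-encoded) components so that
        // nothing is encoded a second time.
        StringBuilder sb = new StringBuilder();
        if (uri.getScheme() != null) {
            sb.append(uri.getScheme()).append(':');
        }
        if (uri.getRawAuthority() != null) {
            sb.append("//").append(uri.getRawAuthority());
        }
        sb.append(collapsed);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "http://lucene.apache.org/nutch//about.html",
            "http://lucene.apache.org//nutch/about.html",
            "http://lucene.apache.org/////nutch////about.html"
        };
        for (String input : inputs) {
            System.out.println(collapseSlashes(input));
        }
    }
}
```

Restricting the replacement to the path is a design choice made for this sketch; whether slashes inside a query string should also be collapsed is not settled by the issue description.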

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
