[ 
https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579719#action_12579719
 ] 

Mark DeSpain commented on NUTCH-620:
------------------------------------

I did a quick grep of the HTML that was being crawled, and the site did indeed 
have two anchor tags of the form

<a href="../..//javadoc/myPackage/MyClass.html">

A grep of hadoop.log shows that those "parent" pages and everything reachable 
from them get visited quite a few times.  I'm guessing this is because, in my 
case, a cycle existed and the adjacent slashes were not being collapsed.  

I'm not sure what eventually stops the crawl, but I would assume it is capped 
by the maximum crawl depth.

 

> BasicURLNormalizer should collapse runs of slashes with a single slash
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-620
>                 URL: https://issues.apache.org/jira/browse/NUTCH-620
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: JDK 1.6 update 5, Tomcat 6, Windows Server 2003, 
>            Reporter: Mark DeSpain
>            Priority: Minor
>             Fix For: 1.0.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The BasicURLNormalizer should collapse runs of slash characters '/' into a 
> single slash.  
> For example, the following URLs should be normalized to 
> http://lucene.apache.org/nutch/about.html
> * http://lucene.apache.org/nutch//about.html 
> * http://lucene.apache.org//nutch/about.html 
> * http://lucene.apache.org/////nutch////about.html (an exaggerated example)
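A minimal sketch of the requested behaviour, for illustration only. The class 
SlashCollapser and its collapse method are hypothetical names and are not part 
of Nutch or of BasicURLNormalizer. It assumes absolute http-style URLs with a 
host, and it collapses slashes only in the path, since "//" inside a query 
string or fragment may be meaningful (for example an embedded URL).

```java
import java.net.URI;

// Hypothetical helper for illustration; this is not Nutch's BasicURLNormalizer.
final class SlashCollapser {

    // Collapse every run of two or more '/' in the path into a single '/'.
    // Scheme, authority, query and fragment are copied through unchanged.
    static String collapse(String url) {
        URI uri = URI.create(url);           // throws IllegalArgumentException on bad syntax
        String path = uri.getRawPath();
        if (path == null) {
            return url;                      // opaque URI (e.g. mailto:), nothing to do
        }
        StringBuilder sb = new StringBuilder();
        if (uri.getScheme() != null) {
            sb.append(uri.getScheme()).append(':');
        }
        if (uri.getRawAuthority() != null) {
            sb.append("//").append(uri.getRawAuthority());
        }
        sb.append(path.replaceAll("/{2,}", "/"));
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The three example URLs from the issue description.
        String[] examples = {
            "http://lucene.apache.org/nutch//about.html",
            "http://lucene.apache.org//nutch/about.html",
            "http://lucene.apache.org/////nutch////about.html"
        };
        for (String example : examples) {
            System.out.println(collapse(example));
        }
    }
}
```

With this sketch, all three example URLs above come out as 
http://lucene.apache.org/nutch/about.html, so a crawler that keys its list of 
visited pages on the normalized form would treat them as one page.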

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
