[ https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579719#action_12579719 ]
Mark DeSpain commented on NUTCH-620:
------------------------------------

I did a quick grep of the HTML that was being crawled, and the site did indeed have two anchor tags of the form

    <a href="../..//javadoc/myPackage/MyClass.html">

A grep of hadoop.log shows that those "parent" pages and everything reachable from them get visited quite a few times. I'm guessing this is because, in my case, a cycle existed and the adjacent slashes were not being collapsed. I'm not sure what eventually stops the crawl, but I would assume it is capped by the maximum crawl depth.

> BasicURLNormalizer should collapse runs of slashes with a single slash
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-620
>                 URL: https://issues.apache.org/jira/browse/NUTCH-620
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: JDK 1.6 update 5, Tomcat 6, Windows Server 2003,
>            Reporter: Mark DeSpain
>            Priority: Minor
>             Fix For: 1.0.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The BasicURLNormalizer should collapse runs of slash characters '/' with a
> single slash.
> For example, the following URLs should be normalized to
> http://lucene.apache.org/nutch/about.html
> * http://lucene.apache.org/nutch//about.html
> * http://lucene.apache.org//nutch/about.html
> * http://lucene.apache.org/////nutch////about.html (an exaggerated example)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
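
For illustration, the normalization requested in the issue could be sketched in Java roughly as follows. This is a minimal sketch and not the actual BasicURLNormalizer code: the class name SlashCollapseSketch and the method collapseSlashes are hypothetical, and it assumes that only the path portion should be collapsed, leaving the "//" after the scheme and any query string or fragment untouched.

```java
// Hypothetical sketch; not taken from the Nutch source tree.
final class SlashCollapseSketch {

    // Collapses runs of '/' in the path portion of a URL into a single '/'.
    // Assumption: the "//" after the scheme, the query string, and the
    // fragment are left as they are.
    static String collapseSlashes(String url) {
        int schemeEnd = url.indexOf("://");
        int searchFrom = (schemeEnd < 0) ? 0 : schemeEnd + 3;
        int pathStart = url.indexOf('/', searchFrom);
        if (pathStart < 0) {
            return url; // no path component, nothing to collapse
        }

        // The path ends at the first '?' or '#', or at the end of the string.
        int pathEnd = url.length();
        for (int i = pathStart; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '?' || c == '#') {
                pathEnd = i;
                break;
            }
        }

        String path = url.substring(pathStart, pathEnd).replaceAll("/{2,}", "/");
        return url.substring(0, pathStart) + path + url.substring(pathEnd);
    }

    public static void main(String[] args) {
        // The three example URLs from the issue description.
        String[] examples = {
            "http://lucene.apache.org/nutch//about.html",
            "http://lucene.apache.org//nutch/about.html",
            "http://lucene.apache.org/////nutch////about.html"
        };
        for (String example : examples) {
            System.out.println(collapseSlashes(example));
        }
    }
}
```

Under these assumptions, all three example URLs from the issue description come out as http://lucene.apache.org/nutch/about.html. Applying the same step after relative links are resolved would presumably also cover the "../..//javadoc/..." case mentioned in the comment, though that depends on where in the crawl pipeline normalization runs.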