Re: ran into a site that sends a crawl into an infinite loop

Doug Cutting Wed, 31 Aug 2005 11:32:21 -0700

Kamil Wnuk wrote:

In the process of a moderately sized crawl I was running, I hit a page
that sent nutch into an infinite fetch cycle. The page that I hit
contained relative links to itself with the syntax "/page.shtml".  So
once the initial page was fetched, each new generated fetchlist
contained the same url with another "/page.shtml" appended onto the
end.  This caused nutch to fetch urls such as
"http://www.website.com/page.shtml/page.shtml/page.shtml/page.shtml/",
a process which could go on indefinitely.


How can I prevent this from happening from the nutch end (I do not
have control of the site, and such a problem could always arise
elsewhere)?


This can be done with a regex filter, as follows:

# skip URLs with a slash-delimited segment that repeats 3+ times, to break loops

-.*(/.+?)/.*?\1/.*?\1/

This regex is in the default config files in the mapred branch, and will thus make its way into the trunk soon.
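
To see which URLs the rule above would reject, the pattern can be tried outside of nutch. The following is only a rough sketch in Python, not nutch code: the helper name is_loop_url is made up for illustration, and it assumes the filter looks for the pattern anywhere in the URL, which re.search approximates. The leading "-" is dropped because it is the filter's "exclude" marker rather than part of the regex.

```python
import re

# The pattern from the filter rule, without the leading "-"
# (assumed here to be the filter's marker for "exclude matching URLs").
LOOP_PATTERN = re.compile(r".*(/.+?)/.*?\1/.*?\1/")


def is_loop_url(url: str) -> bool:
    """Return True if the loop pattern is found in the URL.

    Hypothetical helper for experimentation only. It assumes "find"
    semantics (a match anywhere in the URL), approximated with re.search.
    """
    return LOOP_PATTERN.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://example.com/docs/index.html",
        "http://example.com/a/b/a/c/a/",
        "http://www.website.com/" + "page.shtml/" * 4,
        "http://www.website.com/" + "page.shtml/" * 5,
    ]
    for url in samples:
        print(is_loop_url(url), url)
```

One thing this kind of experiment shows: each repeat in the pattern consumes its trailing slash, so directly adjacent copies of a segment are only counted every other time. A chain of "/page.shtml" segments with a trailing slash is therefore rejected from the fifth copy onward rather than the third, which still stops the loop after a couple of extra fetches.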


Doug
