Kamil Wnuk wrote:
During a moderately sized crawl, I hit a page that sent Nutch into an
infinite fetch cycle. The page contained relative links to itself of
the form "/page.shtml", so once the initial page was fetched, each
newly generated fetchlist contained the same URL with another
"/page.shtml" appended to the end. This caused Nutch to fetch URLs
such as
"http://www.website.com/page.shtml/page.shtml/page.shtml/page.shtml/",
a process that could go on indefinitely.
How can I prevent this on the Nutch side? (I do not control the site,
and the same problem could always arise elsewhere.)
This can be done with a regex filter, as follows:
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
This regex is in the default config files in the mapred branch, and will
thus make its way into the trunk soon.
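
To see how the rule behaves on the URLs from the question, here is a
minimal Python sketch. It is not Nutch code: the helper name
is_rejected is hypothetical, and it assumes the rule applies whenever
the pattern is found anywhere in the URL.

```python
import re

# Pattern copied from the rule above, without the leading "-"
# (in the filter file, "-" marks the rule as an exclusion).
LOOP_PATTERN = re.compile(r".*(/.+?)/.*?\1/.*?\1/")

BASE = "http://www.website.com"


def is_rejected(url: str) -> bool:
    """Return True if the loop rule would exclude this URL.

    Hypothetical helper for illustration; assumes the rule fires when
    the pattern is found anywhere in the URL.
    """
    return LOOP_PATTERN.search(url) is not None


if __name__ == "__main__":
    # Print the verdict for 1 to 6 adjacent copies of the segment.
    for repeats in range(1, 7):
        url = BASE + "/page.shtml" * repeats + "/"
        print(repeats, is_rejected(url))
```

Each occurrence has to be followed by its own slash, so occurrences
cannot share a slash. Directly adjacent repeats like the ones in the
question are therefore rejected only once the segment appears five
times (with a trailing slash), while repeats separated by other
segments, such as "/a/b/a/b/a/", are rejected at the third occurrence.
Either way the loop is cut off after a few extra fetches.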
Doug
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general