Hi,

During a moderately sized crawl, I hit a page that sent Nutch into an infinite fetch cycle. The page contained relative links to itself of the form "/page.shtml". Once the initial page was fetched, each newly generated fetchlist contained the same URL with another "/page.shtml" appended to the end. As a result, Nutch fetched URLs such as "http://www.website.com/page.shtml/page.shtml/page.shtml/page.shtml/", a process that could go on indefinitely.
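To make the pattern concrete, here is a minimal sketch of one check that would catch URLs like the one above: reject any URL whose path repeats the same segment more than a set number of times. This is plain Python for illustration only, not Nutch code; the function name `has_repeated_segment` and the `max_repeats` threshold are hypothetical, and in Nutch such a rule would presumably have to be expressed through its URL filtering configuration instead.

```python
from urllib.parse import urlsplit


def has_repeated_segment(url: str, max_repeats: int = 2) -> bool:
    """Return True if any path segment occurs more than max_repeats times.

    Hypothetical helper: the threshold of 2 is an assumption, chosen so
    that ordinary paths such as /a/b/a/ are still accepted.
    """
    # Split the path on "/" and drop empty pieces from leading/trailing slashes.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    counts = {}
    for seg in segments:
        counts[seg] = counts.get(seg, 0) + 1
        if counts[seg] > max_repeats:
            return True
    return False


# The runaway URL from the crawl repeats "page.shtml" four times.
trap = "http://www.website.com/page.shtml/page.shtml/page.shtml/page.shtml/"
print(has_repeated_segment(trap))
print(has_repeated_segment("http://www.website.com/page.shtml"))
```

A check of this kind only bounds the loop; it may also reject the occasional legitimate URL that happens to repeat a segment, so the threshold would need tuning per crawl.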
How can I prevent this from the Nutch side? I do not have control of the site, and the same problem could always arise elsewhere. For anyone interested in reproducing the problem, I will send the page's URL on request, so that the server does not get bombarded by too many crawlers at once.

Thanks,
Kamil

Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
