You should be able to create a page yourself ... page.php <a href="page.php/">Page</a>
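
To see why a link like the one above loops, here is a minimal sketch of the URL resolution a crawler performs on each fetch. It uses Python's standard `urllib.parse.urljoin` rather than Nutch's own code, and the helper name `follow_self_link` and the host are made up for illustration.

```python
from urllib.parse import urljoin


def follow_self_link(start_url, href, hops):
    """Resolve the same relative href repeatedly, the way a crawler
    would when every fetched page contains that one link."""
    urls = [start_url]
    for _ in range(hops):
        # Each new URL becomes the base for resolving the next link.
        urls.append(urljoin(urls[-1], href))
    return urls


if __name__ == "__main__":
    # Hypothetical host; the href matches the page.php example above.
    for url in follow_self_link("http://www.website.com/page.php", "page.php/", 3):
        print(url)
```

Because the href ends in a slash, every resolved URL looks like a directory, so the next resolution appends another segment instead of replacing the last one. This only loops if the server keeps answering such URLs with the same page.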
For some content, a hash of the content could get rid of duplicates, but that seems to come at a later stage ... the only real solution I can think of to this (and similar) problems is to limit the depth of the crawl.

Nick

-----Original Message-----
From: Handl, Jorge [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 31, 2005 10:24 AM
To: [email protected]
Subject: RE: ran into a site that sends a crawl into an infinite loop

Kamil, please send me the site's URL; I would like to test my crawler for this condition.

Thanks!

> -----Original Message-----
> From: Kamil Wnuk [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 30, 2005 19:43
> To: [email protected]
> Subject: ran into a site that sends a crawl into an infinite loop
>
>
> Hi,
>
> In the process of a moderately sized crawl I was running, I hit a page
> that sent nutch into an infinite fetch cycle. The page that I hit
> contained relative links to itself with the syntax "/page.shtml". So
> once the initial page was fetched, each new generated fetchlist
> contained the same url with another "/page.shtml" appended onto the
> end. This caused nutch to fetch urls such as
> "http://www.website.com/page.shtml/page.shtml/page.shtml/page.shtml/";
> a process which could go on indefinitely.
>
> How can I prevent this from happening from the nutch end (I do not
> have control of the site, and such a problem could always arise
> elsewhere)?
>
> For anyone interested in duplicating this problem, I will send you the
> page's url upon request so that the server does not get bombarded by
> too many crawlers at once.
>
> Thanks,
> Kamil

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
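
For reference, the two ideas raised in this thread (limit the crawl depth, and hash page content to catch duplicates) could be sketched roughly as below. This is not Nutch code or configuration. The names `MAX_DEPTH`, `MAX_SEGMENT_REPEATS`, `should_fetch` and `ContentDeduper` are hypothetical, the limits are arbitrary, and the repeated-path-segment check is an extra assumption that nobody in the thread proposed.

```python
import hashlib
from urllib.parse import urlsplit

MAX_DEPTH = 5            # hypothetical cap on link hops from the seed URL
MAX_SEGMENT_REPEATS = 2  # hypothetical cap on identical path segments


def has_repeated_segments(url, limit=MAX_SEGMENT_REPEATS):
    """True if any single path segment occurs more than `limit` times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(segments.count(s) > limit for s in set(segments))


def should_fetch(url, depth):
    """Pre-fetch filter: reject URLs that are too deep or look like a loop."""
    return depth <= MAX_DEPTH and not has_repeated_segments(url)


class ContentDeduper:
    """Post-fetch filter: remembers a hash of every page body seen so far."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, body: bytes) -> bool:
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

As noted above, the content hash only helps after a page has already been fetched, so the depth limit (or a URL filter) is what actually stops the fetch cycle. The hash just keeps duplicate bodies from being stored or having their links followed again.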
