You should be able to create such a page yourself, e.g. page.php:

<a href="page.php/">Page</a>
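To see why a link like this grows without bound, here is a minimal sketch using Python's standard urljoin. The host www.example.com is only a placeholder, and it assumes the server returns the same page for page.php/ as it does for page.php:

```python
from urllib.parse import urljoin

# Placeholder start URL (assumption, not the real site).
url = "http://www.example.com/page.php"

# Each fetched page contains the same relative link "page.php/".
# Resolving it against the current URL appends one more path segment.
chain = []
for _ in range(3):
    url = urljoin(url, "page.php/")
    chain.append(url)

for u in chain:
    print(u)
```

Every resolved URL is new to the crawler, so a plain "have I seen this URL" check never stops it.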

For some content, a hash of the content could get rid of duplicates,
but that seems to come at a later stage, after the page has already
been fetched. The only real solution I can think of for this (and
similar) problems is to limit the depth of the crawl.
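A rough sketch of both ideas, not Nutch code: crawl(), fetch() and looping_site() are hypothetical names, the host is a placeholder, and MD5 is used only as a cheap content fingerprint. It shows that the hash only spots a duplicate after the fetch, while the depth limit is what actually ends the crawl:

```python
import hashlib
from collections import deque
from urllib.parse import urljoin


def crawl(seed, fetch, max_depth):
    """Breadth-first crawl that stops following links at max_depth.

    fetch(url) is a hypothetical callable returning (body_bytes, hrefs).
    Returns (fetched_urls, urls_with_previously_unseen_content).
    """
    seen_urls = {seed}
    seen_digests = set()
    fetched, unique = [], []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        body, hrefs = fetch(url)          # the fetch has already happened here
        fetched.append(url)
        digest = hashlib.md5(body).hexdigest()
        if digest not in seen_digests:    # duplicates are only detected now
            seen_digests.add(digest)
            unique.append(url)
        if depth >= max_depth:            # depth limit: do not expand further
            continue
        for href in hrefs:
            link = urljoin(url, href)
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append((link, depth + 1))
    return fetched, unique


def looping_site(url):
    """Stand-in for the problem site: every URL serves the same page."""
    return b'<a href="page.php/">Page</a>', ["page.php/"]


fetched, unique = crawl("http://www.example.com/page.php", looping_site, max_depth=3)
print(len(fetched), "pages fetched,", len(unique), "with unique content")
```

Without the max_depth check the queue would never empty, even though every page after the first is a content duplicate.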

Nick


-----Original Message-----
From: Handl, Jorge [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 31, 2005 10:24 AM
To: [email protected]
Subject: RE: ran into a site that sends a crawl into an infinite loop


Kamil, please send me the site's URL; I would like to test my crawler for
this condition.
Thanks!

> -----Original Message-----
> From: Kamil Wnuk [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 30, 2005 19:43
> To: [email protected]
> Subject: ran into a site that sends a crawl into an infinite loop
> 
> 
> Hi,
> 
> In the process of a moderately sized crawl I was running, I hit a page
> that sent nutch into an infinite fetch cycle. The page that I hit
> contained relative links to itself with the syntax "/page.shtml".  So
> once the initial page was fetched, each new generated fetchlist
> contained the same url with another "/page.shtml" appended onto the
> end.  This caused nutch to fetch urls such as
> "http://www.website.com/page.shtml/page.shtml/page.shtml/page.shtml/";
> a process which could go on indefinitely.
> 
> How can I prevent this from happening from the nutch end (I do not
> have control of the site, and such a problem could always arise
> elsewhere)?
> 
> For anyone interested in duplicating this problem, I will send you the
> page's url upon request so that the server does not get bombarded by
> too many crawlers at once.
> 
> Thanks,
> Kamil
> 



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
