I will try it. But is there no other way to detect a crawler loop? How can we discover similar cases across billions of pages?
Thanks, Massimo
Doug Cutting wrote:
Massimo Miccoli wrote:
The general problem is URLs like: http://www.agriturismo.pg.it/storia-citta-umbria/index.html
a custom not-found page that generates an infinite crawler loop on the site.
You're referring to error pages that do not return 404?
In another thread I just suggested a way to handle these:
http://www.mail-archive.com/nutch-user%40incubator.apache.org/msg00286.html
The URL you mention is amenable to this solution. Its title contains the string "pagina di errore", but it does not return a 404.
Doug
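
The title check described above could be sketched roughly as follows in Python. This is only an illustration and not Nutch code: the names `extract_title`, `is_soft_404`, and `ERROR_TITLE_PHRASES` are hypothetical, and the only phrase taken from this thread is "pagina di errore". A real crawl would presumably need a longer, site-specific phrase list.

```python
from html.parser import HTMLParser

# Assumption: a hand-maintained list of phrases that mark error pages.
# Only "pagina di errore" comes from the page discussed in this thread.
ERROR_TITLE_PHRASES = ("pagina di errore",)


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title with runs of whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title.split())


def is_soft_404(status, html, phrases=ERROR_TITLE_PHRASES):
    """True when the server reports 200 but the title looks like an error page."""
    if status != 200:
        return False  # non-200 responses are handled by normal status checks
    title = extract_title(html).lower()
    return any(phrase in title for phrase in phrases)
```

A fetcher could call `is_soft_404(status, body)` before extracting outlinks and treat a match like a real 404, so the links on the error page are never followed.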
-------------------------------------------------------
This SF.Net email is sponsored by: New Crystal Reports XI.
Version 11 adds new functionality designed to reduce time involved in
creating, integrating, and deploying reporting solutions. Free runtime info,
new features, or free trial, at: http://www.businessobjects.com/devxi/728
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
