Thanks Doug,
I will try. But is there no other way to detect a crawler loop? How can we discover similar cases among billions of pages?


Thanks
Massimo

Doug Cutting wrote:

Massimo Miccoli wrote:

The general problem is URLs like http://www.agriturismo.pg.it/storia-citta-umbria/index.html:
custom "not found" pages that generate an infinite crawler loop on the site.


You're referring to error pages that do not return 404?

In another thread I just suggested a way to handle these:

http://www.mail-archive.com/nutch-user%40incubator.apache.org/msg00286.html


The URL you mention is amenable to this solution. Its title contains the string "pagina di errore", but it does not return a 404.
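
As a rough illustration of the title check described above, here is a minimal Python sketch. It is not Nutch code and not necessarily what the linked thread proposes: the names `extract_title` and `looks_like_soft_404` and the `ERROR_TITLE_PHRASES` list are hypothetical, and only the phrase "pagina di errore" comes from this thread.

```python
from html.parser import HTMLParser

# Hypothetical phrase list; only "pagina di errore" is taken from this thread.
ERROR_TITLE_PHRASES = ("pagina di errore",)


class _TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the page title with runs of whitespace collapsed."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.title_parts).split())


def looks_like_soft_404(status_code, html, phrases=ERROR_TITLE_PHRASES):
    """True if the page did not return 404 but its title looks like an error page."""
    if status_code == 404:
        return False  # a real 404 is already handled by the fetcher
    title = extract_title(html).lower()
    return any(phrase in title for phrase in phrases)
```

A crawler could call `looks_like_soft_404(status, body)` after each fetch and skip the page's outlinks when it returns true, which would break the loop. The phrase list would have to be maintained per language or per site, so this only covers error pages whose titles are known in advance.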


Doug


-------------------------------------------------------
This SF.Net email is sponsored by: New Crystal Reports XI.
Version 11 adds new functionality designed to reduce time involved in
creating, integrating, and deploying reporting solutions. Free runtime info,
new features, or free trial, at: http://www.businessobjects.com/devxi/728
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers



