Massimo Miccoli wrote:
The general problem is URLs like http://www.agriturismo.pg.it/storia-citta-umbria/index.html:
custom "not found" pages that generate an infinite crawler loop on the site.

You're referring to error pages that do not return 404?

In another thread I just suggested a way to handle these:

http://www.mail-archive.com/nutch-user%40incubator.apache.org/msg00286.html

The URL you mention is amenable to this solution. Its title contains the string "pagina di errore", but it does not return a 404.
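
As a rough illustration of that kind of check (a standalone Python sketch, not Nutch code and not copied from the linked thread), one could flag a page as a "soft" error when the server does not return 404 but the title contains a known error phrase. The names `ERROR_TITLE_PHRASES`, `extract_title` and `is_soft_404` are hypothetical, and the phrase list is an assumption based only on the one page above:

```python
from html.parser import HTMLParser

# Hypothetical phrase list; "pagina di errore" is the only phrase known from this page.
ERROR_TITLE_PHRASES = ("pagina di errore",)


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the stripped title text, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def is_soft_404(status, html, phrases=ERROR_TITLE_PHRASES):
    """True when a non-404 response looks like an error page by its title."""
    if status == 404:
        return False  # a real 404 needs no special handling
    title = extract_title(html).lower()
    return any(phrase in title for phrase in phrases)
```

A crawler could then skip extracting outlinks from any page for which `is_soft_404(status, html)` is true, which is what would break the loop. Matching on title text is fragile and site-specific, so the phrase list would need tuning per site.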

Doug
