Hi fellow Nutch users. Long time crawler, first time poster. :-)
We're 23m pages into a 100m page crawl, and our preliminary tests show that a lot of pages contain our agent name, description, etc., in their page content. That is, when a site runs a script that echoes the HTTP request headers back (typically to show browser information), the Nutch crawler ends up storing its own header information as part of that page's content. So when we search our index for "Isara" (our agent name) we get thousands of results, and they all contain "Isara/Isara-1.0 (A non-profit search engine benefiting charity.; http://www.isara.org; e-m...@removed.org", which comes from the agent properties set in our nutch-default.xml file: http.agent.name, http.agent.description, http.agent.url, http.agent.email, and http.agent.version.
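To make the problem concrete, here is a rough sketch of how the string appears to be assembled and how an echoed copy ends up matching a search for our agent name. This is not Nutch's actual code: the class AgentString and both methods are made up for illustration, the layout is inferred only from the string we see in our results, and the closing parenthesis is an assumption because the string above is cut off.

```java
import java.util.Locale;

// Illustrative only: hypothetical helper, not part of Nutch.
final class AgentString {

    // Joins the five http.agent.* values in the layout we observe in our index:
    // name/version (description; url; email)
    // The trailing ")" is assumed; the observed string is truncated.
    static String build(String name, String version, String description,
                        String url, String email) {
        return name + "/" + version + " (" + description + "; " + url + "; " + email + ")";
    }

    // True if a page's text includes the agent name, ignoring case.
    static boolean mentionsAgent(String pageText, String agentName) {
        return pageText.toLowerCase(Locale.ROOT)
                       .contains(agentName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String agent = build("Isara", "Isara-1.0",
                "A non-profit search engine benefiting charity.",
                "http://www.isara.org", "e-m...@removed.org");

        // A header-echo script would put the agent string into the page body.
        String echoedPage = "Your browser: " + agent;

        System.out.println(agent);
        System.out.println(mentionsAgent(echoedPage, "Isara"));
    }
}
```

Any page that echoes the header this way would match a query for "Isara", which is consistent with the thousands of results we are seeing.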
I've searched around and haven't found any information on how to stop this from happening. Is there a solution and, if so, will we need to recrawl all those pages, or can we filter the current database? Any suggestions would be greatly appreciated.
Thank you for developing such an important open-source application,

Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org