There isn't really a way to stop this from happening, except to change
the agent name in the Nutch configuration. When an HTTP request is
made, the agent name is sent in the User-Agent header. As you say, many
pages simply list the user agents that have hit their sites, or run a
script that spits the user agent back when a crawler is detected.
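
To illustrate, here is a rough sketch (not Nutch's actual code) of how a
User-Agent string in the "name/version (description; url; email)" form
you quote could be put together from the five http.agent.* values, and
how a page that echoes it back would then match a search for the agent
name. The class, the method names and the ExampleBot values are all made
up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; none of these names come from Nutch itself.
class AgentEchoSketch {

    // Assemble "name/version (description; url; email)", skipping empty parts.
    static String buildAgent(String name, String version,
                             String description, String url, String email) {
        StringBuilder sb = new StringBuilder(name);
        if (version != null && !version.isEmpty()) {
            sb.append('/').append(version);
        }
        List<String> details = new ArrayList<>();
        for (String part : new String[] {description, url, email}) {
            if (part != null && !part.isEmpty()) {
                details.add(part);
            }
        }
        if (!details.isEmpty()) {
            sb.append(" (").append(String.join("; ", details)).append(')');
        }
        return sb.toString();
    }

    // True if the page text contains the agent name (case-insensitive),
    // which is what happens when a site echoes the request headers.
    static boolean echoesAgent(String pageText, String agentName) {
        return pageText.toLowerCase(Locale.ROOT)
                       .contains(agentName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Made-up values standing in for the http.agent.* settings.
        String agent = buildAgent("ExampleBot", "1.0", "A test crawler",
                                  "http://example.org", "bot@example.org");
        String echoPage = "Your browser: " + agent;
        System.out.println(agent);
        System.out.println(echoesAgent(echoPage, "ExampleBot"));
        System.out.println(echoesAgent("An ordinary page.", "ExampleBot"));
    }
}
```

Any page that prints the request headers will contain whatever name the
crawler sends, so the check above matches for such pages no matter what
the rest of the content is.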
Dennis
Kirk Gillock wrote:
Hi fellow Nutch users.
Long time crawler, first time poster. :-)
We're 23m pages into a 100m page crawl and our preliminary tests have
shown that a lot of pages contain our agent name, description, etc., in
their page content. That is, sites with a script that shows HTTP
headers (typically to show browser information) cause the Nutch crawler
to store its own header information within the content of that page. So
when we search our index for "Isara" (our agent name) we get thousands
of results and they all have "Isara/Isara-1.0 (A non-profit search
engine benefiting charity.; http://www.isara.org; e-m...@removed.org",
which is made up of these values from our nutch-default.xml file:
http.agent.name, http.agent.description, http.agent.url,
http.agent.email, and http.agent.version.
I've searched around and haven't found any information on how to stop
this from happening. Is there a solution and, if so, will it mean we
need to recrawl all those pages, or can we filter the current
database? Any suggestions would be greatly appreciated.
Thank you for developing such an important open-source application,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org