Re: HTTP Header problem

Kirk Gillock Sat, 05 Dec 2009 09:55:32 -0800

Thank you for the quick reply, Dennis. It was worth a shot. :-)

People are not typically searching for our own name on our own site but, in case it did happen, we wanted to have the results be as clean as possible. For our next crawls we'll change the agent name and version to something else.


Thanks again,
Kirk


Dennis Kubes wrote:

There isn't really a way to stop this from happening except to change the agent name in the Nutch configuration. When an HTTP request is made, the agent name is sent as a header. As you say, there are many pages that simply have logs of the different user agents hitting their sites, or that have a script to spit back the user agent when a crawler is detected.
Dennis
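
For illustration, the configuration change described above might look roughly like the fragment below. This is only a sketch: it assumes the overrides go in conf/nutch-site.xml (which takes precedence over nutch-default.xml), and the values "ExampleCrawler" and "2.0" are made-up placeholders, not anything from this thread. The property names are the ones listed in the original question.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- hypothetical placeholder; pick a name unlikely to match real queries -->
    <value>ExampleCrawler</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <!-- hypothetical placeholder -->
    <value>2.0</value>
  </property>
</configuration>
```

Since the agent string is sent at fetch time, a change like this would presumably affect only pages fetched afterwards; pages already stored with the old string echoed in their content would stay as they are.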

Kirk Gillock wrote:
Hi fellow Nutch users.

Long time crawler, first time poster. :-)
We're 23m pages into a 100m page crawl and our preliminary tests have shown that a lot of pages contain our agent name, description, etc., in their page content. That is, sites that have a script which shows HTTP headers (typically to show browser information) cause the Nutch crawler to store its own header information within the content of that page. So when we search our index for "Isara" (our agent name) we get thousands of results and they all have "Isara/Isara-1.0 (A non-profit search engine benefiting charity.; http://www.isara.org; e-m...@removed.org", which is the content of our nutch-default.xml file: http.agent.name, http.agent.description, http.agent.url, http.agent.email, and http.agent.version.
I've searched around and haven't found any information on how to stop this from happening. Is there a solution and, if so, will it mean we need to recrawl all those pages again or can we filter the current database? Any suggestions would be greatly appreciated.
Thank you for developing such an important open-source application,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org