tsmori wrote:
This is strange. I manage the web servers for a large university library. Our
site has a staff directory where each user has a page of information. The
URLs take the form:
http://mydomain.edu/staff/userid
I've added the staff URL to the urls seed file, but even with the crawl depth
set to 8 and no limit on files (i.e. no topN setting), the crawl still fetches
only about 50% of the pages in this area of the site.
What should I look for to find out why this is happening?
* Check that the pages are not forbidden by robots rules, which may be
embedded in HTML meta tags on the pages themselves or in the top-level
robots.txt.
* Check that your crawldb actually contains entries for these pages -
perhaps they are being filtered out.
* Check your segments to see whether these URLs were scheduled for
fetching, and if so, what the fetch status was.
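For the second and third checks, the Nutch command-line tools can inspect the
crawldb and segments directly. A sketch, assuming the crawl data lives under
crawl/ and a segment named by its timestamp (both paths here are illustrative):

```shell
# Overall crawldb statistics: how many URLs are known, fetched, unfetched.
bin/nutch readdb crawl/crawldb -stats

# Status of one specific missing page (substitute a real staff URL).
bin/nutch readdb crawl/crawldb -url http://mydomain.edu/staff/userid

# List the segments with their generated/fetched/parsed counts.
bin/nutch readseg -list -dir crawl/segments

# Dump one segment to a text directory to inspect per-URL fetch status.
bin/nutch readseg -dump crawl/segments/20060101120000 seg_dump
```

If readdb shows no entry for a missing URL, it was filtered or never
discovered; if the entry exists but the segments show no fetch attempt, look
at the generator settings; if a fetch was attempted, the dumped status should
say why it failed.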
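For the first check, you can test the robots.txt rules locally before blaming
the crawler. A minimal sketch using Python's standard robotparser (the rules
and the /staff/private/ path below are made up for illustration; substitute the
actual contents of your site's robots.txt):

```python
from urllib import robotparser

# Parse a robots.txt body directly, without fetching it over the network.
# Stand-in rules for http://mydomain.edu/robots.txt - replace with the real file.
rules = """\
User-agent: *
Disallow: /staff/private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Would a crawler identifying itself as "Nutch" be allowed to fetch these?
print(rp.can_fetch("Nutch", "http://mydomain.edu/staff/userid"))      # True
print(rp.can_fetch("Nutch", "http://mydomain.edu/staff/private/x"))   # False
```

Note that this only covers robots.txt; a <meta name="robots" content="noindex,
nofollow"> tag inside the pages themselves would not show up here and has to be
checked in the HTML source.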
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com