tsmori wrote:
This is strange. I manage the webservers for a large university library. On
our site we have a staff directory where each staff member has a page for
their information. The URLs take the form:

http://mydomain.edu/staff/userid

I've added the staff URL to the urls seed file. But even with the crawl set
to a depth of 8 and unlimited files, i.e. no topN setting, it still seems to
fetch only about 50% of the pages in this area of the site.
What should I look for to find out why this is happening?



* Check that the pages are not forbidden by robots rules, which may be embedded in the HTML meta tags of the pages themselves or in the site's top-level robots.txt.

* Check that your crawldb actually contains entries for these pages; perhaps they are being filtered out by your URL filters.

* Check your segments to see whether these URLs were scheduled for fetching, and if so, what the fetch status was. (Example commands for all three checks follow below.)
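
For example, assuming Nutch 0.8-style commands and a crawl directory named
"crawl" (the userid URL, the segment name, and the output directory below are
placeholders; adjust them to your setup):

  # Any site-wide robots rules covering /staff/?
  curl http://mydomain.edu/robots.txt

  # Any per-page robots meta tags (noindex/nofollow)?
  curl -s http://mydomain.edu/staff/userid | grep -i robots

  # Is the URL in the crawldb, and with what status?
  bin/nutch readdb crawl/crawldb -url http://mydomain.edu/staff/userid

  # Were the URLs scheduled and fetched? List the segments, then dump one:
  bin/nutch readseg -list -dir crawl/segments
  bin/nutch readseg -dump crawl/segments/<segment> seg-dump

If entries are missing from the crawldb, also look at the patterns in
conf/crawl-urlfilter.txt (or conf/regex-urlfilter.txt); an overly strict
regex is a common reason for URLs silently disappearing.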


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
