tsmori wrote:
This is strange. I manage the web servers for a large university library. Our
site has a staff directory where each user has a page of information. The
URLs take the form:
http://mydomain.edu/staff/userid
I've added the staff URL to the urls seed file, but even with the crawl depth
set to 8 and no limit on files (i.e. no topN setting), the crawl still fetches
only about 50% of the pages in this area of the site.
What should I look for to find out why this is happening?
* Check that the pages are not forbidden by robots rules, which may be
embedded in HTML meta tags on the pages themselves or in the top-level
robots.txt.
* Check that your crawldb actually contains entries for these pages -
perhaps they are being filtered out.
* Check your segments to see whether these URLs were scheduled for
fetching, and if so, what the fetch status was.
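For the second and third checks, the Nutch command-line tools can inspect the
crawldb and segments directly. A sketch, assuming the crawl data lives under
crawl/ and a segment named by its timestamp (both paths here are illustrative):

```shell
# Overall crawldb statistics: how many URLs are known, fetched, unfetched.
bin/nutch readdb crawl/crawldb -stats

# Status of one specific missing page (substitute a real staff URL).
bin/nutch readdb crawl/crawldb -url http://mydomain.edu/staff/userid

# List the segments with their generated/fetched/parsed counts.
bin/nutch readseg -list -dir crawl/segments

# Dump one segment to a text directory to inspect per-URL fetch status.
bin/nutch readseg -dump crawl/segments/20060101120000 seg_dump
```

If readdb shows no entry for a missing URL, it was filtered or never
discovered; if the entry exists but the segments show no fetch attempt, look
at the generator settings; if a fetch was attempted, the dumped status should
say why it failed.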
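For the first check, you can test the robots.txt rules locally before blaming
the crawler. A minimal sketch using Python's standard robotparser (the rules
and the /staff/private/ path below are made up for illustration; substitute the
actual contents of your site's robots.txt):

```python
from urllib import robotparser

# Parse a robots.txt body directly, without fetching it over the network.
# Stand-in rules for http://mydomain.edu/robots.txt - replace with the real file.
rules = """\
User-agent: *
Disallow: /staff/private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Would a crawler identifying itself as "Nutch" be allowed to fetch these?
print(rp.can_fetch("Nutch", "http://mydomain.edu/staff/userid"))      # True
print(rp.can_fetch("Nutch", "http://mydomain.edu/staff/private/x"))   # False
```

Note that this only covers robots.txt; a <meta name="robots" content="noindex,
nofollow"> tag inside the pages themselves would not show up here and has to be
checked in the HTML source.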
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com