Karen Church wrote:

I investigated the crawler output in more detail and discovered that, for over 90% of the pages I crawl that have outlinks one day but not the next (even though their content has not changed), I can account for the outlinks somewhere else in that day's crawl: they either appear as the outlinks of another page or as the URL of a page itself. So it looks like they aren't fetched because they have already been fetched that day.

However, I'm still having trouble understanding what happened to the other 10%. I checked a few of the outlinks by hand, and some could not be crawled due to HTTP errors, but can someone please explain why the rest of the outlinks aren't stored? Are there some standard things I can check for? Is this normal behavior? At the moment I'm only looking for these outlinks in the resulting crawl segment - should I be looking somewhere else?

I'd really, really appreciate some help with this.


Hello Karen,

Outlinks should be stored in the segment, so that's the right place to look for them.

One common source of missing outlinks is hitting the maximum-outlinks-per-page limit - but this is set to 100 by default. Another common issue is the content parser catching an exception: in that case you get a positive status for the fetch but an error in parsing, and hence no outlinks. Could you use the "segread" command on these records and check the status for both the fetch and the parsing stages?
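For reference, here is a rough sketch of how you might do that from the command line. The segment path and URL below are placeholders, and the exact command name and flags differ between Nutch versions (older releases use "segread", newer ones "readseg"), so run the command with no arguments first to see the usage string for your installation:

```shell
# List the records in a segment (placeholder segment path):
bin/nutch segread -list crawl/segments/20060101120000

# Dump the full records, including fetch status, parse status, and the
# ParseData that holds the outlinks, then search the dump for one of the
# missing URLs (placeholder URL):
bin/nutch segread -dump crawl/segments/20060101120000 \
  | grep -A 5 'http://example.com/missing-page'
```

A record with a successful fetch status but a failed parse status would explain the missing outlinks.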

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

