Karen Church wrote:
I investigated the crawler output in more detail and discovered that
for over 90% of the pages I crawl that have outlinks one day but not
the next (even though their content has not changed), I can account
for those outlinks somewhere else in that day's crawl: they either
appear as the outlinks of another page or as the URL of a fetched
page. So it looks like they aren't fetched again because they have
already been fetched that day.
However, I'm still having trouble understanding what happened to the
other 10%. I checked a few of the outlinks by hand and some could not
be crawled due to HTTP errors, but can someone please explain why the
rest of the outlinks aren't stored? Are there some standard things I
can check for? Is this normal behavior? At the moment I'm only looking
in the resulting crawl segment for these outlinks - should I be
looking somewhere else?
I'd really, really appreciate some help with this.
Hello Karen,
Outlinks should be stored in the segment, so that's the right place to
look for them.
One common source of missing outlinks is hitting the limit on the
maximum number of outlinks per page - but this is set to 100 by
default. Another common issue is the content parser catching an
exception: in that case you get a positive status for the fetch but an
error for parsing, and hence no outlinks.
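For reference, the per-page outlink limit is usually controlled by a
property along the lines of the following in your configuration file
(the exact property name may differ between Nutch versions, so check
the nutch-default.xml shipped with your release):

```xml
<!-- Sketch of a nutch-site.xml override; property name assumed
     from nutch-default.xml, verify against your Nutch version. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>Maximum number of outlinks kept per page;
  outlinks beyond this count are silently dropped.</description>
</property>
```

If a page has more outlinks than this limit, the extra ones are
dropped silently, which would look exactly like "missing" outlinks in
the segment.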
Could you use the "segread" command on these two records, and check the
status both for the fetch and the parsing stages?
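As a sketch, inspecting a segment from the command line looks roughly
like this (flag names vary between Nutch versions - older releases use
"segread", newer ones "readseg" - so check "bin/nutch" with no
arguments for the exact usage on your install):

```shell
# Dump the contents of a segment, including fetch and parse status,
# so you can see whether parsing failed for the records in question.
# Command and flag names are assumptions; verify for your version.
bin/nutch segread -dump crawl/segments/20060101000000

# Newer Nutch versions use readseg with an explicit output directory:
# bin/nutch readseg -dump crawl/segments/20060101000000 dump_out
```

In the dumped output, look at each record's fetch status and parse
status: a successful fetch paired with a parse error explains a record
with no stored outlinks.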
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com