Karen Church wrote:
I investigated the crawler output in more detail and discovered that
for over 90% of the pages I crawl that have outlinks one day but not
the next (even though their content has not changed), I can account
for those outlinks somewhere else in that day's crawl: they either
appear as the outlinks of another page or as the URL of a fetched
page. So it looks like they aren't fetched again because they have
already been fetched that day.
However, I'm still having trouble understanding what happened to the
other 10%. I checked a few of the outlinks by hand and some could not
be crawled due to HTTP errors, but can someone please explain why the
rest of the outlinks aren't stored? Are there some standard things I
can check for? Is this normal behavior? At the moment I'm only looking
in the resulting crawl segment for these outlinks - should I be
looking somewhere else?
I'd really, really appreciate some help with this.
Hello Karen,
Outlinks should be stored in the segment, so that's the right place to
look for them.
One common source of missing outlinks is hitting the limit on the
maximum number of outlinks per page - but this is set to 100 by
default. Another common issue is the content parser catching an
exception: in that case you get a positive status for the fetch but an
error for parsing, and hence no outlinks.
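For reference, the per-page outlink limit is usually controlled by a
property along the lines of the following in your configuration file
(the exact property name may differ between Nutch versions, so check
the nutch-default.xml shipped with your release):

```xml
<!-- Sketch of a nutch-site.xml override; property name assumed
     from nutch-default.xml, verify against your Nutch version. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>Maximum number of outlinks kept per page;
  outlinks beyond this count are silently dropped.</description>
</property>
```

If a page has more outlinks than this limit, the extra ones are
dropped silently, which would look exactly like "missing" outlinks in
the segment.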
Could you use the "segread" command on these two records, and check the
status both for the fetch and the parsing stages?
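As a sketch, inspecting a segment from the command line looks roughly
like this (flag names vary between Nutch versions - older releases use
"segread", newer ones "readseg" - so check "bin/nutch" with no
arguments for the exact usage on your install):

```shell
# Dump the contents of a segment, including fetch and parse status,
# so you can see whether parsing failed for the records in question.
# Command and flag names are assumptions; verify for your version.
bin/nutch segread -dump crawl/segments/20060101000000

# Newer Nutch versions use readseg with an explicit output directory:
# bin/nutch readseg -dump crawl/segments/20060101000000 dump_out
```

In the dumped output, look at each record's fetch status and parse
status: a successful fetch paired with a parse error explains a record
with no stored outlinks.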
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com