Hello,

with a continued crawl, a couple of URLs that were definitely fetched during the
latest crawl are missing from the index. The indexer is run on all segments,
including those from previous crawler runs. The aim is to reduce the time of a
daily crawler run by avoiding fetching the content of unmodified pages and by
making use of adaptive (re)fetch scheduling.

The problem is caused by the way the crawled web server handles errors:
instead of an immediate 404, a redirect to an error page is sent.

If the following sequence of events occurs, a document may get lost from the index:

 day one
   http://xyz.com/page1.aspx  (fetch_success)

 day two (server problems)
   http://xyz.com/page1.aspx  (fetch_redir_temp)
     > http://xyz.com/error.aspx  (404: fetch_gone)

 day three (server ok)
   http://xyz.com/page1.aspx  (fetch_success)

The primary sorting criterion of CrawlDatum is the score, so if the redirected
page (resp. its CrawlDatum) by accident gets a higher score than the latest
one, the page may get lost even though it was fetched successfully during the
last crawler run.

The following patch would solve the problem:

         else if (CrawlDatum.hasFetchStatus(datum)) {
           // don't index unmodified (empty) pages
-          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
-            fetchDatum = datum;
+          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
+            // take the latest fetch datum regardless of the sorting of CrawlDatum
+            if (fetchDatum == null || datum.getFetchTime() >= fetchDatum.getFetchTime())
+              fetchDatum = datum;
+          }
         } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                    CrawlDatum.STATUS_SIGNATURE == datum.getStatus() ||
                    CrawlDatum.STATUS_PARSE_META == datum.getStatus()) {

The latest fetch datum is taken, unless it is fetch_notmodified.
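To illustrate the selection logic in isolation: the sketch below is plain Java with a minimal stand-in class (the names Datum, select, etc. are hypothetical, not the real Nutch CrawlDatum API). It shows that the chosen fetch datum no longer depends on the order in which the reduce values arrive:

```java
import java.util.Arrays;
import java.util.List;

// Minimal stand-in for a fetch datum; field and method names are
// illustrative only, not the real org.apache.nutch.crawl.CrawlDatum.
class Datum {
    static final byte FETCH_SUCCESS = 1;
    static final byte FETCH_GONE = 2;
    static final byte FETCH_NOTMODIFIED = 3;

    final byte status;
    final long fetchTime;  // epoch millis of the fetch

    Datum(byte status, long fetchTime) {
        this.status = status;
        this.fetchTime = fetchTime;
    }
}

public class LatestFetchDatum {

    // Mirrors the patched reduce logic: keep the most recent fetch
    // datum, skip fetch_notmodified entries, and ignore the order
    // (e.g. score-based sorting) in which the values are seen.
    static Datum select(List<Datum> values) {
        Datum fetchDatum = null;
        for (Datum datum : values) {
            if (datum.status == Datum.FETCH_NOTMODIFIED)
                continue;  // don't index unmodified (empty) pages
            if (fetchDatum == null || datum.fetchTime >= fetchDatum.fetchTime)
                fetchDatum = datum;
        }
        return fetchDatum;
    }

    public static void main(String[] args) {
        // Day two's fetch_gone arrives before day three's success, as if
        // it had sorted first by score; the later fetch still wins.
        Datum dayTwo = new Datum(Datum.FETCH_GONE, 2000L);
        Datum dayThree = new Datum(Datum.FETCH_SUCCESS, 3000L);
        Datum picked = select(Arrays.asList(dayTwo, dayThree));
        System.out.println(picked == dayThree);
    }
}
```

With the old logic the result would simply be the last value seen, so the score-sorted fetch_gone from day two could shadow day three's successful fetch.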


Are there any pitfalls? The situation is somewhat complicated.

By the way:

1. Is there a reason why the fetchDatum is checked at all?
   It could be enough to take the latest Content, ParseData, etc.
   if the current dbDatum is db_fetched or db_notmodified (or ...)

2. What about the SegmentMerger? The reduce functions of both Indexer and
   SegmentMerger should behave similarly, if not identically. I had a look:
   the SegmentMerger apparently keeps the latest fetchDatum (determined by
   the segment name/time-stamp). It does not check for fetch_notmodified.
   I didn't run a test to see whether this definitely leads to lost documents.

Regards and thanks,

Sebastian

P.S.: Of course, I agree that sending a redirect in the case of a temporary
server failure is not best practice. But I cannot change it...
