[EMAIL PROTECTED] wrote:
Hi All,

I think I am missing something.  In the Injector reduce code we have the
following.

------------------------------------------------------------------------
while (values.hasNext()) {
  CrawlDatum val = (CrawlDatum)values.next();
  if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
    injected = val;
    injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  } else {
    old = val;
  }
}

CrawlDatum res = null;
if (old != null) res = old; // don't overwrite existing value
else res = injected;
------------------------------------------------------------------------

Basically, if the value was not just injected, it is not overwritten.  But
I am not seeing how the input could contain a CrawlDatum that wasn't just
injected and so carries previous values.  Is this just in case someone
uses the Injector as a Reducer and not a Mapper, or am I missing how this
condition can occur?

This handles an important case: when you inject URLs that already exist in the DB, you have both the old value and the newly created value under the same key. In previous versions of Injector, CrawlDatum-s for such URLs could be overwritten with new values, and you could lose valuable metadata accumulated in the old values.
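
To make the two cases concrete, here is a minimal standalone sketch of the same merge rule. It does not use the real Nutch classes: Datum, its note field, and the numeric status constants are hypothetical stand-ins for CrawlDatum and its status codes, chosen only so the example runs on its own.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class InjectorMergeSketch {
    // Hypothetical status codes; the real values live in CrawlDatum.
    static final int STATUS_INJECTED = 1;
    static final int STATUS_DB_UNFETCHED = 2;
    static final int STATUS_DB_FETCHED = 3;

    // Hypothetical stand-in for CrawlDatum: a status plus a label
    // that plays the role of accumulated metadata.
    static final class Datum {
        int status;
        final String note;

        Datum(int status, String note) {
            this.status = status;
            this.note = note;
        }
    }

    // Same rule as the quoted reduce code: all values share one URL key.
    static Datum merge(Iterator<Datum> values) {
        Datum old = null;
        Datum injected = null;
        while (values.hasNext()) {
            Datum val = values.next();
            if (val.status == STATUS_INJECTED) {
                injected = val;
                injected.status = STATUS_DB_UNFETCHED;
            } else {
                old = val; // value that was already in the DB
            }
        }
        // Prefer the existing value so its metadata is not lost.
        return (old != null) ? old : injected;
    }

    public static void main(String[] args) {
        // URL already in the DB and injected again: the old value wins.
        List<Datum> existing = Arrays.asList(
            new Datum(STATUS_DB_FETCHED, "from db"),
            new Datum(STATUS_INJECTED, "from seed list"));
        System.out.println(merge(existing.iterator()).note); // prints "from db"

        // URL seen for the first time: the injected value is used.
        List<Datum> fresh = Arrays.asList(
            new Datum(STATUS_INJECTED, "from seed list"));
        System.out.println(merge(fresh.iterator()).note); // prints "from seed list"
    }
}
```

In the first call the key carries two values, one from the existing DB and one from the new seed list, which is exactly the situation the else branch is there for.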

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

