Hi Andrzej,

Does that mean that when you inject a URL that already exists in the
crawldb, its status changes to STATUS_DB_UNFETCHED?

Gal

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 15, 2007 8:47 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Injector checking for other than STATUS_INJECTED

[EMAIL PROTECTED] wrote:
> Hi All,
>
> I think I am missing something.  In the Injector reduce code we have the
> following.
>
> ------------------------------------------------------------------------
> while (values.hasNext()) {
>   CrawlDatum val = (CrawlDatum)values.next();
>   if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
>     injected = val;
>     injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
>   } else {
>     old = val;
>   }
> }
>
> CrawlDatum res = null;
> if (old != null) res = old; // don't overwrite existing value
> else res = injected;
> ------------------------------------------------------------------------
>
> Basically, if a value was not just injected, it is not overwritten.  But I
> am not seeing how the input could contain a CrawlDatum that wasn't just
> injected and so could carry previous values.  Is this just in case someone
> uses the Injector as a Reducer and not a Mapper, or am I missing how this
> condition can occur?
>   

This handles an important case: when you inject URLs that already exist 
in the DB, you have both the old value and the newly created value 
under the same key. In previous versions of Injector, CrawlDatum-s for 
such URLs could be overwritten with new values, and you could lose 
valuable metadata accumulated in the old values.
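
To make the two cases concrete, here is a minimal standalone sketch of the
same merge logic. It does not use the real Nutch or Hadoop classes: Datum is
a hypothetical stand-in for CrawlDatum, the status codes are placeholder
values rather than the real constants, and the "note" field only stands in
for accumulated metadata. Going by the quoted loop, an entry that already
exists in the DB is returned untouched, so its status is not reset; only a
URL with no existing entry ends up as STATUS_DB_UNFETCHED.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for Nutch's CrawlDatum; not the real class.
class Datum {
    // Placeholder status codes for illustration only.
    static final int STATUS_INJECTED = 1;
    static final int STATUS_DB_UNFETCHED = 2;
    static final int STATUS_DB_FETCHED = 3;

    private int status;
    final String note; // stands in for metadata accumulated over time

    Datum(int status, String note) {
        this.status = status;
        this.note = note;
    }

    int getStatus() { return status; }
    void setStatus(int status) { this.status = status; }
}

class InjectorMergeSketch {
    // Mirrors the quoted reduce loop: all values belong to one URL key.
    static Datum merge(Iterator<Datum> values) {
        Datum injected = null;
        Datum old = null;
        while (values.hasNext()) {
            Datum val = values.next();
            if (val.getStatus() == Datum.STATUS_INJECTED) {
                // Newly injected entry: becomes "unfetched" if it is kept.
                injected = val;
                injected.setStatus(Datum.STATUS_DB_UNFETCHED);
            } else {
                // Entry that was already in the DB.
                old = val;
            }
        }
        // Prefer the existing entry so its metadata is not lost.
        return (old != null) ? old : injected;
    }

    public static void main(String[] args) {
        Datum existing = new Datum(Datum.STATUS_DB_FETCHED, "fetched earlier");
        Datum fresh = new Datum(Datum.STATUS_INJECTED, "just injected");
        Datum res = merge(Arrays.asList(existing, fresh).iterator());
        // The existing entry wins, with its status unchanged.
        System.out.println(res.note + " / status " + res.getStatus());
    }
}
```

Note that the order of the values under the key does not matter here: the
existing entry is chosen whenever one is present.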

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




