Sorry.  I am still not getting this.  I understand the reason but I am 
not seeing how it works.

We inject a URL directory, which uses TextInputFormat to break the URL 
files into lines.  Those URLs are then filtered and scored.  If they pass 
filtering, they are given STATUS_INJECTED and collected by the mapper.  
As far as I can tell, the only input to the reduce function is the 
mapped CrawlDatums, which in my mind means there can't be any old (not 
STATUS_INJECTED) CrawlDatums at that point.
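
To make sure we are talking about the same thing, here is a minimal 
sketch of the map step as I read it.  This is plain Java, not the actual 
Injector code: the class name, the map() helper and the Predicate 
standing in for the URL filters are all hypothetical, and scoring is 
left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Simplified stand-in for the inject map step described above.
// Not the real Nutch Injector: names and types here are hypothetical.
class InjectMapSketch {
    static final String STATUS_INJECTED = "STATUS_INJECTED";

    // Takes the text of one input file (one URL per line) and returns
    // (url, status) pairs for every URL that passes the filter.
    static List<String[]> map(String text, Predicate<String> filter) {
        List<String[]> out = new ArrayList<>();
        for (String line : text.split("\n")) {
            String url = line.trim();
            if (url.isEmpty() || !filter.test(url)) {
                continue; // blank or filtered out: nothing is collected
            }
            // Every collected record carries STATUS_INJECTED, nothing else.
            out.add(new String[] { url, STATUS_INJECTED });
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "http://a.example/\nftp://b.example/\n\nhttp://c.example/";
        for (String[] rec : map(input, u -> u.startsWith("http://"))) {
            System.out.println(rec[0] + " " + rec[1]);
        }
    }
}
```

In this sketch everything that leaves the mapper is STATUS_INJECTED, 
which is exactly why I don't see where anything else would come from.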

The Reducer loops through the Datums, replacing STATUS_INJECTED with 
STATUS_DB_UNFETCHED, or using the old Datum if its status is not 
STATUS_INJECTED.  Again, where do the old Datums come from?
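
Here is a minimal sketch of that reduce logic as I understand it.  It is 
plain Java with a hypothetical Status enum and reduce() helper standing 
in for CrawlDatum and the real Reducer, and it assumes the old Datum 
wins no matter what order the values arrive in.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the inject reduce step described above.
// Not the real Nutch Injector: the enum and method are hypothetical.
class InjectReduceSketch {
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    // Receives all statuses collected for one URL and returns the one to keep.
    static Status reduce(List<Status> values) {
        Status injected = null;
        Status old = null;
        for (Status s : values) {
            if (s == Status.INJECTED) {
                injected = Status.DB_UNFETCHED; // newly injected URL
            } else {
                old = s;                        // an already existing Datum
            }
        }
        // Assumption: prefer the old Datum whenever one is present,
        // independent of the order of the values.
        return old != null ? old : injected;
    }

    public static void main(String[] args) {
        // Only an injected value: it becomes DB_UNFETCHED.
        System.out.println(reduce(Arrays.asList(Status.INJECTED)));
        // An old value is present: it is kept, in either order.
        System.out.println(reduce(Arrays.asList(Status.INJECTED, Status.DB_FETCHED)));
        System.out.println(reduce(Arrays.asList(Status.DB_FETCHED, Status.INJECTED)));
    }
}
```

The second and third calls only make sense if something other than 
STATUS_INJECTED can reach the Reducer, which is the part I am missing.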

I can understand the merge logic taking care of this, to make sure it 
doesn't overwrite something already fetched, etc., with a 
STATUS_DB_UNFETCHED, but I am not getting where the older Datums come 
from in the Reducer.

Dennis Kubes



Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi Andrzej,
>>
>> Does it mean that when you inject a URL that already exists in the 
>> crawldb, its status changes to STATUS_DB_UNFETCHED?
>>
>>   
> 
> With the current version of Injector - it won't. With previous versions 
> - it might, depending on the order of values received in reduce().
> 

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers