Sorry. I am still not getting this. I understand the reason but I am not seeing how it works.

We inject a URL directory, which uses TextInputFormat to break the input into lines, one URL per line. Those URLs are then filtered and scored. If they pass filtering, they are injected with STATUS_INJECTED and collected by the mapper. As far as I can tell, the only input to the reduce function is the mapped CrawlDatums, which in my mind means there can't be any old (not STATUS_INJECTED) CrawlDatums at that point.
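A minimal sketch of the map step as described above, written in Python rather than Java for brevity. It is a simplified model, not the Injector source: the filter rule, the default score, and the names normalize_and_filter and inject_map are assumptions made for illustration only.

```python
# Stand-in for CrawlDatum.STATUS_INJECTED (the real value is a Java constant).
STATUS_INJECTED = "injected"


def normalize_and_filter(url):
    """Hypothetical filter: trim whitespace and keep only http(s) URLs."""
    url = url.strip()
    if url.startswith(("http://", "https://")):
        return url
    return None


def inject_map(lines, score=1.0):
    """Yield (url, datum) for every input line that passes filtering."""
    for line in lines:
        url = normalize_and_filter(line)
        if url is None:
            continue  # rejected URLs are never collected
        yield url, {"status": STATUS_INJECTED, "score": score}


# One URL per line, as a line-oriented input format would deliver them.
seed_lines = [
    "http://example.com/\n",
    "ftp://example.com/file\n",
    "  https://example.org/a \n",
]
mapped = list(inject_map(seed_lines))
```

Every datum this step emits carries STATUS_INJECTED, which is why the map output alone cannot contain older entries.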

The Reducer loops through the Datums, replacing STATUS_INJECTED with STATUS_DB_UNFETCHED, or using the old Datum if it is not STATUS_INJECTED. Again, where do the old Datums come from?
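The reduce logic described above can be sketched roughly as follows, again in Python and again as a simplified model rather than the actual Injector code. The function name inject_reduce and the dictionary representation of a datum are assumptions. The sketch simply assumes that a non-injected datum can appear among the values for a URL; where such a datum would come from is the open question.

```python
# Stand-ins for the CrawlDatum status constants named in the text.
STATUS_INJECTED = "injected"
STATUS_DB_UNFETCHED = "db_unfetched"
STATUS_DB_FETCHED = "db_fetched"


def inject_reduce(url, values):
    """Prefer an existing (non-injected) datum; otherwise convert the
    injected datum to STATUS_DB_UNFETCHED."""
    old = None
    injected = None
    for datum in values:
        if datum["status"] == STATUS_INJECTED:
            # Copy the datum and rewrite its status; the input is not mutated.
            injected = dict(datum, status=STATUS_DB_UNFETCHED)
        else:
            old = datum
    return url, (old if old is not None else injected)


new_url = inject_reduce("http://example.com/", [{"status": STATUS_INJECTED}])
known_url = inject_reduce(
    "http://example.org/",
    [{"status": STATUS_INJECTED}, {"status": STATUS_DB_FETCHED}],
)
```

Because the old datum is remembered separately and chosen at the end, the result of this sketch does not depend on the order in which the values arrive.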

I can understand the merge logic taking care of this, to make sure it doesn't overwrite something already fetched, etc., with a STATUS_DB_UNFETCHED, but I am not getting where the older Datums come from in the Reducer.

Dennis Kubes



Andrzej Bialecki wrote:
Gal Nitzan wrote:
Hi Andrzej,

Does it mean that when you inject a URL that already exists in the crawldb, its
status changes to STATUS_DB_UNFETCHED?


With the current version of Injector, it won't. With previous versions it might, depending on the order of values received in reduce().
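To illustrate how the outcome of reduce() can depend on the order of its values, here is a deliberately naive reducer in Python in which the last value seen wins. It is hypothetical and is not the code of any Nutch version; the name last_value_wins_reduce and the status strings are assumptions for illustration.

```python
def last_value_wins_reduce(values):
    """Hypothetical order-dependent reducer: keeps whichever datum comes last,
    converting an injected status to db_unfetched along the way."""
    result = None
    for datum in values:
        status = datum["status"]
        if status == "injected":
            status = "db_unfetched"
        result = {"status": status}  # overwrites any earlier choice
    return result


fetched = {"status": "db_fetched"}
injected = {"status": "injected"}

# The same two values in a different order give a different result.
injected_last = last_value_wins_reduce([fetched, injected])
fetched_last = last_value_wins_reduce([injected, fetched])
```

With the injected datum last, the already fetched entry is replaced by an unfetched one; with the fetched datum last, it survives.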
