Sorry. I am still not getting this. I understand the reason but I am
not seeing how it works.
We inject a URL directory, which is read with TextInputFormat so that
each line is one URL. Those URLs are then filtered and scored. If they
pass filtering, they are given STATUS_INJECTED and collected by the
mapper. As far as I can tell, the only input to the reduce function is
the mapped CrawlDatums, which in my mind means there can't be any old
(not STATUS_INJECTED) CrawlDatums at that point.
The Reducer loops through the Datums, replacing STATUS_INJECTED with
STATUS_DB_UNFETCHED, or using the old Datum if it is not
STATUS_INJECTED. Again, where do the old Datums come from?
I can understand the merge logic taking care of this, to make sure it
doesn't overwrite something already fetched, etc., with a
STATUS_DB_UNFETCHED, but I am not getting where the older Datums come
from in the Reducer.
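To make the reducer behaviour described above concrete, here is a minimal
sketch of that merge rule. It is not the Nutch source: the Status enum,
the Datum class and the reduce() signature are simplified stand-ins
invented for illustration, and it simply assumes that the values for one
URL may contain an older Datum next to the injected one.

```java
import java.util.List;

class InjectReduceSketch {
    // Stand-ins for the CrawlDatum status constants named in this thread.
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    // Hypothetical minimal datum: only the status matters for this sketch.
    static final class Datum {
        final Status status;
        Datum(Status status) { this.status = status; }
    }

    // Merge all values seen for a single URL.
    // An older (non-injected) datum always wins; otherwise the injected
    // datum is kept, with its status rewritten to DB_UNFETCHED.
    // Returns null if the list is empty.
    static Datum reduce(List<Datum> values) {
        Datum old = null;
        Datum injected = null;
        for (Datum d : values) {
            if (d.status == Status.INJECTED) {
                injected = new Datum(Status.DB_UNFETCHED);
            } else {
                old = d;
            }
        }
        return old != null ? old : injected;
    }
}
```

Because the choice is made only after the loop has seen every value, the
result in this sketch does not depend on the order in which the values
arrive.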
Dennis Kubes
Andrzej Bialecki wrote:
Gal Nitzan wrote:
Hi Andrzej,
Does it mean that when you inject a URL that already exists in the
crawldb, its status changes to STATUS_DB_UNFETCHED?
With the current version of Injector - it won't. With previous versions
- it might, depending on the order of values received in reduce().