Sorry. I am still not getting this. I understand the reason but I am not seeing how it works.
We inject a URL directory, which uses TextInputFormat and breaks the input into lines. Those URLs are then filtered and scored. If they pass filtering, they are given STATUS_INJECTED and collected by the mapper. As far as I can tell, the mapped CrawlDatums are the only input to the reduce function, which in my mind means there can't be any old (not STATUS_INJECTED) CrawlDatums at that point.

The Reducer loops through the Datums, replacing STATUS_INJECTED with STATUS_DB_UNFETCHED, or using the old Datum if it is not STATUS_INJECTED. Again, where do the old Datums come from? I can understand the merge logic taking care of this, to make sure it doesn't overwrite something already fetched, etc. with a STATUS_DB_UNFETCHED, but I am not getting where the older Datums come from in the Reducer.

Dennis Kubes

Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi Andrzej,
>>
>> Does it mean that when you inject an existing (in crawldb) URL it
>> changes its status to STATUS_DB_UNFETCHED?
>
> With the current version of Injector - it won't. With previous versions
> - it might, depending on the order of values received in reduce().
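
To make the question concrete, here is a minimal sketch of the reduce step as I read it. It is plain Java with no Nutch or Hadoop dependencies; the class name InjectReduceSketch, the Status enum, and the reduce() signature are made up for illustration and are not the actual Injector code. It assumes an old (non-injected) value wins regardless of the order in which values arrive.

```java
import java.util.Arrays;
import java.util.List;

public class InjectReduceSketch {
    // Hypothetical stand-ins for the CrawlDatum status constants discussed above.
    enum Status { INJECTED, DB_UNFETCHED, DB_FETCHED }

    // Picks the status to keep for one URL from all values seen for that key.
    // An injected value becomes DB_UNFETCHED; any other (old) value is kept
    // in preference to it, so the result does not depend on value order.
    static Status reduce(List<Status> values) {
        Status injected = null;
        Status old = null;
        for (Status s : values) {
            if (s == Status.INJECTED) {
                injected = Status.DB_UNFETCHED;
            } else {
                old = s;
            }
        }
        return old != null ? old : injected;
    }

    public static void main(String[] args) {
        // Only a freshly injected value for this URL.
        System.out.println(reduce(Arrays.asList(Status.INJECTED)));
        // An injected value plus an old value, in both orders.
        System.out.println(reduce(Arrays.asList(Status.INJECTED, Status.DB_FETCHED)));
        System.out.println(reduce(Arrays.asList(Status.DB_FETCHED, Status.INJECTED)));
    }
}
```

The second and third calls only make sense if something other than the mapper output feeds the reducer, which is exactly the part I am not seeing.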