Hi,

there is something else that confuses me, and it would be great to get some hints. The call CrawlDb.createJob(...) creates the crawl db update job. In this method the main input folder is defined:
job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
However, in the update method (lines 48 and 49) two more input dirs are added.
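If I read the code correctly, those two lines look roughly like this (a sketch from memory; I'm assuming the segment folder names come from the CrawlDatum.FETCH_DIR_NAME and CrawlDatum.PARSE_DIR_NAME constants, please correct me if they differ):

job.addInputDir(new File(segment, CrawlDatum.FETCH_DIR_NAME));
job.addInputDir(new File(segment, CrawlDatum.PARSE_DIR_NAME));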

This confuses me. In theory I understand that the parsed data are needed to add fresh URLs to the crawldb, but I'm surprised that, first of all, both folders are added. Secondly, I can't find the code that writes CrawlDatum objects into these folders; instead I found that the fetch output format writes ParseImpl and Content into them. I also find no code where these objects are converted or merged together. So I'm asking myself why these folders are added, and where and how the fresh CrawlDatum objects that get merged into the new crawldb come from.

Thirdly, wouldn't it be cleaner to move the adding of these folders into the createJob method as well? Something like the sketch below.
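Just to illustrate what I mean (a rough sketch only; passing the segment into createJob is my suggestion, not existing code, and I've elided the rest of the job setup):

public static JobConf createJob(Configuration config, File crawlDb, File segment) {
  JobConf job = new NutchJob(config);
  // existing input: the current crawl db
  job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
  // moved here from update(): the per-segment fetch and parse output
  job.addInputDir(new File(segment, CrawlDatum.FETCH_DIR_NAME));
  job.addInputDir(new File(segment, CrawlDatum.PARSE_DIR_NAME));
  // ... remaining mapper/reducer/output setup unchanged ...
  return job;
}

That way all the job's inputs would be declared in one place.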

Thanks for any hints.
Stefan
