Hi,

there is something else that confuses me, and it would be great to get some hints. The call CrawlDb.createJob(...) creates the crawl db update job. In this method the main input folder is defined:
job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
However, in the update method (lines 48 and 49) two more input dirs are added.
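If I read the code correctly, those two lines look roughly like this (a sketch from memory; I'm assuming the segment folder names come from the CrawlDatum.FETCH_DIR_NAME and CrawlDatum.PARSE_DIR_NAME constants, please correct me if they differ):

job.addInputDir(new File(segment, CrawlDatum.FETCH_DIR_NAME));
job.addInputDir(new File(segment, CrawlDatum.PARSE_DIR_NAME));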

This confuses me. In theory I understand that the parsed data are needed to add fresh URLs to the crawldb, but I'm surprised that, first of all, both folders are added. Secondly, I can't find the code that writes CrawlDatum objects into these folders; instead I found that the fetch output format writes ParseImpl and Content into them. I also find no code where these objects are converted or merged together. So I'm asking myself why these folders are added, and where and how the fresh CrawlDatum objects that get merged into the new crawldb come from.

Thirdly, wouldn't it be cleaner to move the adding of these folders into the createJob method as well? Something like the sketch below.
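Just to illustrate what I mean (a rough sketch only; passing the segment into createJob is my suggestion, not existing code, and I've elided the rest of the job setup):

public static JobConf createJob(Configuration config, File crawlDb, File segment) {
  JobConf job = new NutchJob(config);
  // existing input: the current crawl db
  job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
  // moved here from update(): the per-segment fetch and parse output
  job.addInputDir(new File(segment, CrawlDatum.FETCH_DIR_NAME));
  job.addInputDir(new File(segment, CrawlDatum.PARSE_DIR_NAME));
  // ... remaining mapper/reducer/output setup unchanged ...
  return job;
}

That way all the job's inputs would be declared in one place.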

Thanks for any hints.
Stefan
