Stefan Groschupf wrote:
> The call CrawlDb.createJob(...) creates the crawl db update job. In
> this method the main input folder is defined:
>   job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
> However, in the update method (lines 48, 49) two more input dirs are
> added. This confuses me: in theory I understand that the parsed data
> are needed to add fresh urls into the crawldb, but I'm surprised,
> first of all, that both folders are added.
One is from the fetcher, the other from the parser.
The fetcher writes a CrawlDatum for each page fetched, with STATUS_FETCH_*.
The parser writes a CrawlDatum for each link found, with a STATUS_LINKED.
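For reference, the two extra inputs added in update() look roughly like
this (the directory-name constants are the ones CrawlDatum defines;
take the exact lines as a paraphrase of the source, not a verbatim
quote):

    job.addInputDir(new File(segment, CrawlDatum.FETCH_DIR_NAME)); // fetcher: STATUS_FETCH_*
    job.addInputDir(new File(segment, CrawlDatum.PARSE_DIR_NAME)); // parser: STATUS_LINKED, one per outlink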
> Secondly, I can't find the code that writes CrawlDatum objects into
> these folders; instead I found that the FetcherOutputFormat writes
> ParseImpl and Content into these folders.
FetcherOutputFormat line 73, and ParseOutputFormat line 107.
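The write path is easy to miss because each output format fans a single
record out into several per-type files. A minimal sketch of the shape
of that code (the wrapper type and getter names here are assumptions
for illustration, not the verbatim source):

    // inside the RecordWriter returned by FetcherOutputFormat
    public void write(WritableComparable key, Writable value)
        throws IOException {
      FetcherOutput fo = (FetcherOutput) value;   // assumed wrapper type
      contentOut.append(key, fo.getContent());    // -> segment/content
      fetchOut.append(key, fo.getCrawlDatum());   // -> segment/crawl_fetch
    }

ParseOutputFormat does the analogous thing on the parse side, roughly
emitting one STATUS_LINKED CrawlDatum per outlink into
segment/crawl_parse.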
> I also find no code where these objects are converted or merged together.
CrawlDbReducer.reduce().
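In outline, that reduce() collapses all the CrawlDatum entries
collected for one url into a single new db entry, along these lines (a
simplified sketch against the old Nutch mapred API; the real method
also handles retry counters, scores, and fetch intervals):

    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      CrawlDatum result = null;
      while (values.hasNext()) {
        CrawlDatum datum = (CrawlDatum) values.next();
        // keep the most authoritative entry: a fetch result beats the
        // old db entry, which beats a bare STATUS_LINKED link
        if (result == null || outranks(datum, result)) // outranks() is a hypothetical helper
          result = datum;
      }
      output.collect(key, result);
    }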
> Thirdly, wouldn't it be cleaner to move the adding of these folders
> into the createJob method as well?
No, the createJob() method is also used by the Injector, where these
directories are not appropriate.
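Schematically, the Injector reuses the shared job like this (variable
names are hypothetical), adding only its own input:

    JobConf job = CrawlDb.createJob(conf, crawlDb);
    job.addInputDir(injectedDir);  // temp dir of freshly injected urls
    JobClient.runJob(job);

so each caller adds whatever extra inputs make sense for it.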
Doug