Thanks for the clarification, I missed all these cross-links!
You definitely 'are in the know'. :-)
Stefan



On 31.01.2006, at 20:31, Doug Cutting wrote:

Stefan Groschupf wrote:
> The call CrawlDb.createJob(...) creates the crawl db update job. In this method the main input folder is defined:
> job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
> However, in the update method (lines 48-49) two more input dirs are added. This confuses me: theoretically I understand that the parsed data are needed to add fresh URLs into the crawl db, but I'm surprised that both folders are added in the first place.

One is from the fetcher, the other from the parser.

The fetcher writes a CrawlDatum for each page fetched, with STATUS_FETCH_*.

The parser writes a CrawlDatum for each link found, with a STATUS_LINKED.
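Concretely, the update path ends up looking something like this (a simplified sketch, not the literal source; I've spelled out the "crawl_fetch"/"crawl_parse" directory names instead of the constants, and conf/crawlDb/segment are assumed to be in scope):

  // Sketch of CrawlDb.update(): createJob() wires up the existing db,
  // then the two per-segment outputs are added on top.
  JobConf job = CrawlDb.createJob(conf, crawlDb);     // reads the main db dir added by createJob()
  job.addInputDir(new File(segment, "crawl_fetch"));  // CrawlDatum w/ STATUS_FETCH_*
  job.addInputDir(new File(segment, "crawl_parse"));  // CrawlDatum w/ STATUS_LINKED
  JobClient.runJob(job);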

> Secondly, I can't find the code that writes CrawlDatum objects into these folders; instead I found that the FetcherOutputFormat writes ParseImpl and Content into these folders.

FetcherOutputFormat line 73, and ParseOutputFormat line 107.
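So each output format writes several parallel files per segment, and the CrawlDatum goes out alongside the Content/ParseImpl you spotted. The write path looks roughly like this (a sketch; the FetcherOutput accessor names and the fetchOut/contentOut writers are illustrative, not copied from the source):

  // Inside the RecordWriter returned by FetcherOutputFormat (simplified);
  // fetchOut/contentOut stand for the segment's crawl_fetch and content writers.
  public void write(WritableComparable key, Writable value) throws IOException {
    FetcherOutput fo = (FetcherOutput) value;
    fetchOut.append(key, fo.getCrawlDatum());  // crawl_fetch: STATUS_FETCH_* CrawlDatum
    contentOut.append(key, fo.getContent());   // content: raw page data for the parser
  }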

> I also found no code where these objects are converted or merged together.

CrawlDbReducer.reduce().
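For each URL, the reducer sees the old db entry plus whatever the segment contributed, and folds them into a single new CrawlDatum. A heavily simplified sketch of that logic (the real method also handles retry counters, fetch intervals, signatures, and so on):

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter) throws IOException {
    CrawlDatum old = null;     // existing db entry, if any
    CrawlDatum fetch = null;   // STATUS_FETCH_* entry from crawl_fetch
    boolean linked = false;    // saw a STATUS_LINKED entry from crawl_parse
    while (values.hasNext()) {
      CrawlDatum datum = (CrawlDatum) values.next();
      switch (datum.getStatus()) {
        case CrawlDatum.STATUS_LINKED:       linked = true; break;
        case CrawlDatum.STATUS_DB_UNFETCHED:
        case CrawlDatum.STATUS_DB_FETCHED:   old = datum;   break;
        default:                             fetch = datum; break; // fetch statuses
      }
    }
    CrawlDatum result;
    if (fetch != null) {            // fetched this round: update the db entry
      result = fetch;
      result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
    } else if (old != null) {       // untouched: keep the existing entry
      result = old;
    } else if (linked) {            // brand-new URL discovered by the parser
      result = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, 30f); // 30-day interval: an assumption
    } else {
      return;                       // nothing usable for this key
    }
    output.collect(key, result);
  }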

> Thirdly, wouldn't it be cleaner to move the adding of these folders into the createJob() method as well?

No, the createJob() method is also used by the Injector, where these directories are not appropriate.
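The reuse is easy to see side by side: the Injector builds the same base job but adds its own input instead of segment output. A sketch (tempDir is a hypothetical name for the Injector's intermediate output of injected CrawlDatum entries):

  // Sketch of the Injector side (simplified): same base job as update(),
  // different extra input.
  JobConf job = CrawlDb.createJob(conf, crawlDb);  // the existing db only
  job.addInputDir(tempDir);                        // freshly injected URLs
  JobClient.runJob(job);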

Doug

