Stefan Groschupf wrote:
> The call CrawlDb.createJob(...) creates the crawl db update job. In
> this method the main input folder is defined:
>   job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
> However, in the update method (lines 48, 49) two more input dirs are
> added. This confuses me: in theory I understand that the parsed data
> are needed to add fresh urls into the crawldb, but I'm surprised,
> first of all, that both folders are added.
One is from the fetcher, the other from the parser.
The fetcher writes a CrawlDatum for each page fetched, with STATUS_FETCH_*.
The parser writes a CrawlDatum for each link found, with a STATUS_LINKED.
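For reference, the two extra inputs added in update() look roughly like
this (the directory-name constants are the ones CrawlDatum defines;
take the exact lines as a paraphrase of the source, not a verbatim
quote):

    job.addInputDir(new File(segment, CrawlDatum.FETCH_DIR_NAME)); // fetcher: STATUS_FETCH_*
    job.addInputDir(new File(segment, CrawlDatum.PARSE_DIR_NAME)); // parser: STATUS_LINKED, one per outlink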
> Secondly, I can't find the code that writes CrawlDatum objects into
> these folders; instead I found that the FetcherOutputFormat writes
> ParseImpl and Content into these folders.
FetcherOutputFormat line 73, and ParseOutputFormat line 107.
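The write path is easy to miss because each output format fans a single
record out into several per-type files. A minimal sketch of the shape
of that code (the wrapper type and getter names here are assumptions
for illustration, not the verbatim source):

    // inside the RecordWriter returned by FetcherOutputFormat
    public void write(WritableComparable key, Writable value)
        throws IOException {
      FetcherOutput fo = (FetcherOutput) value;   // assumed wrapper type
      contentOut.append(key, fo.getContent());    // -> segment/content
      fetchOut.append(key, fo.getCrawlDatum());   // -> segment/crawl_fetch
    }

ParseOutputFormat does the analogous thing on the parse side, roughly
emitting one STATUS_LINKED CrawlDatum per outlink into
segment/crawl_parse.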
> I also find no code where these objects are converted or merged together.
CrawlDbReducer.reduce().
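In outline, that reduce() collapses all the CrawlDatum entries
collected for one url into a single new db entry, along these lines (a
simplified sketch against the old Nutch mapred API; the real method
also handles retry counters, scores, and fetch intervals):

    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      CrawlDatum result = null;
      while (values.hasNext()) {
        CrawlDatum datum = (CrawlDatum) values.next();
        // keep the most authoritative entry: a fetch result beats the
        // old db entry, which beats a bare STATUS_LINKED link
        if (result == null || outranks(datum, result)) // outranks() is a hypothetical helper
          result = datum;
      }
      output.collect(key, result);
    }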
> Thirdly, wouldn't it be cleaner to move the adding of these folders
> into the createJob method as well?
No, the createJob() method is also used by the Injector, where these
directories are not appropriate.
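Schematically, the Injector reuses the shared job like this (variable
names are hypothetical), adding only its own input:

    JobConf job = CrawlDb.createJob(conf, crawlDb);
    job.addInputDir(injectedDir);  // temp dir of freshly injected urls
    JobClient.runJob(job);

so each caller adds whatever extra inputs make sense for it.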
Doug