Thanks for the clarification, I missed all these cross-links!
You definitely 'are in the know'. :-)
Stefan
On 31.01.2006, at 20:31, Doug Cutting wrote:
Stefan Groschupf wrote:
> The call CrawlDb.createJob(...) creates the crawl db update job.
> In this method the main input folder is defined:
>
>   job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
>
> However, in the update method (lines 48 and 49) two more input
> dirs are added.
> This confuses me: theoretically I understand that the parsed data
> are needed to add fresh URLs into the crawldb, but I'm surprised,
> first of all, that both folders are added.
One is from the fetcher, the other from the parser.
The fetcher writes a CrawlDatum for each page fetched, with
STATUS_FETCH_*.
The parser writes a CrawlDatum for each link found, with
STATUS_LINKED.
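
To make those two record streams concrete, here is a minimal,
self-contained sketch in plain Java (stand-in types, not actual Nutch
code): Datum and the status constants are simplified stand-ins for
CrawlDatum and its STATUS_* values, and the map models how MapReduce
groups the segment output by URL before the reducer sees it.

import java.util.*;

public class SegmentOutputSketch {
  // Simplified stand-ins for Nutch's CrawlDatum status constants.
  static final byte STATUS_FETCH_SUCCESS = 1;  // written by the fetcher
  static final byte STATUS_LINKED = 2;         // written by the parser, one per outlink

  // Simplified stand-in for org.apache.nutch.crawl.CrawlDatum.
  static class Datum {
    final byte status;
    Datum(byte status) { this.status = status; }
    public String toString() { return status == STATUS_LINKED ? "linked" : "fetched"; }
  }

  public static void main(String[] args) {
    // The framework groups records by key (the URL), so the reducer
    // later sees everything known about one URL together.
    Map<String, List<Datum>> byUrl = new TreeMap<>();

    // The fetcher emits one record per page it fetched...
    byUrl.computeIfAbsent("http://example.com/", k -> new ArrayList<>())
         .add(new Datum(STATUS_FETCH_SUCCESS));

    // ...while the parser emits one record per link found in that page.
    byUrl.computeIfAbsent("http://example.com/about", k -> new ArrayList<>())
         .add(new Datum(STATUS_LINKED));

    byUrl.forEach((url, datums) -> System.out.println(url + " -> " + datums));
  }
}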
> Secondly, I can't find the code that writes CrawlDatum objects
> into these folders; instead I found that FetcherOutputFormat
> writes ParseImpl and Content into these folders
> (FetcherOutputFormat line 73, and ParseOutputFormat line 107).
> I also can't find the code where these objects are converted or
> merged together.
CrawlDbReducer.reduce().
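
For illustration, here is a minimal sketch of the kind of merge such a
reduce() performs, again with simplified stand-in statuses rather than
Nutch's real CrawlDatum states: for one URL, the old db entry, the
fetch result, and any linked records are folded into a single updated
entry. The precedence rules below are an assumption for the sketch,
not the actual CrawlDbReducer logic.

import java.util.*;

public class CrawlDbMergeSketch {
  static final byte STATUS_DB_UNFETCHED = 0;   // entry already in the crawldb
  static final byte STATUS_FETCH_SUCCESS = 1;  // segment fetch result
  static final byte STATUS_LINKED = 2;         // newly discovered outlink

  static class Datum {
    final byte status;
    Datum(byte status) { this.status = status; }
    public String toString() {
      return new String[]{"db_unfetched", "fetch_success", "linked"}[status];
    }
  }

  // Folds every record for one URL into a single updated db entry,
  // roughly in the spirit of CrawlDbReducer.reduce().
  static Datum reduce(List<Datum> values) {
    Datum result = null;
    for (Datum d : values) {
      if (d.status == STATUS_FETCH_SUCCESS) {
        result = d;  // a fresh fetch result wins over everything else
      } else if (result == null) {
        // Either the old db entry, or a link to a never-seen URL: both
        // become an unfetched db entry until something better shows up.
        result = new Datum(STATUS_DB_UNFETCHED);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // A URL discovered only via links becomes a new unfetched entry...
    System.out.println(reduce(Arrays.asList(new Datum(STATUS_LINKED))));
    // ...while a fetched URL keeps its fetch status.
    System.out.println(reduce(Arrays.asList(
        new Datum(STATUS_DB_UNFETCHED), new Datum(STATUS_FETCH_SUCCESS))));
  }
}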
> Thirdly, wouldn't it be cleaner to move the adding of these
> folders into the createJob method as well?
No, the createJob() method is also used by the Injector, where
these directories are not appropriate.
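
A minimal sketch of that split, with a stand-in JobConf instead of the
real MapReduce classes; the directory names 'current', 'crawl_fetch'
and 'crawl_parse' are assumptions based on Nutch's usual segment
layout, not code taken from CrawlDb.

import java.io.File;
import java.util.*;

public class CrawlDbJobSketch {
  // Stand-in for the MapReduce job configuration; only what we need here.
  static class JobConf {
    final List<File> inputDirs = new ArrayList<>();
    void addInputDir(File dir) { inputDirs.add(dir); }
  }

  // Shared by the Injector and by update(): only the current db is an input.
  static JobConf createJob(File crawlDb) {
    JobConf job = new JobConf();
    job.addInputDir(new File(crawlDb, "current"));  // i.e. CrawlDatum.DB_DIR_NAME
    return job;
  }

  // Only update() adds the per-segment fetch and parse outputs; the
  // Injector reuses createJob() and must not read any segment data.
  static JobConf createUpdateJob(File crawlDb, File segment) {
    JobConf job = createJob(crawlDb);
    job.addInputDir(new File(segment, "crawl_fetch"));
    job.addInputDir(new File(segment, "crawl_parse"));
    return job;
  }

  public static void main(String[] args) {
    JobConf job = createUpdateJob(new File("crawldb"), new File("segments/20060131"));
    System.out.println(job.inputDirs);
  }
}

The point is that createJob() only wires up what both callers share, so
the Injector can reuse it unchanged.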
Doug