Nutch internals

Uroš Gruber Tue, 29 Aug 2006 05:12:25 -0700

Hi,

I am making some changes to CrawlDatum, but there are some things I don't quite understand.

My idea is to add an int hop field to CrawlDatum and set it to 0 in Injector. Then, after fetching, the hop of each newly found URL can be calculated as the parent URL's hop + 1.

I am trying to find where new URLs are added to the WebDB. Could somebody explain this to me? This is how I understand the process so far:

1. Inject (URLs are read from the URL file, filtered through the enabled filters, and stored in the WebDB).

2. After that, generate is started. Here the WebDB is read to create a list of URLs to fetch.

3. The Fetcher fetches the URLs and stores them in the segment directories.

4. updatedb: if I understand correctly, the data from segment/*/crawl_parse is merged with the current WebDB. If so, the WebDB data in the segment is created during fetching.

I think it's possible to get the CrawlDatum of the URL being fetched during fetching, and then use its hop number to calculate the hop for all other URLs found on the current page and store it.
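A minimal sketch of the calculation I have in mind, written in plain Java rather than against the real Nutch classes. HopDatum, its hop field, and the inject and outlinkData methods are hypothetical names made up for illustration; they are not part of the existing CrawlDatum or Injector API.

```java
import java.util.ArrayList;
import java.util.List;

public class HopSketch {

    /** Hypothetical stand-in for a CrawlDatum that carries a hop count. */
    public static final class HopDatum {
        public final String url;
        public final int hop;

        public HopDatum(String url, int hop) {
            this.url = url;
            this.hop = hop;
        }
    }

    /** Injected seed URLs start at hop 0. */
    public static HopDatum inject(String url) {
        return new HopDatum(url, 0);
    }

    /** Every URL found on the fetched page gets the parent's hop + 1. */
    public static List<HopDatum> outlinkData(HopDatum parent, List<String> outlinks) {
        List<HopDatum> result = new ArrayList<HopDatum>();
        for (String link : outlinks) {
            result.add(new HopDatum(link, parent.hop + 1));
        }
        return result;
    }
}
```

So a seed URL has hop 0, the pages it links to have hop 1, the pages those link to have hop 2, and so on.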


Maybe I have missed the whole concept here.

After that, I can use this hop number to limit the generated fetch lists.
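Roughly, the limit could work like the sketch below. It is again plain Java with hypothetical names (HopLimitSketch, generate, maxHop), and a simple map from URL to hop stands in for the WebDB; it is not how the real Generator is written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HopLimitSketch {

    /**
     * Builds a fetch list containing only the URLs whose hop count
     * is at most maxHop. The map is a stand-in for the WebDB.
     */
    public static List<String> generate(Map<String, Integer> hopByUrl, int maxHop) {
        List<String> fetchList = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : hopByUrl.entrySet()) {
            if (entry.getValue() <= maxHop) {
                fetchList.add(entry.getKey());
            }
        }
        return fetchList;
    }
}
```

With maxHop set to 1, for example, only the injected URLs and the pages they link to directly would end up in the fetch list.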

regards

Uros
