Hi,

I am making some changes to CrawlDatum, but there are some things I don't quite understand.

My idea is to add an int hop field to CrawlDatum and set it to 0 in the Injector. Then, for every URL discovered while fetching, the value can be calculated as the parent URL's hop + 1.

I am trying to find where new URLs are added to the WebDB. Could somebody explain this to me? This is how I understand the process so far:

1. Inject: URLs are read from the url file, filtered through the enabled Filters and stored in the WebDB.

2. Generate: the WebDB is read to create a list of URLs to fetch.

3. Fetcher: the URLs are fetched and the results are stored in the segment dirs.

4. updatedb: if I understand correctly, the data from segment/*/crawl_parse is merged with the current WebDB. If so, the data that ends up in the WebDB is already created in the segment during fetching.

I think it should be possible to get the CrawlDatum of the URL being fetched while fetching, use its hop number to calculate the hop for all the other URLs found on the current page, and store that with them.
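
To make the idea concrete, here is a minimal standalone sketch in Java. It does not use the real Nutch classes: HopDatum, inject, outlinkHops and mergeHop are hypothetical names that only stand in for a CrawlDatum with an extra hop field. Keeping the smallest hop when the same URL is reached by several paths is my assumption, not something Nutch does today.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HopSketch {

    /** Hypothetical stand-in for a CrawlDatum that carries an extra hop field. */
    static final class HopDatum {
        final String url;
        final int hop;

        HopDatum(String url, int hop) {
            this.url = url;
            this.hop = hop;
        }
    }

    /** Injected (seed) URLs start at hop 0. */
    static HopDatum inject(String url) {
        return new HopDatum(url, 0);
    }

    /** Every outlink found on a fetched page gets the parent's hop + 1. */
    static Map<String, Integer> outlinkHops(HopDatum parent, List<String> outlinks) {
        Map<String, Integer> hops = new LinkedHashMap<>();
        for (String link : outlinks) {
            hops.put(link, parent.hop + 1);
        }
        return hops;
    }

    /** Assumption: if a URL is already known, keep the smaller hop count. */
    static int mergeHop(int existingHop, int newHop) {
        return Math.min(existingHop, newHop);
    }

    public static void main(String[] args) {
        HopDatum seed = inject("http://example.com/");
        Map<String, Integer> hops = outlinkHops(
                seed, Arrays.asList("http://example.com/a", "http://example.com/b"));
        System.out.println(hops);
    }
}
```

In real code the merge step would belong wherever updatedb combines the segment data with the existing entries.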

Maybe I have missed the whole concept here.

After that I can use this hop number to limit which URLs go into the generated fetch lists.
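
The limiting step could look roughly like the sketch below. Again this is plain Java, not Nutch code: selectForFetch and maxHops are hypothetical names, and the map of URL to hop count stands in for whatever the generate step reads from the WebDB.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HopLimitSketch {

    /** Returns the URLs whose hop count is at most maxHops, in input order. */
    static List<String> selectForFetch(Map<String, Integer> hopByUrl, int maxHops) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : hopByUrl.entrySet()) {
            if (entry.getValue() <= maxHops) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Integer> hopByUrl = new LinkedHashMap<>();
        hopByUrl.put("http://example.com/", 0);
        hopByUrl.put("http://example.com/a", 1);
        hopByUrl.put("http://example.com/a/deep", 2);

        // With maxHops = 1 only the seed and its direct outlink are selected.
        System.out.println(selectForFetch(hopByUrl, 1));
    }
}
```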

regards

Uros
