Andrzej Bialecki wrote:
Uroš Gruber wrote:
I made a draft patch, but there are still some problems I see. I
know the code needs to be cleaned up and tested. Right now I don't
know what value to set for external URLs; for internal links it works great.
(The patch changes CrawlDatum itself; I think it would be better to
put the hop counter in CrawlDatum.metaData.)
I can try to do it with metaData.
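To make the metaData idea concrete, here is a minimal sketch (untested, and the "_hop_" key name is just something I made up), assuming CrawlDatum exposes its metaData as a Hadoop MapWritable via getMetaData():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class HopUtil {
  // Hypothetical key for the hop counter in CrawlDatum.metaData.
  private static final Text HOP_KEY = new Text("_hop_");

  // A missing key is treated as hop 0, which matches injected URLs.
  public static int getHop(CrawlDatum datum) {
    MapWritable meta = datum.getMetaData();
    IntWritable hop = (meta == null) ? null : (IntWritable) meta.get(HOP_KEY);
    return (hop == null) ? 0 : hop.get();
  }

  public static void setHop(CrawlDatum datum, int hop) {
    datum.getMetaData().put(HOP_KEY, new IntWritable(hop));
  }
}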
Here is the whole idea of these changes:
Injected URLs always get hop 0. While fetching/updating/generating,
the hop value is incremented by 1 (still no idea what to do with
external links). Then I can add a config value, max_hop etc., to stop
the fetcher and generator from creating more URLs past that depth.
This way it's possible to limit crawling vertically.
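As a rough sketch of the limiting side (reusing the HopUtil helper above, with "crawl.max.hop" as a made-up property name), the generator/fetcher could simply skip anything already at the limit:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.CrawlDatum;

public class HopFilter {
  // True if this URL may still produce new URLs; a negative
  // crawl.max.hop means the limit is disabled.
  public static boolean underMaxHop(CrawlDatum datum, Configuration conf) {
    int maxHop = conf.getInt("crawl.max.hop", -1);
    if (maxHop < 0) return true;
    return HopUtil.getHop(datum) < maxHop;
  }
}

New outlinks would then get HopUtil.setHop(outlinkDatum, HopUtil.getHop(parentDatum) + 1).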
Comments are welcome.
Well, it really depends on what you want to do when you encounter an
external link. Do you want to restart the counter, i.e. crawl the new
site at full depth up to max_hop? Then set hop=0. Do you want to
terminate the crawl at that link? Then set hop=max_hop.
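In code the choice is just the starting value for an off-site outlink; a sketch (whether to restart would come from config in a real patch):

public class HopPolicy {
  // Hop value for an outlink found on a page whose hop is parentHop.
  public static int hopForOutlink(int parentHop, boolean external,
                                  boolean restartOnExternal, int maxHop) {
    if (!external) return parentHop + 1;   // internal link: one level deeper
    return restartOnExternal ? 0 : maxHop; // external: full depth again, or stop
  }
}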
I talked with a friend about this and here is what we came up with. Let's
say that manually injected URLs are good, checked by a human, and you
probably want to start from them, so setting hop to 0 at injection is OK.
While crawling we have some sort of filtering by host (regexp etc.), so
we don't need to worry about URLs that are not on our list; their hop can
be set to whatever, maybe to max_hop.
But here is a scenario: we add foo.com and bar.com at injection. After
crawling we find on the foo.com site a link to bar.com/hop/hop/index.html.
We can set this URL's hop to 0 or to max_hop, because we can update it
later when we find the URL on the bar.com site itself.
Checking the hop needs to be done while updating, I think, so we don't
end up with a bunch of URLs having a hop greater than max_hop.
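For the updating step, something along these lines (again only a sketch): when the same URL is seen with several hop values, e.g. bar.com/hop/hop/index.html reached both through foo.com and through bar.com's own pages, keep the smallest one, and drop the URL if even its best hop exceeds max_hop:

public class HopMerge {
  // Returns the hop to store, or -1 if the URL should be dropped.
  public static int mergeHops(Iterable<Integer> observedHops, int maxHop) {
    int best = Integer.MAX_VALUE;
    for (int h : observedHops) {
      best = Math.min(best, h); // the shortest path to a URL wins
    }
    return (best > maxHop) ? -1 : best;
  }
}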
I will try to make a decent patch for this so it can be checked, and if
others have any ideas, please comment on this.
regards
Uros