Andrzej Bialecki wrote:
Uroš Gruber wrote:
I made a draft patch, but there are still some problems I see. I
know the code needs to be cleaned up and tested. Right now I don't
know what value to set for external URLs; for internal links it works great.
(The patch changes CrawlDatum itself; I think it would be better to
put the hop counter in CrawlDatum.metaData.)
I can try to do it with metaData.
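To make the metaData idea concrete, here is a minimal sketch (untested, and the "_hop_" key name is just something I made up), assuming CrawlDatum exposes its metaData as a Hadoop MapWritable via getMetaData():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class HopUtil {
  // Hypothetical key for the hop counter in CrawlDatum.metaData.
  private static final Text HOP_KEY = new Text("_hop_");

  // A missing key is treated as hop 0, which matches injected URLs.
  public static int getHop(CrawlDatum datum) {
    MapWritable meta = datum.getMetaData();
    IntWritable hop = (meta == null) ? null : (IntWritable) meta.get(HOP_KEY);
    return (hop == null) ? 0 : hop.get();
  }

  public static void setHop(CrawlDatum datum, int hop) {
    datum.getMetaData().put(HOP_KEY, new IntWritable(hop));
  }
}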
Here is the whole idea of these changes:
Injected URLs always get hop 0. While fetching/updating/generating,
the hop value is incremented by 1 (still no idea what to do with
external links). Then I can add a config value, max_hop etc., to stop
the fetcher and generator from creating more URLs past that depth.
This way it's possible to limit crawling vertically.
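As a rough sketch of the limiting side (reusing the HopUtil helper above, with "crawl.max.hop" as a made-up property name), the generator/fetcher could simply skip anything already at the limit:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.CrawlDatum;

public class HopFilter {
  // True if this URL may still produce new URLs; a negative
  // crawl.max.hop means the limit is disabled.
  public static boolean underMaxHop(CrawlDatum datum, Configuration conf) {
    int maxHop = conf.getInt("crawl.max.hop", -1);
    if (maxHop < 0) return true;
    return HopUtil.getHop(datum) < maxHop;
  }
}

New outlinks would then get HopUtil.setHop(outlinkDatum, HopUtil.getHop(parentDatum) + 1).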
Comments are welcome.
Well, it really depends on what you want to do when you encounter an
external link. Do you want to restart the counter, i.e. crawl the new
site at full depth up to max_hop? Then set hop=0. Do you want to
terminate the crawl at that link? Then set hop=max_hop.
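In code the choice is just the starting value for an off-site outlink; a sketch (whether to restart would come from config in a real patch):

public class HopPolicy {
  // Hop value for an outlink found on a page whose hop is parentHop.
  public static int hopForOutlink(int parentHop, boolean external,
                                  boolean restartOnExternal, int maxHop) {
    if (!external) return parentHop + 1;   // internal link: one level deeper
    return restartOnExternal ? 0 : maxHop; // external: full depth again, or stop
  }
}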
I talked with a friend about this and here is what we came up with. Let's
say that manually injected URLs are good, checked by a human, and you
probably want to start from them, so setting hop to 0 at injection is OK.
While crawling we have some sort of filtering by host (regexp etc.), so
we don't need to worry about URLs that are not on our list; their hop can
be set to whatever, maybe to max_hop.
But here is a scenario: we add foo.com and bar.com at injection. After
crawling we find on the foo.com site a link to bar.com/hop/hop/index.html.
We can set this URL's hop to 0 or to max_hop, because we can update it
later when we find the URL on the bar.com site itself.
Checking the hop needs to be done while updating, I think, so we don't
end up with a bunch of URLs having a hop greater than max_hop.
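For the updating step, something along these lines (again only a sketch): when the same URL is seen with several hop values, e.g. bar.com/hop/hop/index.html reached both through foo.com and through bar.com's own pages, keep the smallest one, and drop the URL if even its best hop exceeds max_hop:

public class HopMerge {
  // Returns the hop to store, or -1 if the URL should be dropped.
  public static int mergeHops(Iterable<Integer> observedHops, int maxHop) {
    int best = Integer.MAX_VALUE;
    for (int h : observedHops) {
      best = Math.min(best, h); // the shortest path to a URL wins
    }
    return (best > maxHop) ? -1 : best;
  }
}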
I will try to make a decent patch for this so it can be checked, and if
others have any ideas, please comment on this.
regards
Uros