Hi Kelvin,

When we consider refetching: say we have a list of
fetched URLs saved locally (in fetchedURLs?).

Then, when we turn to the next day, we only need to
look at the URLs in that list to do the fetching; we
don't necessarily need to start from seeds.txt.
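
To make it concrete, here is a rough sketch in Java of
what I mean. The fetchedURLs file name is from above,
but the one-URL-per-line format is just my assumption:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class RefetchFromList {
        public static void main(String[] args) throws IOException {
            // Read yesterday's fetched-URL list instead of seeds.txt.
            List<String> urls = new ArrayList<String>();
            BufferedReader in = new BufferedReader(new FileReader("fetchedURLs"));
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().length() > 0)
                    urls.add(line.trim());   // assumed format: one URL per line
            }
            in.close();
            // hand 'urls' straight to the fetcher
            System.out.println(urls.size() + " URLs to refetch");
        }
    }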

But then we lose the meaning of depth for an
individual URL, in two respects: 1) we didn't keep the
depth of each individual site; we only know the depth
of a child site on the fly. Of course, we could save a
depth value with each URL in the local file, at the
cost of a little overhead, but then we may hit a
second problem: 2) the hierarchical structure of a
site can be changed by the webmaster, for example
during site maintenance, or on purpose, etc.

So another idea came to mind: we only distinguish
sites as "in-domain" or "out-link". We could keep a
flag for each URL saved locally, and we can assume
that a normal site has a limited depth, say 100.
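
Each saved entry could then look something like this
(just a sketch; the field names and the cap of 100 are
my assumptions):

    class FetchedUrl {
        static final int MAX_DEPTH = 100;  // assumed cap for a "normal" site

        final String url;
        final boolean inDomain;   // true = in-domain, false = out-link

        FetchedUrl(String url, boolean inDomain) {
            this.url = url;
            this.inDomain = inDomain;
        }
    }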

When we refetch, for an in-domain site we don't care
about its original depth in the previous fetch; we
just crawl it to a depth of at most 100, because we
treat all the content of an in-domain site as
valuable. For an out-link site, we fetch only once and
get the content of its home page.
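
So the refetch decision would be something like the
sketch below; fetchToDepth() and fetchHomePageOnly()
are hypothetical stand-ins for whatever the real
fetcher calls end up being:

    class RefetchPolicy {
        static final int MAX_DEPTH = 100;  // same assumed cap as above

        static void refetch(String url, boolean inDomain) {
            if (inDomain) {
                // all content of an in-domain site is treated as valuable
                fetchToDepth(url, MAX_DEPTH);
            } else {
                // out-link: fetch once, home page only
                fetchHomePageOnly(url);
            }
        }

        // hypothetical helpers, not real Nutch API
        static void fetchToDepth(String url, int depth) { /* ... */ }
        static void fetchHomePageOnly(String url) { /* ... */ }
    }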

What do you think?

Thanks,

Michael Ji
