hi Kelvin:

When we think about refetching: say we have a list of fetched URLs saved locally (in fetchedURLs?). Then, the next day, we only need to look at the URLs in that list to do the fetching; we don't necessarily need to start from seeds.txt.

But then we lose the meaning of depth for an individual URL, in two respects: 1) we didn't keep the depth of an individual site; we only know the depth of a child page on the fly. Of course, we could save a depth value with each URL in the local file, at the cost of a little overhead, but then we might hit the second problem: 2) the hierarchy of a site may be changed by the webmaster, for example during site maintenance, or on purpose, etc., so a saved depth can go stale.

So another idea came to mind: we only distinguish sites as "in-domain" or "out-link", and we add a flag for each URL saved locally. We can assume that a normal site has a limited depth, say 100. When we do refetching, for an in-domain site we don't care about its original depth from the previous fetch; we just crawl it to at most that depth (100 levels), because we treat all the content of an in-domain site as valuable. For an out-link site, we only fetch it once and get the content of its home page. (A rough sketch of this scheme is below my signature.)

What do you think?

thanks,
Michael Ji
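P.S. To make the idea concrete, here is a rough sketch of the flag-based refetch loop. This is not Nutch code; the fetchedURLs.txt file name, the "url<TAB>in|out" line format, MAX_IN_DOMAIN_DEPTH, and the fetchPage/extractInDomainLinks helpers are all placeholders I made up for illustration, and the actual fetching and parsing are stubbed out:

    // RefetchSketch.java -- illustration only, all names are hypothetical.
    import java.io.*;
    import java.net.URL;
    import java.util.*;

    public class RefetchSketch {

        // Assumed cap on how deep a "normal" site goes (the 100 from the mail).
        static final int MAX_IN_DOMAIN_DEPTH = 100;

        // One line of the local fetchedURLs file: the URL plus a single flag.
        record FetchedUrl(String url, boolean inDomain) {}

        public static void main(String[] args) throws IOException {
            List<FetchedUrl> fetched = load("fetchedURLs.txt");  // hypothetical file

            for (FetchedUrl f : fetched) {
                if (f.inDomain()) {
                    // In-domain: everything is considered valuable, so re-crawl the
                    // site breadth-first, ignoring the old per-URL depth, but never
                    // going deeper than MAX_IN_DOMAIN_DEPTH.
                    crawlSite(f.url(), MAX_IN_DOMAIN_DEPTH);
                } else {
                    // Out-link: fetch the home page once and stop.
                    fetchPage(homePageOf(f.url()));
                }
            }
        }

        // Load "url<TAB>in|out" lines; this on-disk format is an assumption.
        static List<FetchedUrl> load(String path) throws IOException {
            List<FetchedUrl> out = new ArrayList<>();
            try (BufferedReader r = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = r.readLine()) != null) {
                    String[] parts = line.split("\t");
                    boolean inDomain = parts.length > 1 && "in".equals(parts[1]);
                    out.add(new FetchedUrl(parts[0], inDomain));
                }
            }
            return out;
        }

        // Breadth-first crawl limited by depth; link extraction is stubbed out.
        static void crawlSite(String start, int maxDepth) {
            Deque<Map.Entry<String, Integer>> queue = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            queue.add(Map.entry(start, 0));
            while (!queue.isEmpty()) {
                var e = queue.poll();
                if (e.getValue() > maxDepth || !seen.add(e.getKey())) continue;
                String html = fetchPage(e.getKey());
                for (String child : extractInDomainLinks(html, e.getKey())) {
                    queue.add(Map.entry(child, e.getValue() + 1));
                }
            }
        }

        // Reduce any URL of an out-link site to its home page.
        static String homePageOf(String url) {
            try {
                URL u = new URL(url);
                return u.getProtocol() + "://" + u.getHost() + "/";
            } catch (Exception ex) {
                return url;
            }
        }

        // Placeholders: a real crawler would do HTTP fetching and HTML parsing here.
        static String fetchPage(String url) { System.out.println("fetch " + url); return ""; }
        static List<String> extractInDomainLinks(String html, String base) { return List.of(); }
    }

The point is only that a single in-domain/out-link flag per URL replaces the saved per-URL depth, and the depth cap is applied uniformly to in-domain sites at refetch time.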
