Michael,

On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> 1) bot-traps problem for OC
>
> If we have a crawling depth for each starting host, it seems that
> the crawl will terminate in the end (we can decrement the depth
> value each time an outlink falls within the same host domain).
>
> Let me know if my thought is wrong.
Correct. Limiting crawls by depth is probably the simplest way of
avoiding death by bot-traps. There are other methods, though, such as
assigning credits to hosts and adapting fetchlist scheduling according
to credit usage, or flagging recurring path elements as suspect.

> 2) refetching
>
> If OC's fetchlist is online (memory-resident), then the next time we
> refetch we have to restart from seeds.txt once again. Is that right?

Maybe with the current implementation. But if you implement a
CrawlSeedSource that reads in the FetcherOutput directory in the Nutch
segment, then you can seed a crawl using what's already been fetched.

> 3) page content checking
>
> In the OC API, I found WebDBContentSeenFilter, which uses the Nutch
> webdb data structure to see if the fetched page content has been seen
> before. That means we have to use Nutch to create a webdb (maybe with
> nutch/updatedb) in order to support this function. Is that right?

Exactly right.

k
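P.S. In case it helps, here's a rough sketch of the depth-decrement idea
from (1) in plain Java. This is not OC's actual API; the in-memory link
graph and the policy of dropping cross-host outlinks are just assumptions
for illustration. The point is that because every same-host hop spends one
unit of a finite budget, the crawl terminates even if the site contains a
link cycle (a bot trap).

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitedCrawl {

    /** A URL paired with its remaining depth budget. */
    record Task(String url, int depthLeft) {}

    /**
     * Breadth-first "crawl" over an in-memory link graph. The seed starts
     * with maxDepth; each same-host outlink inherits depthLeft - 1, so
     * every branch terminates even when the graph contains cycles.
     * Cross-host outlinks are simply dropped (a hypothetical policy,
     * matching a per-starting-host crawl).
     */
    static Set<String> crawl(Map<String, List<String>> links,
                             String seed, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<Task> queue = new ArrayDeque<>();
        queue.add(new Task(seed, maxDepth));
        String seedHost = URI.create(seed).getHost();
        while (!queue.isEmpty()) {
            Task t = queue.poll();
            if (!visited.add(t.url())) continue;   // already fetched
            if (t.depthLeft() == 0) continue;      // budget exhausted
            for (String out : links.getOrDefault(t.url(), List.of())) {
                if (seedHost.equals(URI.create(out).getHost())) {
                    queue.add(new Task(out, t.depthLeft() - 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Two pages linking to each other -- a cycle that would trap a
        // naive crawler -- plus one cross-host link that gets dropped.
        Map<String, List<String>> links = Map.of(
            "http://a.com/1", List.of("http://a.com/2", "http://b.com/x"),
            "http://a.com/2", List.of("http://a.com/1"));
        Set<String> seen = crawl(links, "http://a.com/1", 3);
        System.out.println(seen);  // terminates despite the cycle
    }
}
```

The same bookkeeping works whether the frontier is in memory or on disk;
only the decrement-and-compare on the depth field matters.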
