Michael,

On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> 1) bot-traps problem for OC
>
> If we have a crawling depth for each starting host, it seems that
> the crawl will terminate in the end (we can decrement the depth
> value each time an outlink falls within the same host domain).
>
> Let me know if my thought is wrong.
Correct. Limiting crawls by depth is probably the simplest way of
avoiding death by bot-traps. There are other methods, though, such as
assigning credits to hosts and adapting fetchlist scheduling according
to credit usage, or flagging recurring path elements as suspect.

> 2) refetching
>
> If OC's fetchlist is online (memory-resident), then the next time we
> refetch we have to restart from seeds.txt once again. Is that right?

Maybe with the current implementation. But if you implement a
CrawlSeedSource that reads in the FetcherOutput directory in the Nutch
segment, then you can seed a crawl using what's already been fetched.

> 3) page content checking
>
> In the OC API, I found WebDBContentSeenFilter, which uses the Nutch
> webdb data structure to see if the fetched page content has been seen
> before. That means we have to use Nutch to create a webdb (maybe with
> nutch/updatedb) in order to support this function. Is that right?

Exactly right.

k
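P.S. In case it helps, here's a rough sketch of the depth-decrement idea
from (1) in plain Java. This is not OC's actual API; the in-memory link
graph and the policy of dropping cross-host outlinks are just assumptions
for illustration. The point is that because every same-host hop spends one
unit of a finite budget, the crawl terminates even if the site contains a
link cycle (a bot trap).

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitedCrawl {

    /** A URL paired with its remaining depth budget. */
    record Task(String url, int depthLeft) {}

    /**
     * Breadth-first "crawl" over an in-memory link graph. The seed starts
     * with maxDepth; each same-host outlink inherits depthLeft - 1, so
     * every branch terminates even when the graph contains cycles.
     * Cross-host outlinks are simply dropped (a hypothetical policy,
     * matching a per-starting-host crawl).
     */
    static Set<String> crawl(Map<String, List<String>> links,
                             String seed, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<Task> queue = new ArrayDeque<>();
        queue.add(new Task(seed, maxDepth));
        String seedHost = URI.create(seed).getHost();
        while (!queue.isEmpty()) {
            Task t = queue.poll();
            if (!visited.add(t.url())) continue;   // already fetched
            if (t.depthLeft() == 0) continue;      // budget exhausted
            for (String out : links.getOrDefault(t.url(), List.of())) {
                if (seedHost.equals(URI.create(out).getHost())) {
                    queue.add(new Task(out, t.depthLeft() - 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Two pages linking to each other -- a cycle that would trap a
        // naive crawler -- plus one cross-host link that gets dropped.
        Map<String, List<String>> links = Map.of(
            "http://a.com/1", List.of("http://a.com/2", "http://b.com/x"),
            "http://a.com/2", List.of("http://a.com/1"));
        Set<String> seen = crawl(links, "http://a.com/1", 3);
        System.out.println(seen);  // terminates despite the cycle
    }
}
```

The same bookkeeping works whether the frontier is in memory or on disk;
only the decrement-and-compare on the depth field matters.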
