Hi Kelvin:

I think my previous email about "further concerning
of controlled crawling" may have confused you a bit;
my thinking wasn't fully formed yet. But I still believe
controlled crawling is very important for an efficient
vertical crawling application in general.
After reviewing our previous discussion, I think the
solutions for bot-traps and refetching in OC might
be combined into one.
1) Refetching will look at the FetcherOutput of the
last run and queue the URLs according to their domain
name (for the HTTP 1.1 protocol), as your
FetcherThread does.
2) We might just count the number of URLs within the
same domain (on the fly, as they are queued?). If that
number exceeds a certain threshold, we stop adding new
URLs for that domain. This is equivalent to controlled
crawling, but in the sense of "width".
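To make step 2 concrete, here is a minimal sketch of a per-domain capped queue. The class name `DomainCappedQueue`, the threshold value, and the use of `java.net.URL` to extract the host are all my own assumptions for illustration, not anything that exists in Nutch or OC today:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: queue URLs per domain, and silently stop
// accepting new URLs for a domain once a threshold is reached.
public class DomainCappedQueue {
    private final int maxUrlsPerDomain; // assumed configurable threshold
    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Deque<String>> queues = new HashMap<>();

    public DomainCappedQueue(int maxUrlsPerDomain) {
        this.maxUrlsPerDomain = maxUrlsPerDomain;
    }

    /** Queue a URL under its host; drop it once the per-domain cap is hit. */
    public boolean offer(String url) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (MalformedURLException e) {
            return false; // skip unparsable URLs
        }
        int seen = counts.getOrDefault(host, 0);
        if (seen >= maxUrlsPerDomain) {
            return false; // cap reached: stop adding URLs for this domain
        }
        counts.put(host, seen + 1);
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
        return true;
    }

    /** Number of URLs currently queued for the given host. */
    public int queuedFor(String host) {
        Deque<String> q = queues.get(host);
        return q == null ? 0 : q.size();
    }
}
```

Counting "on the fly" like this means a bot-trap that generates endless links can add at most `maxUrlsPerDomain` URLs to the fetchlist, which is the "width" limit described above.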
Will it work as proposed?
Thanks,
Michael Ji
--- Kelvin Tan <[EMAIL PROTECTED]> wrote:
> Michael,
>
> On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji
> wrote:
> > Hi Kelvin:
> >
> > 2) refetching
> >
> > If OC's fetchlist is online (memory residence),
> the next time
> > refetch we have to restart from seeds.txt once
> again. Is it right?
> >
>
> Maybe with the current implementation. But if you
> implement a CrawlSeedSource that reads in the
> FetcherOutput directory in the Nutch segment, then
> you can seed a crawl using what's already been
> fetched.
>
>