Re: Limiting crawl depth using (NUTCH-84) Fetcher for constrained crawls

Kelvin Tan Sat, 27 Aug 2005 22:46:44 -0700

Hey Michael,

On Sat, 27 Aug 2005 21:13:27 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> I started to dig into the code and data structure of OC;
>
> Just curious questions:
>
> 1) Where OC forms a fetchlist? I didn't see it in segment/ of OC
> created.


Crawls are seeded using a CrawlSeedSource. These URLs are injected into the 
respective FetcherThreads' fetchlists.

After the initial seed, URLs are added to the fetchlists from the outlinks of 
parsed pages. OC builds the fetchlist online, whereas Nutch builds its 
fetchlist offline.
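To make the online approach concrete, here is a minimal sketch of the idea, not OC's actual code. The class name OnlineFetchListSketch and its methods are hypothetical, a single queue stands in for the per-thread fetchlists, and a plain map of URL to outlinks stands in for fetching and parsing a page.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; these are not OC's real classes.
class OnlineFetchListSketch {
    private final Queue<String> fetchList = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    /** Queues a URL unless it has been queued before. */
    void add(String url) {
        if (seen.add(url)) {
            fetchList.add(url);
        }
    }

    /**
     * Seeds the fetchlist, then keeps fetching until it is empty.
     * linkGraph is an assumption standing in for "fetch the page and parse its outlinks".
     */
    List<String> crawl(List<String> seeds, Map<String, List<String>> linkGraph) {
        for (String seed : seeds) {
            add(seed);
        }
        List<String> fetched = new ArrayList<>();
        String url;
        while ((url = fetchList.poll()) != null) {
            fetched.add(url);
            List<String> outlinks = linkGraph.getOrDefault(url, Collections.<String>emptyList());
            // Online building: outlinks go straight back into the live fetchlist.
            for (String outlink : outlinks) {
                add(outlink);
            }
        }
        return fetched;
    }
}
```

The point of the sketch is only that the fetchlist grows while the crawl is running, so there is no separate generate step between fetch rounds.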

>
> 2) In OC, FetchList is organized in such a way of URLs sequence per
> host. Then, what if there are too many hosts, saying ten thousand.
> How about I/O performance concern? Will it exceed the system open-
> file limitation?
>

The concern, if any, would be memory, not I/O, because DefaultFetchList 
currently stores everything in memory. Still, the fetchlist is an interface, 
and it would be simple for someone to implement one that bounds its memory 
use, persisting to disk where appropriate.
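A memory-bounded fetchlist along those lines might look roughly like the sketch below. It does not implement OC's real interface, whose methods are not shown here. The class BoundedFetchListSketch, its add/next methods, and the one-URL-per-line spill file are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not OC's FetchList interface or DefaultFetchList.
class BoundedFetchListSketch {
    private final int maxInMemory;
    private final Path spillFile;          // overflow URLs, one per line
    private final Deque<String> memory = new ArrayDeque<>();
    private long consumedLines = 0;        // lines of spillFile already read back
    private long pendingOnDisk = 0;        // lines written but not yet read back

    BoundedFetchListSketch(int maxInMemory, Path spillFile) {
        this.maxInMemory = maxInMemory;
        this.spillFile = spillFile;
    }

    /** Keeps the URL in memory if there is room, otherwise appends it to disk. */
    void add(String url) {
        // Once anything is on disk, new URLs also go to disk to preserve FIFO order.
        if (memory.size() < maxInMemory && pendingOnDisk == 0) {
            memory.add(url);
            return;
        }
        try {
            Files.write(spillFile, (url + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            pendingOnDisk++;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next URL in insertion order, or null when the list is empty. */
    String next() {
        if (memory.isEmpty() && pendingOnDisk > 0) {
            refill();
        }
        return memory.poll();
    }

    /** Reads up to maxInMemory unread URLs from the spill file back into memory. */
    private void refill() {
        try (BufferedReader in = Files.newBufferedReader(spillFile, StandardCharsets.UTF_8)) {
            for (long i = 0; i < consumedLines; i++) {
                in.readLine(); // skip URLs that were already read back
            }
            String line;
            while (memory.size() < maxInMemory && (line = in.readLine()) != null) {
                memory.add(line);
                consumedLines++;
                pendingOnDisk--;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Re-scanning the spill file on every refill is deliberately naive. A real implementation would probably keep a file offset or rotate spill files, but the memory bound is the part that matters here.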

k
