Michael,

On Sun, 28 Aug 2005 08:31:29 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> Just a curious question.
>
> As I understand it, the goal of Nutch's global crawling capability is to
> reach 10 billion pages, based on the MapReduce implementation.
>
> OC, seeming to fall in the middle, is aimed at controlled industry-domain
> crawling. How many sites is its goal? Dealing with a couple of thousand
> sites?
>

The goal of OC is to facilitate focused crawling. I see at least 2 kinds of 
focused crawling:

1. Whole-web focused crawling, like spidering all pages/sites on the WWW related 
to research publications on leukemia.
2. Comprehensively crawling a given list of URLs/sites, like Teleport Pro does.

Although OC was designed with scenario #2 in mind, I think it would also be 
suitable for scenario #1.
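
To make the distinction concrete, here is a minimal sketch of how the two crawl 
scopes might be expressed in code. The CrawlScope interface and both 
implementations are hypothetical illustrations, not anything taken from OC:

    import java.net.URL;
    import java.util.Set;

    // Hypothetical illustration only -- not part of OC.
    interface CrawlScope {
        boolean accept(URL url, String pageText);
    }

    // Scenario #1: whole-web focused crawl -- accept any URL whose content looks
    // on-topic (a trivial keyword test standing in for a real topic classifier).
    class TopicScope implements CrawlScope {
        public boolean accept(URL url, String pageText) {
            return pageText != null && pageText.toLowerCase().contains("leukemia");
        }
    }

    // Scenario #2: comprehensive crawl of a fixed site list -- accept any URL
    // whose host appears in the seed list, regardless of page content.
    class SiteListScope implements CrawlScope {
        private final Set<String> hosts;

        SiteListScope(Set<String> hosts) {
            this.hosts = hosts;
        }

        public boolean accept(URL url, String pageText) {
            return hosts.contains(url.getHost());
        }
    }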

If the size of the crawl is a concern, I don't think it'd be difficult to build 
in a throttling mechanism to ensure that the in-memory data structures don't 
grow too large (see the sketch below).
I've been travelling around a lot lately, so I haven't had a chance to test OC 
on crawls > 200k pages.
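
The kind of throttle mentioned above might look something like this rough 
sketch, where a bounded queue makes link discovery block rather than letting 
the frontier grow without bound. The names are invented for the example and 
are not OC's:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Illustrative only -- class and method names are not OC's.
    class BoundedFrontier {
        private final BlockingQueue<String> urls;

        BoundedFrontier(int maxUrlsInMemory) {
            this.urls = new ArrayBlockingQueue<String>(maxUrlsInMemory);
        }

        // Blocks the caller when the in-memory frontier is full, throttling
        // link discovery instead of letting the heap grow unbounded.
        void add(String url) throws InterruptedException {
            urls.put(url);
        }

        String next() throws InterruptedException {
            return urls.take();
        }
    }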


> I believe the key requirement for industry domain crawling is timely
> updating, so identifying the content of fetched pages and saving
> post-parsing time is critical.
>

I agree. High on my todo list are:

1. Refetching using If-Modified-Since (a rough sketch follows this list)
2. Using an alternate link extractor if NekoHTML turns out to be a bottleneck
3. Parsing downloaded pages to extract data into databases to facilitate 
aggregation, like defining a site template that maps HTML pages to database 
columns (think job sites, for example)
4. Moving post-fetch processing into a separate thread if it proves to be a 
bottleneck (see the second sketch after the list)
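
For item #1, a conditional refetch might look something like the following, 
built on plain HttpURLConnection rather than OC's actual fetcher; the class 
and method names are invented for illustration:

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Illustrative only; OC's real fetcher API will differ.
    class ConditionalFetcher {
        // Returns null when the server answers 304 Not Modified, i.e. the copy
        // fetched at lastFetchMillis is still current and needs no re-parse.
        static byte[] fetchIfModified(String url, long lastFetchMillis) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setIfModifiedSince(lastFetchMillis);
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return null;
            }
            InputStream in = conn.getInputStream();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            in.close();
            return out.toByteArray();
        }
    }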
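And for item #4, one simple way to decouple post-fetch processing from 
fetching is a producer/consumer hand-off; again, the names below are 
illustrative only:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative only; names are invented for this sketch.
    class PostFetchProcessor implements Runnable {
        private final BlockingQueue<byte[]> fetchedPages = new LinkedBlockingQueue<byte[]>();

        // Called from the fetcher thread; returns immediately, so fetching is
        // never blocked by parsing.
        void submit(byte[] page) {
            fetchedPages.add(page);
        }

        // Runs in its own thread, draining the queue and doing the heavy work.
        public void run() {
            try {
                while (true) {
                    byte[] page = fetchedPages.take();
                    parseAndStore(page); // link extraction, indexing, etc.
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down cleanly
            }
        }

        private void parseAndStore(byte[] page) {
            // placeholder for the existing post-fetch logic
        }
    }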

k
