Michael,

On Sun, 28 Aug 2005 08:31:29 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> Just a curious question.
>
> As I know, the goal of Nutch's global crawling ability is to reach 10
> billion pages, based on the implementation of MapReduce.
>
> OC, seeming to fall in the middle, is for controlled industry-domain
> crawling. How many sites is its goal? Dealing with a couple of
> thousand sites?
>
The goal of OC is to facilitate focused crawling. I see at least 2 kinds of
focused crawling:

1. Whole-web focused crawling, like spidering all pages/sites on the WWW
related to research publications on leukemia
2. Crawling a given list of URLs/sites comprehensively, like Teleport Pro

Although OC was designed with scenario #2 in mind, I think it would also be
suitable for scenario #1. If the size of the crawl is a concern, I don't think
it'd be difficult to build in a throttling mechanism to ensure that the
in-memory data structures don't get too large. I've been travelling around a
lot lately, so I haven't had a chance to test OC on crawls > 200k pages.

> I believe the importance for industry domain crawling is in-time
> updating. So identifying the content of a fetched page and saving
> post-parsing time is critical.
>

I agree. High on my todo list are:

1. Refetching using if-modified-since (rough sketch below)
2. Using an alternate link extractor if nekohtml proves to be a bottleneck
3. Parsing downloaded pages to extract data into databases to facilitate
aggregation, like defining a site template to map HTML pages to database
columns (think job sites, for example)
4. Moving post-fetch processing into a separate thread if it turns out to be
a bottleneck (second sketch below)

k
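Re #1, a rough sketch of what the conditional refetch could look like,
assuming plain java.net.HttpURLConnection rather than whatever HTTP client OC
actually wires in; the class and method names are just placeholders, and
lastFetchTime would come from wherever we persist per-URL fetch metadata:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ConditionalFetcher {

    /**
     * Refetch a URL only if it changed since lastFetchTime (millis since
     * epoch). Returns null on 304 Not Modified so the caller can skip
     * parsing entirely.
     */
    public static byte[] fetchIfModified(String url, long lastFetchTime)
            throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        conn.setIfModifiedSince(lastFetchTime); // sends If-Modified-Since header

        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return null; // unchanged: no download, no re-parse
        }

        InputStream in = conn.getInputStream();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
        return out.toByteArray();
    }
}

Not every server honours If-Modified-Since, so ETag/If-None-Match is probably
worth handling as well once this is in place.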

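And for #4, a minimal sketch of the thread handoff, assuming a bounded
java.util.concurrent queue between the fetcher threads and a single post-fetch
worker; FetchedPage and parseAndStore() are stand-ins for whatever OC's actual
types turn out to be:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PostFetchProcessor implements Runnable {

    // Bounded, so fetcher threads block instead of piling pages up in memory
    private final BlockingQueue<FetchedPage> queue =
            new LinkedBlockingQueue<FetchedPage>(1000);

    /** Called by fetcher threads; returns as soon as the page is queued. */
    public void submit(FetchedPage page) throws InterruptedException {
        queue.put(page);
    }

    public void run() {
        try {
            while (true) {
                FetchedPage page = queue.take(); // blocks until work arrives
                parseAndStore(page);             // link extraction, parsing, db writes
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // exit cleanly on shutdown
        }
    }

    private void parseAndStore(FetchedPage page) {
        // placeholder for whatever post-fetch work proves to be the bottleneck
    }

    /** Stand-in for OC's fetch output type. */
    public static class FetchedPage {
        public final String url;
        public final byte[] content;

        public FetchedPage(String url, byte[] content) {
            this.url = url;
            this.content = content;
        }
    }
}

The fetcher threads would just call submit() instead of parsing inline; if
the queue fills up they block, which doubles as a crude version of the
throttling mentioned above.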