Thanks for your answers, Doug. The approaches you presented for improving performance should solve most of the problems. However, one problem I imagine will arise with smaller fetch tasks is that concurrency at the end of *each* task will drop dramatically, because some domains contain more pages than others, or are slower than others. This problem will be worse with smaller tasks because there are now more tasks. So instead of a maximum fetch time for each task, it may make sense to just cut off segment fetching when concurrency drops below a threshold. With this, one may not need small segments at all.
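For example, something like the following could watch the threads fetching one segment and abandon the tail once only a few slow ones remain. This is just a sketch; the class name, threshold, and ten-second poll interval are made up, not existing Nutch code:

    import java.util.List;

    // Watches the threads fetching one segment and gives up on the tail
    // once too few of them are still busy.
    public class ConcurrencyCutoff extends Thread {
      private final List<Thread> fetchers;  // threads fetching this segment
      private final int minActive;          // abort below this threshold

      public ConcurrencyCutoff(List<Thread> fetchers, int minActive) {
        this.fetchers = fetchers;
        this.minActive = minActive;
      }

      public void run() {
        try {
          while (true) {
            Thread.sleep(10000);            // poll every ten seconds
            int active = 0;
            for (Thread t : fetchers) {
              if (t.isAlive()) active++;
            }
            if (active == 0) return;        // segment finished normally
            if (active < minActive) {       // only slow domains are left
              for (Thread t : fetchers) {
                t.interrupt();              // cut off the rest; unfetched
              }                             // urls can simply stay queued
              return;
            }
          }
        } catch (InterruptedException e) {
          // the monitor itself was stopped; nothing to clean up
        }
      }
    }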
Regarding my original proposal, I understand that batch update of the db is important. The DB update is, as I understand it, currently a major bottleneck. So I didn't mean updating the DB after fetching each segment. Actually, one can update the DB periodically, e.g. daily, with all segments finished during the day. The same applies to new segment generation. Presumably this could also be done with a MapReduce-based fetcher.

- Feng

On Tue, 29 Mar 2005 09:39:46 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> > What about running one fetcher on each node 24/7? Each fetcher would
> > take segments from a global queue. Other parts of the system do not
> > have to wait until the to-fetch queue is depleted before doing the DB
> > update and new segment generation. So basically adding a queue will
> > allow pipelining of the time-consuming work, namely fetching, db
> > update and segment generation. And we will not end up waiting for one
> > or two fetchers to finish their job.
>
> One could do that, but I think the same effect can be achieved with (2)
> above. Consider adding a db page status named FETCHING. When a fetch
> list is created (step 4 of the document I posted) one avoids generating
> urls whose status is FETCHING, and rewrites db entries for generated
> pages to FETCHING. This status is overwritten when the db is updated
> with fetcher output. If a url's status is FETCHING for more than a
> specified amount of time (e.g., one week) then it will be re-generated
> anyway.
>
> Here one bootstraps by generating multiple initial segments, and all
> urls are marked FETCHING in the db. All segments are submitted as
> fetcher jobs in parallel. As each fetcher job completes, the database
> is updated with its output and one or more new fetcher jobs are
> generated.
>
> The page db file is a queue. Updates to the queue are batched, which
> optimizes i/o. Each url added to the queue must be checked against the
> queue before it can be generated. Representing the queue with a B-tree
> would permit incremental updates, but would require log(N) disk seeks
> per outlink, which is much too slow. Thus, to be efficient, the queue
> must buffer new urls and periodically sort and merge them with the set
> of known urls. This is just a page db update.
>
> Is this convincing?
>
> Doug
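Doug's FETCHING idea seems straightforward to express in code. A rough sketch (the Page class and status fields here are made up for illustration, not the actual Nutch db schema):

    // Skip urls already handed to a fetcher job, unless the mark is stale.
    public class GenerateFilter {
      static final int FETCHING = 1;
      static final long MAX_FETCHING_MS = 7L * 24 * 60 * 60 * 1000; // a week

      static boolean shouldGenerate(Page p, long now) {
        return !(p.status == FETCHING && now - p.statusTime < MAX_FETCHING_MS);
      }

      // Rewrite a generated page so later fetch lists skip it.
      static void markFetching(Page p, long now) {
        p.status = FETCHING;     // overwritten when fetcher output is merged
        p.statusTime = now;
      }

      static class Page {
        int status;
        long statusTime;         // when the status was last set
      }
    }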

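And for the queue itself, the buffered sort-and-merge Doug describes could look roughly like this toy in-memory version (the real db would make the same single sequential pass over sorted files on disk):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class PageQueueMerge {
      // Merge a batch of newly discovered urls into the sorted known set,
      // dropping urls the db has already seen: one sort of the small batch
      // plus one linear pass, with no per-url disk seeks.
      static List<String> merge(List<String> known, List<String> batch) {
        Collections.sort(batch);
        List<String> out = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < known.size() || j < batch.size()) {
          String a = i < known.size() ? known.get(i) : null;
          String b = j < batch.size() ? batch.get(j) : null;
          if (b == null || (a != null && a.compareTo(b) <= 0)) {
            out.add(a);
            i++;
            while (j < batch.size() && batch.get(j).equals(a)) {
              j++;              // already known: drop the duplicate
            }
          } else {
            out.add(b);         // first sighting of this url
            j++;
            while (j < batch.size() && batch.get(j).equals(b)) {
              j++;              // collapse duplicates within the batch
            }
          }
        }
        return out;
      }
    }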