A question about the fetching MapReduce process: is it possible that some segments will happen to be slower than others and thus hold up the whole job? The problem will probably get worse with more fetch nodes, which is what we're aiming at.
What about running one fetcher on each node 24/7? Each fetcher would take segments from a global queue. Other parts of the system would not have to wait until the to-fetch queue is depleted before doing the DB update and generating new segments. So basically, adding a queue allows pipelining of the time-consuming work, namely fetching, DB update, and segment generation, and we would not end up waiting for one or two fetchers to finish their job.

- Feng Zhou
Grad Student, CS, UC Berkeley

On Mon, 28 Mar 2005 11:36:47 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> A few weeks ago I drafted the attached document, discussing how
> MapReduce might be used in Nutch. This is an incomplete, exploratory
> document, not a final design. Most of Nutch's file formats are altered.
> Every operation is implemented with MapReduce. To run things on a
> single machine we can automatically start a job tracker and one or more
> task trackers, all running in the same JVM. Hopefully this will not be
> much slower than the current implementation running on a single machine.
>
> Comments?
>
> Doug

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
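The global-queue idea above can be sketched in Java with `java.util.concurrent`. This is only an illustrative toy, not Nutch code: the class name `FetchPipeline`, the string segment IDs, and the poison-pill shutdown are all assumptions. The point it demonstrates is that long-running fetchers block on a shared queue, so a generator can keep enqueuing new segments while earlier ones are still being fetched, and no round waits for its slowest fetcher before the next round can be prepared.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: one long-running fetcher per node, all pulling
// segment names from a shared queue, so fetching overlaps with DB update
// and segment generation instead of running in lock-step rounds.
public class FetchPipeline {
    static final String POISON = "__DONE__"; // shutdown marker, one per fetcher

    public static List<String> run(List<String> segments, int nFetchers)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> fetched = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch done = new CountDownLatch(nFetchers);

        for (int i = 0; i < nFetchers; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        String seg = queue.take();     // block until work arrives
                        if (seg.equals(POISON)) break; // stop signal
                        fetched.add("fetched:" + seg); // stand-in for real fetching
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }

        // The generator can keep adding segments while fetchers run,
        // instead of waiting for the previous batch to finish.
        for (String seg : segments) queue.put(seg);
        for (int i = 0; i < nFetchers; i++) queue.put(POISON);
        done.await();
        return fetched;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> result = run(List.of("seg1", "seg2", "seg3", "seg4"), 2);
        System.out.println(result.size()); // all 4 segments fetched
    }
}
```

With a queue like this, a slow segment only ties up the one fetcher working on it; the others keep draining the queue, which is the pipelining behavior argued for above.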
