There is a paper on this: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz

On Thu, Nov 5, 2009 at 4:33 PM, Ricky Ho <[email protected]> wrote:

> I think the current form of Hadoop is not designed for stream-based
> processing, where data streams in continuously and immediate (low-latency)
> processing is required. Please correct me if I am wrong.
>
> The main reason is that the Reduce phase cannot start until the Map
> phase is complete. This forces the data stream to be broken into chunks,
> with processing conducted in a batch-oriented fashion.
>
> But why can't we just remove the constraint and let Reduce start before
> Map is complete? What do we lose? Yes, there are some things we'll lose ...
>
> 1) Keys arriving at the same reduce task are sorted. If we start Reduce
> processing before all the data arrives, we cannot maintain the sort order
> anymore, because some of the data hasn't arrived yet.
>
> 2) If the Map process crashes in the middle of processing an input file, we
> don't know where to resume the processing. If the Reduce process crashes,
> the result data can be lost as well.
>
> But most stream-processing analytic applications don't require the
> above. If my reduce function is commutative and associative, then I can
> perform an incremental reduce as the data streams in.
>
> Imagine a large social network site that runs on a server farm. Each
> server has an agent process to track user behavior (which items are being
> searched, which photos are uploaded ... etc) across all the servers.
>
> Let's say the social site wants to analyze this user activity, which comes in
> as data streams from many servers. So I want each server to run a Map
> process that emits the user key (or product key) to a group of reducers, which
> compute the analytics.
>
> Can't this kind of processing be run in Map/Reduce without the need for
> the Reduce to wait for the Map to finish?
>
> Does it make sense? Am I missing something important?
>
> Rgds,
> Ricky
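
To make the incremental-reduce idea in the quoted message concrete, here is a minimal Python sketch. It is not Hadoop code and does not use any Hadoop API; the class name `IncrementalReducer`, the helper `run`, and the sample event keys are all hypothetical, chosen only for illustration. It assumes the reduce function is commutative and associative (addition here), so each record can be folded into a per-key running result the moment it arrives, regardless of arrival order.

```python
import operator
from typing import Callable, Dict, Generic, Hashable, Iterable, Tuple, TypeVar

V = TypeVar("V")


class IncrementalReducer(Generic[V]):
    """Hypothetical reducer that folds (key, value) pairs as they arrive.

    Correct only if `combine` is commutative and associative, because no
    ordering of the incoming records is assumed.
    """

    def __init__(self, combine: Callable[[V, V], V], identity: V) -> None:
        self._combine = combine
        self._identity = identity
        self._state: Dict[Hashable, V] = {}

    def feed(self, key: Hashable, value: V) -> None:
        # Fold the new value into the running result for this key.
        current = self._state.get(key, self._identity)
        self._state[key] = self._combine(current, value)

    def snapshot(self) -> Dict[Hashable, V]:
        # Partial results can be read at any time, before the stream ends.
        return dict(self._state)


def run(stream: Iterable[Tuple[Hashable, int]]) -> Dict[Hashable, int]:
    """Feed a stream of (key, count) records through a summing reducer."""
    reducer: IncrementalReducer[int] = IncrementalReducer(operator.add, 0)
    for key, value in stream:
        reducer.feed(key, value)
    return reducer.snapshot()


# Made-up events emitted by the "map" agents on two servers.
server_a = [("search:shoes", 1), ("upload:photo", 1), ("search:shoes", 1)]
server_b = [("search:shoes", 1), ("search:hats", 1)]

# Same records, two different arrival orders.
in_order = run(server_a + server_b)
interleaved = run(
    [server_b[0], server_a[0], server_a[1], server_b[1], server_a[2]]
)
```

Both arrival orders produce the same per-key totals, which is the property that lets the reduce run before the map side has finished. The sketch deliberately gives up the two things listed in the quoted message: keys are not seen in sorted order, and the in-memory state is lost if the process crashes.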
