There is a paper on this:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Thu, Nov 5, 2009 at 4:33 PM, Ricky Ho <[email protected]> wrote:

> I think the current form of Hadoop is not designed for stream-based
> processing, where data streams in continuously and immediate (low-latency)
> processing is required.  Please correct me if I am wrong.
>
> The main reason is that the Reduce phase cannot start until the Map
> phase is complete.  This forces the data stream to be broken into chunks
> and processed in a batch-oriented fashion.
>
> But why can't we just remove the constraint and let Reduce start before
> Map is complete?  What do we lose?  Yes, there are some things we would lose:
>
> 1) Keys arriving at the same reduce task are sorted.  If we start Reduce
> processing before all the data arrives, we can no longer maintain the sort
> order because some of the data hasn't arrived yet.
>
> 2) If the Map process crashes in the middle of processing an input file, we
> don't know where to resume the processing.  If the Reduce process crashes,
> the result data can be lost as well.
>
> But most stream-processing analytic applications don't require the
> above.  If my reduce function is commutative and associative, then I can
> perform an incremental reduce as the data streams in.
>
> Imagine a large social network site that runs on a server farm.  Each
> server has an agent process to track user behavior (what items are being
> searched, what photos are uploaded, etc.) across all the servers.
>
> Let's say the social site wants to analyze this user activity, which comes
> in as data streams from many servers.  So I want each server to run a Map
> process that emits the user key (or product key) to a group of reducers
> which compute the analytics.
>
> Can't this kind of processing be run in Map/Reduce without the Reduce
> having to wait for the Map to finish?
>
> Does this make sense?  Am I missing something important?
>
> Rgds,
> Ricky
>
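
The incremental reduce described in the quoted message can be sketched as
follows. This is a minimal, single-process illustration in Python, not Hadoop
code and not the approach taken in the linked paper. The names
IncrementalReducer, map_event, and run are hypothetical, and the sample events
are made up. It assumes only that the combine function is commutative and
associative, so each record can be folded into the running result as soon as
it arrives, in any order.

```python
from operator import add
from typing import Any, Callable, Dict, Hashable, Iterable, Tuple


class IncrementalReducer:
    """Folds (key, value) pairs into per-key state as they arrive."""

    def __init__(self, combine: Callable[[Any, Any], Any]) -> None:
        # Assumption: combine is commutative and associative, so the
        # arrival order of records does not change the final result.
        self.combine = combine
        self.state: Dict[Hashable, Any] = {}

    def feed(self, key: Hashable, value: Any) -> None:
        # Merge one record into the running result immediately,
        # without waiting for the rest of the stream.
        if key in self.state:
            self.state[key] = self.combine(self.state[key], value)
        else:
            self.state[key] = value


def map_event(event: Tuple[str, str]) -> Tuple[str, int]:
    # Hypothetical map step: turn a (user, action) event into (user, 1).
    user, _action = event
    return user, 1


def run(events: Iterable[Tuple[str, str]],
        combine: Callable[[Any, Any], Any]) -> Dict[Hashable, Any]:
    # Map each event and reduce it right away, one record at a time.
    reducer = IncrementalReducer(combine)
    for event in events:
        key, value = map_event(event)
        reducer.feed(key, value)
    return reducer.state


if __name__ == "__main__":
    events = [("alice", "search"), ("bob", "upload"), ("alice", "upload")]
    print(run(events, add))
```

Because addition is commutative and associative, feeding the same events in a
different order yields the same per-user counts. The sketch keeps all state in
memory and does nothing about the fault-tolerance concern raised in point 2.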
