I think the current form of Hadoop is not designed for stream-based processing, 
where data is continuously streaming in and immediate processing (low latency) 
is required.  Please correct me if I am wrong.

The main reason is that the Reduce phase cannot start until the Map phase is 
complete.  This forces the data stream to be broken into chunks and processed 
in a batch-oriented fashion.

But why can't we just remove that constraint and let Reduce start before Map is 
complete?  What do we lose?  Yes, there are some things we'd lose ...

1) Keys arriving at the same reduce task are sorted.  If we start Reduce 
processing before all the data arrives, we can no longer maintain the sort 
order, because some data hasn't arrived yet.

2) If the Map process crashes in the middle of processing an input file, we 
don't know where to resume processing.  If the Reduce process crashes, the 
partial result data can be lost as well.

But most stream-processing analytic applications don't require the above.  If 
my reduce function is commutative and associative, then I can perform an 
incremental reduce as the data streams in, along the lines of the sketch below.
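Here is a minimal sketch of what I mean, assuming a simple per-key count; the 
class and method names are my own illustration, not part of Hadoop's API:

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of an incremental reduce.  Because addition is
// commutative and associative, every arriving (key, value) pair can be
// folded into the running total immediately, without waiting for the
// whole key group to arrive or for the keys to be sorted.
public class IncrementalReduce {
    private final Map<String, Long> totals = new HashMap<>();

    // Called for every record as it streams in from the mappers.
    public void accept(String key, long value) {
        totals.merge(key, value, Long::sum);
    }

    // The result can be read at any time and is always a correct
    // aggregate of everything seen so far.
    public long currentTotal(String key) {
        return totals.getOrDefault(key, 0L);
    }
}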

Imagine a large social network site running on a server farm, where each server 
has an agent process that tracks user behavior (what items are being searched, 
what photos are uploaded, etc.) across all the servers.

Let's say the social site wants to analyze this user activity, which comes in 
as data streams from many servers.  So I want each server to run a Map process 
that emits the user key (or product key) to a group of reducers which compute 
the analytics, roughly as sketched below.
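Again just a sketch with names of my own choosing (not Hadoop's Mapper 
interface): each agent emits a (key, 1) pair and routes it to one of the 
reducers by hashing the key, so the same key always reaches the same reducer, 
which can then aggregate it incrementally as above.

// Hypothetical per-server map step; the Emitter stands in for whatever
// transport ships pairs to the reducer group.
public class ActivityMapper {
    public interface Emitter {
        // Deliver a (key, count) pair to the reducer that owns the key.
        void emit(int reducer, String key, long count);
    }

    private final int numReducers;
    private final Emitter emitter;

    public ActivityMapper(int numReducers, Emitter emitter) {
        this.numReducers = numReducers;
        this.emitter = emitter;
    }

    // Called by the agent for every tracked event (search, upload, ...).
    public void map(String userKey, String activityType) {
        String key = userKey + ":" + activityType;
        int reducer = (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        emitter.emit(reducer, key, 1L);
    }
}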

Couldn't this kind of processing run in Map/Reduce without the Reduce having to 
wait for the Map to finish?

Does this make sense?  Am I missing something important?

Rgds,
Ricky
