split reduce compute phase into two threads: one for reading and another for
computing
-------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1939
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1939
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 0.20.2
            Reporter: wangxiaowei
             Fix For: 0.20.2


It is known that a reduce task is made up of three phases: shuffle, sort, and
reduce. During the reduce phase, the reduce function first reads a record from
disk or memory, then processes it, and finally writes the result to HDFS. To
convert this serial process into a parallel one, I split the reduce phase into
two threads, called producer and consumer. The producer reads records from
disk, and the consumer processes the records read by the producer. I use two
buffers: while the producer is filling one buffer, the consumer reads from the
other. Theoretically the two stages overlap, so we can reduce the total reduce
time.
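The double-buffer handoff described above can be sketched with `java.util.concurrent.Exchanger`, which swaps a full buffer for an empty one when both threads are ready. This is only an illustration of the scheme, not the actual patch; the class name, record strings, and batch counts are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Exchanger;

// Minimal sketch of the two-buffer producer/consumer idea: the producer
// fills one buffer with simulated records while the consumer drains the
// other; Exchanger swaps the buffers when both sides are ready.
public class DoubleBufferReduce {

    static int run() throws InterruptedException {
        Exchanger<List<String>> exchanger = new Exchanger<>();
        List<String> processed = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                List<String> buf = new ArrayList<>();
                for (int batch = 0; batch < 3; batch++) {
                    // Simulate reading raw records from disk into the current buffer.
                    for (int i = 0; i < 2; i++) {
                        buf.add("record-" + (batch * 2 + i));
                    }
                    buf = exchanger.exchange(buf); // hand over full buffer, get an empty one back
                }
                exchanger.exchange(buf); // an empty buffer marks end of stream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                List<String> spare = new ArrayList<>();
                while (true) {
                    List<String> full = exchanger.exchange(spare);
                    if (full.isEmpty()) {
                        break; // end-of-stream sentinel from the producer
                    }
                    // Simulate deserializing and reducing the records.
                    synchronized (processed) {
                        processed.addAll(full);
                    }
                    full.clear();
                    spare = full;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return processed.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed " + run() + " records");
    }
}
```

Note that `Exchanger` makes the producer block until the consumer is ready to swap, which is exactly the stall described below when the consumer is the slower thread.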

I wonder why Hadoop does not implement this already. Are there potential
problems with this idea?

I have already implemented a prototype. The producer just reads bytes from the
disk and leaves the work of transforming them into real key and value objects
to the consumer. The results are not good: only a 13% improvement in time. I
think this has something to do with the buffer size and the time spent in the
different threads. Maybe the time spent by the consumer thread is too long, so
the producer has to wait until the next buffer is available.
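A back-of-the-envelope model shows why the saving is small when the consumer dominates: overlapping the two threads can only hide the cheaper of the two costs. The per-buffer times below are hypothetical, chosen merely to illustrate a saving in the observed ~13% range.

```java
// Simple model of the overlap limit: with one thread the per-buffer cost is
// read + compute; with two threads it is bounded below by max(read, compute),
// so the best possible saving is the smaller cost divided by the total.
public class OverlapModel {
    static double saving(double readSecs, double computeSecs) {
        double serial = readSecs + computeSecs;              // one thread: read, then compute
        double overlapped = Math.max(readSecs, computeSecs); // two threads: limited by the slower one
        return 1.0 - overlapped / serial;
    }

    public static void main(String[] args) {
        // If reading is 15% of the phase, perfect overlap saves at most 15%.
        System.out.printf("max saving = %.0f%%%n", 100 * saving(1.5, 8.5));
    }
}
```

Under this model, measuring how long each thread actually spends per buffer would show whether the 13% result is already close to the ceiling.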

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.