is it possible to add parallelism to org.apache.hadoop.mapred.MapTask.MapOutputBuffer.sortAndSpill()

hongbin ma Mon, 22 Feb 2016 03:10:13 -0800

Hi experts,

My MR job contains 1000 mappers and 500 reducers, and the average run time
of both mappers and reducers is 8~9 minutes. When I checked the mappers' logs
I found that the sortAndSpill step takes a significant portion of that time.


From my observation and from reading the code, in sortAndSpill each mapper
first sorts all of its output, and then writes each partition's portion
sequentially to an IFile (in my case the write is accompanied by
compression). Writing a single partition takes only a short time
(less than one second), but 500 such writes add up to hundreds of
seconds.

Since each partition's output process interleaves CPU-intensive
work (compression) with IO-intensive work (writing to disk), it
is natural to think about parallelism. However, I was surprised to
find that even in the latest branches the sortAndSpill step is still
sequential. So I'm wondering whether someone has already researched this
and shown that it does not work? Or are there concerns about this kind of
optimization?
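
To make the idea concrete, here is a minimal sketch of what I have in mind.
It is not Hadoop code: the class and method names (ParallelSpillSketch,
compress, spill) are hypothetical, GZIP from the JDK stands in for whatever
compression codec the job is configured with, and an in-memory stream stands
in for the spill file. It assumes each partition's records are already sorted
and serialized. The CPU-bound compression runs on a thread pool, while the
compressed segments are still appended to the single output in partition
order.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of Hadoop.
class ParallelSpillSketch {

    // Compress one partition's already-sorted, serialized records.
    // This is the CPU-bound part that is worth running in parallel.
    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Compress all partitions on a thread pool, then append the compressed
    // segments to 'out' strictly in partition order. Returns the start
    // offset of each partition's segment within 'out'.
    static long[] spill(List<byte[]> partitions, ByteArrayOutputStream out,
                        int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] p : partitions) {
                futures.add(pool.submit(() -> compress(p)));
            }
            long[] offsets = new long[partitions.size()];
            for (int i = 0; i < futures.size(); i++) {
                offsets[i] = out.size();          // where partition i starts
                out.write(futures.get(i).get());  // blocks until i is ready
            }
            return offsets;
        } finally {
            pool.shutdown();
        }
    }
}
```

The obvious cost of this approach is that compressed segments waiting for
their turn to be written are held in memory, so a real implementation would
probably need to bound the number of in-flight partitions.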

Thanks in advance

-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
