Comments inline.

On Nov 6, 2008, at 9:29 AM, Ricky Ho wrote:

Hi,

While exploring how Hadoop fits our usage scenarios, four recurring issues keep popping up. I don't know if they are real issues or just our misunderstanding of Hadoop. Can any expert shed some light here?

Disk I/O overhead
==================
- The output of a map task is written to local disk and later uploaded to the reduce task. While this enables a simple recovery strategy when a map task fails, it incurs additional disk I/O overhead.

- For example, in the popular Hadoop example that approximates "Pi", there isn't any input data. The map tasks in this example should just feed their output directly to the reduce task. So I am wondering if there is an option to bypass the step of writing the map result to local disk.

In most data-intensive map/reduce jobs, you have to spill your map output to disk at some point because you would otherwise run out of memory. Additionally, the Pi calculation is a really bad example here: its reduction is commutative and associative, so you can start "reducing" any pairs together arbitrarily. We have a special construct for exactly that situation, called a "combiner", which is basically a map-side reducer.
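To make that concrete, here is a minimal sketch of a summing reducer that doubles as a combiner, written against the old org.apache.hadoop.mapred API. The driver lines at the bottom assume a hypothetical PiDriver class; this is my own illustration, not the code of the bundled Pi example.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Summing is commutative and associative, so the same class can run
    // both as the combiner (on each map task's local output) and as the
    // final reducer (on the fully merged output).
    public class SumReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterator<LongWritable> values,
          OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        long sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new LongWritable(sum));
      }
    }

    // In the (hypothetical) driver:
    //   JobConf conf = new JobConf(PiDriver.class);
    //   conf.setCombinerClass(SumReducer.class); // map-side pre-reduce
    //   conf.setReducerClass(SumReducer.class);

With the combiner in place, each map task pre-aggregates its own output before it is spilled and shuffled, so far less data hits the disk and the wire.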



Pipelining between Map & Reduce phases is not possible
=======================================================
- In the current setting, it sounds like no reduce task will start before all map tasks have completed. If there are a few slow-running map tasks, the whole job will be delayed.

- The overall job execution could be shortened if the reduce tasks could start processing as soon as some map results are available, rather than waiting for all the map tasks to complete.

You can't start reducing until all map tasks are complete because until all map tasks complete, you can't do an accurate sort of all intermediate key/value pairs. That is, if you just started reducing the results of a single map task immediately, you might have other values for some keys that come from different map tasks, and your reduce would be inaccurate. In theory if you know that each map task produces keys only in a certain range, you could start reducing immediately after the map task finishes, but that seems like an unlikely case.
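A toy illustration of that failure mode, in plain Java rather than Hadoop (the "map task" maps here just stand in for two map tasks' outputs):

    import java.util.HashMap;
    import java.util.Map;

    // Why reducing one map task's output early gives wrong answers:
    // another map task may still emit values for the same key.
    public class EarlyReduceDemo {
      public static void main(String[] args) {
        Map<String, Integer> mapTask1 = new HashMap<String, Integer>();
        mapTask1.put("cat", 3);
        Map<String, Integer> mapTask2 = new HashMap<String, Integer>();
        mapTask2.put("cat", 5);

        // "Early" reduce over map task 1 alone: cat=3 -- wrong.
        System.out.println("early: cat=" + mapTask1.get("cat"));

        // A correct reduce must first merge (shuffle/sort) all map
        // outputs so every value for "cat" is grouped together.
        int merged = mapTask1.get("cat") + mapTask2.get("cat");
        System.out.println("final: cat=" + merged); // cat=8
      }
    }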



Pipelining between jobs
========================
- In many cases, we've found the parallel computation doesn't involve just a single map/reduce job, but multiple inter-dependent map/reduce jobs that work together in some coordinated fashion.

- Again, I haven't seen any mechanism for two MapReduce jobs to interact directly with each other. Job1 must write its output to HDFS for Job2 to pick up. Moreover, once the "map" phase of Job2 has started, all its input HDFS files have to be frozen (in other words, Job1 cannot append more records to those files).

- Therefore it is impossible for the reduce phase of Job1 to stream its output to a file while the map phase of Job2 starts reading the same file. Job2 can only start after ALL reduce tasks of Job1 have completed, which makes pipelining between jobs impossible.

Certainly, many transformations take more than one map/reduce job. However, very few could actually be pipelined such that the output of one feeds directly into another without an intermediate stop in a file. If the first job does any grouping or sorting, then its reduce is necessary and it will have to write out to a file before anything else can go on. If the second job also does grouping or sorting, then you definitely need two jobs. If the second job doesn't group or sort, then it can probably be collapsed into either the map or reduce of the first job.
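For what it's worth, the usual pattern with the old org.apache.hadoop.mapred API is simply to run the jobs back to back from one driver, pointing the second job's input at the first job's output directory; JobClient.runJob blocks until a job finishes. The mapper/reducer class names and paths below are hypothetical placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainDriver {
      public static void main(String[] args) throws Exception {
        // Job1: does the grouping/sorting, writes to an intermediate dir.
        JobConf job1 = new JobConf(ChainDriver.class);
        job1.setMapperClass(FirstMapper.class);    // hypothetical
        job1.setReducerClass(FirstReducer.class);  // hypothetical
        FileInputFormat.setInputPaths(job1, new Path("/input"));
        FileOutputFormat.setOutputPath(job1, new Path("/tmp/intermediate"));
        JobClient.runJob(job1);  // blocks until Job1 completes

        // Job2: reads Job1's (now frozen) output from HDFS.
        JobConf job2 = new JobConf(ChainDriver.class);
        job2.setMapperClass(SecondMapper.class);   // hypothetical
        job2.setReducerClass(SecondReducer.class); // hypothetical
        FileInputFormat.setInputPaths(job2, new Path("/tmp/intermediate"));
        FileOutputFormat.setOutputPath(job2, new Path("/output"));
        JobClient.runJob(job2);
      }
    }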



No parallelism of reduce task with one key
===========================================
- Parallelism happens in the map phase and in the reduce phase (across different keys), but there is no parallelism within the reduce processing of a particular key.

- This means the partitioning function has to be chosen carefully to make sure the workload across the reduce tasks is balanced (maybe not a big deal).

- Are there any thoughts on running a pool of reduce tasks on the same key and having them combine their results later?

I think you will find very few situations where you have only one key on reduce. If you do, it's probably a scenario where you can use a combiner and eliminate the problem. Basically all map/reduce jobs I've worked on have a large number of keys going into the reduce phase.
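On the partitioning point from the question: the default partitioner just hashes the key, and if that leaves some reduce tasks overloaded you can plug in your own. A minimal sketch against the old org.apache.hadoop.mapred API; this class is my own illustration, wired in with conf.setPartitionerClass:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // A partitioner decides which reduce task receives each key. Every
    // record with the same key lands on the same reducer, so balancing
    // reduce workload means balancing this key-to-partition mapping.
    public class BalancedPartitioner implements Partitioner<Text, LongWritable> {
      public void configure(JobConf job) {}

      public int getPartition(Text key, LongWritable value, int numPartitions) {
        // The default behaviour is essentially this hash; substitute
        // domain knowledge here if some keys carry far more records.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

If one key really does dominate, the usual workaround is to split it into salted sub-keys in the map and merge the partial results in a second pass, but as noted above a combiner usually removes the need for that entirely.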



Rgds, Ricky
