Hi,

While exploring how Hadoop fits our usage scenarios, four recurring issues keep popping up. I don't know whether they are real issues or just our misunderstanding of Hadoop. Can any expert shed some light here?
Disk I/O overhead
==================
- The output of a map task is written to local disk and later copied up to the reduce task. While this enables a simple recovery strategy when a map task fails, it incurs additional disk I/O overhead.
- For example, in the popular Hadoop example that approximates "Pi", there is no real input data; the map tasks in this example could just feed their output directly to the reduce task. So I am wondering whether there is an option to bypass the step of writing the map result to local disk.

Pipelining between the Map & Reduce phases is not possible
===========================================================
- In the current setting, it sounds like no reduce task will be started before all map tasks have completed. If there are a few slow-running map tasks, the whole job is delayed.
- The overall job execution could be shortened if the reduce tasks could start processing as soon as some map results are available, rather than waiting for all the map tasks to complete.

Pipelining between jobs
========================
- In many cases, we've found that the parallel computation involves not just a single map/reduce job, but multiple inter-dependent map/reduce jobs working together in some coordinated fashion.
- Again, I haven't seen any mechanism for two MapReduce jobs to interact with each other directly. Job1 must write its output to HDFS for Job2 to pick up (the only chaining pattern we know of is sketched below). On the other hand, once the map phase of Job2 has started, all of its input HDFS files have to be frozen; in other words, Job1 cannot append more records to those files.
- Therefore it is impossible for the reduce phase of Job1 to stream its output to a file while the map phase of Job2 reads that same file. Job2 can only start after ALL reduce tasks of Job1 have completed, which makes pipelining between jobs impossible.

No parallelism within the reduce task for a single key
=======================================================
- Parallelism happens in the map phase and across the reduce phase (on different keys), but there is no parallelism within the reduce processing of one particular key.
- This means the partitioning function has to be chosen carefully to keep the workload of the reduce processes balanced (maybe not a big deal).
- Are there any thoughts on running a pool of reduce tasks on the same key and having them combine their results later? The second sketch below shows the kind of workaround we have in mind.
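To make the chaining point concrete, the only pattern we are aware of today is: run Job1 to completion, let its output land on HDFS, then submit Job2 over that output. Below is a minimal driver sketch using the org.apache.hadoop.mapred API with the stock identity map/reduce classes; the paths are made up for illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    // --- Job1: identity pass over some input, output to a temp dir ---
    JobConf job1 = new JobConf(ChainedJobs.class);
    job1.setJobName("job1");
    job1.setMapperClass(IdentityMapper.class);
    job1.setReducerClass(IdentityReducer.class);
    job1.setOutputKeyClass(LongWritable.class);
    job1.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job1, new Path("/user/ricky/input"));
    FileOutputFormat.setOutputPath(job1, new Path("/user/ricky/job1-out"));
    JobClient.runJob(job1);  // blocks until EVERY task of Job1 is done

    // --- Job2: can only be submitted once Job1's output is frozen ---
    JobConf job2 = new JobConf(ChainedJobs.class);
    job2.setJobName("job2");
    job2.setMapperClass(IdentityMapper.class);
    job2.setReducerClass(IdentityReducer.class);
    job2.setOutputKeyClass(LongWritable.class);
    job2.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job2, new Path("/user/ricky/job1-out"));
    FileOutputFormat.setOutputPath(job2, new Path("/user/ricky/output"));
    JobClient.runJob(job2);
  }
}

Since JobClient.runJob() blocks until the whole job (including every reduce task) has finished, there is no point at which Job2 could begin consuming partial output from Job1.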
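And to make the last question concrete, the workaround we have been toying with is to "salt" a hot key so that its values spread over a small pool of reduce tasks, with a second job merging the partial results afterwards. This is purely our own sketch, not an existing Hadoop facility; the FANOUT constant and the "key#salt" format are invented for illustration.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// A hot key K is rewritten to K#0 .. K#(FANOUT-1), so the default hash
// partitioner spreads its values over up to FANOUT reduce tasks.
public class SaltingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final int FANOUT = 4;   // size of the reduce "pool" per key
  private final Random random = new Random();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    String key = line.toString();        // pretend each input line is one key
    int salt = random.nextInt(FANOUT);
    output.collect(new Text(key + "#" + salt), new LongWritable(1));
  }
}

The first job would aggregate per salted key; a trivial second job then maps each "K#i" back to "K" and combines the FANOUT partial aggregates into the final result for K. The downside is exactly the extra job (and extra HDFS round trip) described above, which is why we are asking whether something built-in exists.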
Rgds,
Ricky