Hi,
I am looking at the multithreaded map task feature. The new API provides the org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper class to enable multithreading within each map task, and the setNumberOfThreads API sets the number of threads in the thread pool that runs the map function.

Here I want to clarify the scenarios in which we should enable multithreaded map tasks. Generally, Hadoop MapReduce provides the mapred.tasktracker.map.tasks.maximum parameter to control the number of concurrent map tasks per node (there is a corresponding parameter for reduce tasks). We can start more child task JVMs to increase CPU utilization, so we do not need multithreaded tasks in most scenarios. However, multithreaded tasks may be useful in these specific scenarios:

1) When the workload is bounded by memory or I/O, not CPU. For example, suppose each running map task loads its input into memory, and the cluster can hold at most 50 GB of input at a time, but the CPUs of the cluster are not fully utilized. Then we can enable multithreaded tasks to increase CPU utilization without starting more tasks.

2) When the tasks are unbalanced. I encountered this problem when processing very large social graphs. With 200 map tasks (on average 8 concurrent map tasks per node, 7 nodes in total), 99% of the tasks complete within 1 hour, but the remaining 1% take more than 10 hours. This is caused by the unbalanced degree distribution of the social graph. Once most tasks have completed, the CPU utilization of the nodes still running is lower than 20%. I think multithreaded tasks could increase the CPU utilization in this situation.

My questions:

1. Is the above understanding right?
2. Why is there no multithreaded reducer interface?
3. How should the number of threads be set? (The number that lets all cores be utilized?)
4. Some prior articles point out that we should pay attention to thread safety when using the multithreaded mapper. I do not quite understand this. The basic MapReduce model naturally isolates each key. I would guess that a key is processed within a single thread even when the multithreaded mapper is enabled, so how could multiple threads interact with each other?

Discussion and comments are welcome!

--
- Juwei
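
P.S. To make question 3 concrete, here is a minimal sketch of the sizing rule I have in mind. It is only my assumption, not something Hadoop prescribes: divide the cores of a node evenly among the map tasks running on it at the same time. The class name ThreadSizing, the method name threadsPerMapTask, and the 16-core node are hypothetical and used only for illustration.

```java
/**
 * Rough sizing rule for MultithreadedMapper threads.
 * Assumption (not a Hadoop rule): split a node's cores evenly across
 * the map tasks that run on that node at the same time.
 */
final class ThreadSizing {

    private ThreadSizing() {
        // Static helper only; no instances.
    }

    /**
     * @param coresPerNode              CPU cores available on one node
     * @param concurrentMapTasksPerNode map tasks running at once on that node
     * @return suggested threads per map task, never less than 1
     */
    static int threadsPerMapTask(int coresPerNode, int concurrentMapTasksPerNode) {
        if (coresPerNode < 1 || concurrentMapTasksPerNode < 1) {
            throw new IllegalArgumentException("both arguments must be >= 1");
        }
        // Integer division rounds down; keep at least one thread per task.
        return Math.max(1, coresPerNode / concurrentMapTasksPerNode);
    }

    public static void main(String[] args) {
        // Cores of the machine running this sketch (not of a cluster node).
        int localCores = Runtime.getRuntime().availableProcessors();
        System.out.println("local cores, 8 tasks -> "
                + threadsPerMapTask(localCores, 8) + " thread(s) per task");

        // Hypothetical 16-core node with 8 concurrent map tasks.
        System.out.println("16 cores, 8 tasks -> " + threadsPerMapTask(16, 8));

        // Same node when a single straggler task is left running.
        System.out.println("16 cores, 1 task  -> " + threadsPerMapTask(16, 1));
    }
}
```

The resulting number would be passed to MultithreadedMapper.setNumberOfThreads(job, n) when the job is set up. Since that value is chosen once for the whole job, a number sized for the straggler case (one task per node) would oversubscribe the cores while all 8 tasks per node are still running, which is part of why I am unsure what the right setting is.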