Hi, I'm investigating the feasibility of a hybrid approach to parallel programming, by fusing together the concurrent Java fork/join libraries with Hadoop... MapReduce, a paradigm suited for scalable execution over distributed memory + fork/join, a paradigm for optimal multi-threaded shared memory execution.
I am aware that, to a degree, Hadoop can take advantage of multiple core on a compute node, by setting the mapred.tasktracker.<map/reduce>.tasks.maximum to be more than one. As it states in the "Hadoop: The definitive guide" book, each task runs in a separate JVM. So setting maximum map tasks to 3, will allow the possibility of 3 JVMs running on the Operating System, right? As an alternative to this approach, I am looking to introduce fork/join into map tasks, and perhaps maybe too, throttling down the maximum number of map tasks per node to 1. I would implement a RecordReader that produces map tasks for more coarse granularity, and that fork/join would subdivide to map task using multiple threads - where the output of the join would be the output of the map task. The motivation for this is that threads are a lot more lightweight than initializing JVMs, which as the Hadoop book points out, takes a second or so each time, unless mapred.job.resuse.jvm.num.tasks is set higher than 1 (the default is 1). So for example, opting for small granular maps for a particular job for a given input generates 1,000 map tasks, which will mean that 1,000 JVMs will be created on the Hadoop cluster within the execution of the program. If I were to write a bespoke RecordReader to increase the granularity of each map, so much so that only 100 map tasks are created. The embedded fork/join code would further split each map into 10 threads, to evaluate within one JVM concurrently. I would expect this latter approach of multiple threads to have better performance, than the clunkier multple-JVM approach. Has such a hybrid approach combining MapReduce with ForkJoin been investigated for feasibility, or similar studies published? Are there any important technical limitations that I should consider? Any thoughts on the proposed multi-threaded distributed-shared memory architecture are much appreciated from the Hadoop community! -- Rob Stewart
