Rob, Hadoop has as a way to run Map tasks in multithreading mode, look for the MultithreadedMapRunner & MultithreadedMapper.
Thanks. Alejandro. On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart <[email protected]> wrote: > Hi, > > I'm investigating the feasibility of a hybrid approach to parallel > programming, by fusing together the concurrent Java fork/join > libraries with Hadoop... MapReduce, a paradigm suited for scalable > execution over distributed memory + fork/join, a paradigm for optimal > multi-threaded shared memory execution. > > I am aware that, to a degree, Hadoop can take advantage of multiple > core on a compute node, by setting the > mapred.tasktracker.<map/reduce>.tasks.maximum to be more than one. As > it states in the "Hadoop: The definitive guide" book, each task runs > in a separate JVM. So setting maximum map tasks to 3, will allow the > possibility of 3 JVMs running on the Operating System, right? > > As an alternative to this approach, I am looking to introduce > fork/join into map tasks, and perhaps maybe too, throttling down the > maximum number of map tasks per node to 1. I would implement a > RecordReader that produces map tasks for more coarse granularity, and > that fork/join would subdivide to map task using multiple threads - > where the output of the join would be the output of the map task. > > The motivation for this is that threads are a lot more lightweight > than initializing JVMs, which as the Hadoop book points out, takes a > second or so each time, unless mapred.job.resuse.jvm.num.tasks is set > higher than 1 (the default is 1). So for example, opting for small > granular maps for a particular job for a given input generates 1,000 > map tasks, which will mean that 1,000 JVMs will be created on the > Hadoop cluster within the execution of the program. If I were to write > a bespoke RecordReader to increase the granularity of each map, so > much so that only 100 map tasks are created. The embedded fork/join > code would further split each map into 10 threads, to evaluate within > one JVM concurrently. I would expect this latter approach of multiple > threads to have better performance, than the clunkier multple-JVM > approach. > > Has such a hybrid approach combining MapReduce with ForkJoin been > investigated for feasibility, or similar studies published? Are there > any important technical limitations that I should consider? Any > thoughts on the proposed multi-threaded distributed-shared memory > architecture are much appreciated from the Hadoop community! > > -- > Rob Stewart >
