Hi,

I'm investigating the feasibility of a hybrid approach to parallel
programming that fuses the Java fork/join concurrency library with
Hadoop: MapReduce, a paradigm suited to scalable execution over
distributed memory, plus fork/join, a paradigm for optimal
multi-threaded shared-memory execution.

I am aware that, to a degree, Hadoop can take advantage of multiple
cores on a compute node by setting
mapred.tasktracker.<map/reduce>.tasks.maximum to more than one. As
the book "Hadoop: The Definitive Guide" states, each task runs in a
separate JVM. So setting the maximum number of map tasks to 3 allows
up to 3 map-task JVMs to run concurrently on a node, right?

As an alternative to this approach, I am looking to introduce
fork/join into map tasks, and perhaps also to throttle the maximum
number of map tasks per node down to 1. I would implement a
RecordReader that yields coarser-grained map inputs, which fork/join
would then subdivide and process using multiple threads, with the
output of the join being the output of the map task.
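
As a rough illustration of the idea, here is a minimal sketch in plain
Java. It deliberately uses no Hadoop classes: the list of strings merely
stands in for the records of one coarse map input. The names
ForkJoinMapSketch, MapChunk, mapRecord and runMapTask, the threshold of
2, and the record-length "map function" are all assumptions made for
illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch, not Hadoop API: one coarse "map input" is a List<String>.
class ForkJoinMapSketch {

    // Hypothetical per-record map function: emits the record's length.
    static int mapRecord(String record) {
        return record.length();
    }

    // Recursively halves the index range [lo, hi) until it is small enough
    // to process sequentially, then concatenates results in input order.
    static class MapChunk extends RecursiveTask<List<Integer>> {
        private static final int THRESHOLD = 2; // assumed sequential cut-off
        private final List<String> records;
        private final int lo;
        private final int hi;

        MapChunk(List<String> records, int lo, int hi) {
            this.records = records;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected List<Integer> compute() {
            if (hi - lo <= THRESHOLD) {
                List<Integer> out = new ArrayList<>();
                for (int i = lo; i < hi; i++) {
                    out.add(mapRecord(records.get(i)));
                }
                return out;
            }
            int mid = (lo + hi) >>> 1;
            MapChunk left = new MapChunk(records, lo, mid);
            MapChunk right = new MapChunk(records, mid, hi);
            left.fork();                               // run left half asynchronously
            List<Integer> rightOut = right.compute();  // run right half in this thread
            List<Integer> leftOut = left.join();       // wait for the left half
            leftOut.addAll(rightOut);                  // the join is the combined output
            return leftOut;
        }
    }

    // Stands in for the body of a single map task running in one JVM.
    static List<Integer> runMapTask(List<String> records, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new MapChunk(records, 0, records.size()));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> coarseInput = Arrays.asList("a", "bb", "ccc", "dddd", "eeeee");
        System.out.println(runMapTask(coarseInput, 4)); // prints [1, 2, 3, 4, 5]
    }
}
```

In a real job the body of runMapTask would sit inside the map task, and
the joined result would be written to the task's output rather than
returned; how well that interacts with Hadoop's own record handling is
exactly the open question.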

The motivation for this is that threads are a lot more lightweight
than JVMs: as the Hadoop book points out, initializing a JVM takes a
second or so each time, unless mapred.job.reuse.jvm.num.tasks is set
higher than 1 (the default is 1). For example, suppose that opting
for fine-grained maps for a particular job on a given input generates
1,000 map tasks, which means that 1,000 JVMs will be created on the
Hadoop cluster during the execution of the program. Suppose instead
that I write a bespoke RecordReader to coarsen the granularity of
each map, so that only 100 map tasks are created. The embedded
fork/join code would then split each map into 10 threads, to be
evaluated concurrently within one JVM. I would expect this latter
approach of multiple threads to perform better than the clunkier
multiple-JVM approach.

Has such a hybrid approach combining MapReduce with fork/join been
investigated for feasibility, or have similar studies been published?
Are there any important technical limitations that I should consider?
Any thoughts from the Hadoop community on the proposed multi-threaded
distributed/shared-memory architecture would be much appreciated!

--
Rob Stewart
