Hello,

I am trying to set up a MapReduce job so that the task JVMs are reused on
each cluster node. Libraries used by my MapReduce job have a significant
initialization time, mainly creating singletons, and it would be nice if
I could make it so that these singletons are only created once per slot,
rather than once per task. The input for the job is HBase, so for a
large row scan, the initialization time is proving to be quite
significant, as the processing done on each row is rather small and the
number of tasks is high.
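For concreteness, the pattern I want to benefit from is an ordinary lazily initialized static singleton: with JVM reuse enabled, the static field would survive across tasks in the same slot, so the expensive setup would run only once per JVM. A minimal sketch (ExpensiveResource is a hypothetical stand-in for the real library object):

```java
// Hypothetical stand-in for a library object with costly initialization.
// With JVM reuse, the static instance persists across tasks in a slot,
// so the expensive constructor runs once per JVM rather than once per task.
public class ExpensiveResource {
    private static ExpensiveResource instance;
    private static int initCount = 0; // counts how often the costly setup ran

    private ExpensiveResource() {
        initCount++; // simulate the significant one-time initialization
    }

    // Called from the mapper's setup(); only the first call per JVM pays.
    public static synchronized ExpensiveResource getInstance() {
        if (instance == null) {
            instance = new ExpensiveResource();
        }
        return instance;
    }

    public static int getInitCount() {
        return initCount;
    }
}
```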

I am setting mapred.job.reuse.jvm.num.tasks to -1 in the job
configuration, as stated in the documentation ([1]), yet I am still
seeing a different JVM start for each task. This is visible both by
watching the processes on each node with ps and by watching the
debugging logs from the job. Otherwise, the job is working
as expected.

I have tried switching to the deprecated JobConf class and using
setNumTasksToExecutePerJvm, but to no avail. I also tried setting
mapreduce.job.jvm.numtasks, the equivalent setting in Hadoop 0.21, in
case the documentation was out of date, though this did not help either.
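For reference, here is roughly how I am setting the option with both the new-style and deprecated 0.20 APIs (a configuration sketch, not the complete driver; the job name is arbitrary):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// New-style API: set the property directly; -1 means unlimited reuse.
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
Job job = new Job(conf, "hbase-row-scan");

// Deprecated API equivalent, which I also tried:
JobConf jobConf = new JobConf();
jobConf.setNumTasksToExecutePerJvm(-1);
```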

I have confirmed that mapred.job.reuse.jvm.num.tasks is being
transferred to the task tracker's copy of the job configuration by
inspecting the task tracker's copy of job.xml ([2]).

I am running Cloudera's cdh3u0 (Hadoop 0.20.2, full version string:
0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14) and HBase
0.90.1.

Thank you in advance to anyone who can shed light on this issue.

[1] - http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse

[2] - The following appears in the property file (split across multiple
lines by me for readability):
<property>
  <!--Loaded from /mnt/mapred/jt/jobTracker/job_201107281409_0028.xml-->
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>

Regards,

Brandon Vargo
