Hello, I am trying to set up a MapReduce job so that the task JVMs are reused on each cluster node. The libraries used by my MapReduce job have a significant initialization time, mainly spent creating singletons, and it would be nice if these singletons were created only once per slot rather than once per task. The input for the job is HBase, so for a large row scan the initialization time is proving quite significant: the processing done on each row is rather small, and the number of tasks is high.
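To illustrate the pattern I mean: the expensive setup lives behind a lazily initialized static holder, so it runs once per JVM, and with JVM reuse that would be once per slot instead of once per task. A minimal, self-contained sketch (ExpensiveSingleton and its initCount counter are placeholders for the real library, not actual code from my job):

```java
// Hypothetical sketch of once-per-JVM initialization.
// ExpensiveSingleton stands in for the real library; initCount
// just makes the initialization behavior observable.
public class ExpensiveSingleton {
    // number of times the expensive initialization actually ran
    static int initCount = 0;

    private static ExpensiveSingleton instance;

    private ExpensiveSingleton() {
        initCount++; // stands in for the slow library setup
    }

    // Called from each task's setup(); only the first call in a given
    // JVM pays the initialization cost, so with JVM reuse the cost is
    // paid once per slot rather than once per task.
    public static synchronized ExpensiveSingleton get() {
        if (instance == null) {
            instance = new ExpensiveSingleton();
        }
        return instance;
    }
}
```

With JVM reuse disabled, every task gets a fresh JVM and initCount starts over at zero each time, which is exactly the cost I am trying to avoid.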
I am setting mapred.job.reuse.jvm.num.tasks to -1 in the job configuration, as described in the documentation ([1]), yet I am still seeing a new JVM start for each task. This is visible both by watching the processes on each node with ps and in the debugging logs from the job. Otherwise, the job works as expected.

I have tried switching to the deprecated JobConf class and using setNumTasksToExecutePerJvm, but to no avail. I also tried setting mapreduce.job.jvm.numtasks, the equivalent setting in Hadoop 0.21, in case the documentation was out of date, but this did not help either. I have confirmed that mapred.job.reuse.jvm.num.tasks is being transferred to the task tracker's copy of the job configuration by looking at the task tracker's job.xml ([2]).

I am running Cloudera's cdh3u0 (Hadoop 0.20.2, full version string: 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14) and HBase 0.90.1.

Thanks in advance to anyone who can shed light on this issue.

[1] - http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Task+JVM+Reuse

[2] - The following appears in the property file (split across multiple lines by me for readability):

<property>
  <!--Loaded from /mnt/mapred/jt/jobTracker/job_201107281409_0028.xml-->
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>

Regards,
Brandon Vargo