I'm investigating a bug where my mapper and reducer tasks run out of memory. It only reproduces on large data sets, so the best way to dig in is to launch my job on the cluster with sufficiently large inputs and remotely monitor the memory behavior of the failing JVMs. Java VisualVM looks like the right tool; specifically, I want to use it to take heap dumps of my tasks. However, I can't figure out how to set up the listening end on the cluster nodes.
Here is what I have tried:

1. *Turn on JMX remote for the tasks*. I added the following options to mapred.child.java.opts: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8004 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false. This does not work because tasks running on the same node contend for the JMX remote port: all but the first task fail at JVM initialization, which fails the job before I can see the repro.

2. *Use jstatd*. I ran jstatd in the background on my cluster nodes. It launches and runs, but when I try to connect with VisualVM, nothing happens.

Next I am going to try adding -XX:+HeapDumpOnOutOfMemoryError, which will at least give me post-mortem information. Does anyone know where the heap dump file will be written? Has anyone debugged a similar setup? What tools did you use?
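
For concreteness, here is a rough sketch of the options string I have in mind for the post-mortem approach. It is only an illustration: the class name, the helper method, the 512m heap size, and the /tmp/heapdumps directory are placeholders I made up, and I have not verified this on the cluster. As far as I understand, HotSpot writes java_pid<pid>.hprof into the JVM's working directory unless -XX:HeapDumpPath is given, so setting the path explicitly might sidestep the question of where the file lands.

```java
public class ChildJvmOpts {
    // Builds a candidate value for mapred.child.java.opts.
    // maxHeap and dumpDir are hypothetical values chosen by the caller;
    // dumpDir is assumed to already exist and be writable on every node.
    static String heapDumpOpts(String maxHeap, String dumpDir) {
        return String.join(" ",
            "-Xmx" + maxHeap,                   // child JVM heap limit
            "-XX:+HeapDumpOnOutOfMemoryError",  // '+' enables the dump, '-' would disable it
            "-XX:HeapDumpPath=" + dumpDir);     // directory (or file) to write the .hprof to
    }

    public static void main(String[] args) {
        String opts = heapDumpOpts("512m", "/tmp/heapdumps");
        // In a job driver this string would be set on the job configuration
        // under the key "mapred.child.java.opts" (assumption: old mapred API).
        System.out.println(opts);
    }
}
```

Unlike the fixed JMX port, nothing here is shared between tasks on the same node, since each dump file name includes the process id.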
