Hi everybody,

I'm having an issue with CDH3u0 where some of my reduce tasks are failing due to a Child Error caused by the task JVM exiting with a status of 1. From hunting around in the mailing list archives, it seems that this usually happens for one of two reasons:

1. The userlog directory has too many subdirectories, so the task fails when the necessary logs can't be created. This isn't the case here, since there are only a few dozen subdirectories.

2. The mapred.child.ulimit configuration parameter is lower than the max heap size set by mapred.child.java.opts. Again, I don't think that this is the cause, since I've set mapred.child.ulimit to about 2 GB (the exact value is 2,097,000 KB), while the heap size set in mapred.child.java.opts is 1024 MB.

There's nothing in the stdout or stderr logs for the failed tasks, and the syslog seems normal. There doesn't seem to be anything out of the ordinary in the TaskTracker (TT) log pertaining to the tasks until the tasks' JVMs fail. For reference, a task's syslog and an excerpt of the TT log from while the task was running are available here: https://gist.github.com/1331700. The TT's mapred-site.xml (slightly redacted) is available here: https://gist.github.com/1331740.

I don't think the issue has anything to do with the code itself, since the code is well-tested and runs fine most of the time. However, I do think that it has something to do with memory, since the jobs whose tasks fail are the ones that process a lot of data.

Could the task JVM exit with status 1 for any reason other than the two I listed above (particularly a memory-related reason)? Or am I goofing something else up?

Cheers,
Dan Lidral-Porter
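
P.S. For anyone who wants to double-check the arithmetic behind reason 2, here is a minimal Python sketch of the comparison. It assumes the heap is set with an option of the form -Xmx1024m (the actual string is in the gist, so this is a stand-in), and it relies on mapred.child.ulimit being expressed in KB. The function names are made up for illustration and are not part of Hadoop.

```python
import re

# Multipliers for the size suffixes accepted by -Xmx (no suffix means bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def xmx_bytes(java_opts):
    """Return the -Xmx value found in a java opts string, in bytes.

    Returns None when no -Xmx option is present.
    """
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        return None
    return int(match.group(1)) * _UNITS[match.group(2).lower()]


def ulimit_headroom_bytes(ulimit_kb, java_opts):
    """Return (ulimit - max heap) in bytes; negative means the ulimit is too low."""
    heap = xmx_bytes(java_opts)
    if heap is None:
        raise ValueError("no -Xmx option found in java opts")
    return ulimit_kb * 1024 - heap


# Values from this post; "-Xmx1024m" is an assumed spelling of the heap setting.
headroom = ulimit_headroom_bytes(2_097_000, "-Xmx1024m")
print(f"headroom above max heap: {headroom / 1024 ** 2:.1f} MiB")
```

With these numbers the headroom comes out positive, which is why I don't think the ulimit is simply below the heap size. It doesn't rule out the JVM needing more virtual memory than the heap alone (thread stacks, native allocations and so on), so treat it as a sanity check only.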
