Hi Everybody,

I'm having an issue with CDH3u0 where some of my reduce tasks are failing due 
to a Child Error caused by the task JVM exiting with a status of 1. From 
hunting around in the mailing list archives, it seems that this usually happens 
for one of two reasons:

1. The userlog directory has too many subdirectories, so the task fails when 
the necessary logs can't be created. This isn't the case here, since there are 
only a few dozen subdirectories.

2. The mapred.child.ulimit configuration parameter is lower than the max heap 
size set by mapred.child.java.opts. Again, I don't think that this is the 
cause, since I've set mapred.child.ulimit to about 2 GB (the exact value is 
2,097,000; the parameter is in KB), while the heap size set in 
mapred.child.java.opts is 1024 MB.
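
For anyone wanting to repeat the two checks above on a TaskTracker node, here 
is a minimal sketch. It makes several assumptions: the userlog directory sits 
on ext3 (which caps a directory at roughly 32,000 links, so 31,998 
subdirectories), mapred.child.ulimit is given in KB, and the heap is set with 
an -Xmx flag such as "-Xmx1024m" (the exact opts string is not in this post). 
The function names are made up for illustration and are not part of Hadoop.

```python
import os
import re

# Assumption: ext3 allows about 32,000 links per directory, which leaves
# room for 31,998 subdirectories. Other filesystems have different limits.
EXT3_MAX_SUBDIRS = 31998


def count_subdirs(path):
    """Return the number of immediate subdirectories of `path`."""
    return sum(
        1 for entry in os.scandir(path) if entry.is_dir(follow_symlinks=False)
    )


def max_heap_kb(java_opts):
    """Return the -Xmx value found in `java_opts`, in KB, or None if absent."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", java_opts)
    if match is None:
        return None
    value = int(match.group(1))
    unit = match.group(2).lower()
    # A bare number after -Xmx is a byte count.
    bytes_per_unit = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    return value * bytes_per_unit[unit] // 1024


def ulimit_headroom_kb(ulimit_kb, java_opts):
    """Return how many KB of the ulimit remain once the max heap is reserved.

    A negative result means the ulimit is smaller than the heap itself.
    Returns None when no -Xmx flag is present.
    """
    heap_kb = max_heap_kb(java_opts)
    if heap_kb is None:
        return None
    return ulimit_kb - heap_kb


if __name__ == "__main__":
    # Values from this post; "-Xmx1024m" is an assumed spelling of "1024 MB".
    print(ulimit_headroom_kb(2097000, "-Xmx1024m"))
```

Note that the ulimit applies to the virtual memory of the whole task process, 
not just the Java heap, so positive headroom is necessary but may not be 
sufficient: the JVM also needs address space for thread stacks, the permanent 
generation, and native allocations.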

There's nothing in the stdout or stderr logs for the failed tasks, and the 
syslog seems normal. Nothing in the TaskTracker (TT) log pertaining to the 
tasks looks out of the ordinary either, up to the point where the task JVM 
fails. For 
reference, a task's syslog and an excerpt of the TT log while the task was 
running are available here: https://gist.github.com/1331700. The TT's 
mapred-site.xml (slightly redacted) is available here: 
https://gist.github.com/1331740.

I don't think the issue has anything to do with the code itself, since the code 
is well-tested, and runs fine most of the time. However, I do think that it's 
something to do with memory, since the jobs whose tasks fail are the ones that 
process a lot of data. Could the task JVM exit with status 1 for any reason 
other than the two I listed above (particularly a memory-related reason)? Or am 
I goofing something else up?

Cheers,
Dan Lidral-Porter
