John Elliott created MAPREDUCE-4947:
---------------------------------------

             Summary: Random task failures during TeraSort job
                 Key: MAPREDUCE-4947
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4947
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 1.0.1, 1.0.0, 0.20.205.0
         Environment: RHEL 6.2
                      4 datanodes
                      one xfs filesystem per datanode
                      2 quad-core CPUs per datanode
                      48 GB memory per datanode
                      10GbE node interconnect
                      jdk1.6.0_32
            Reporter: John Elliott
            Priority: Minor

During most of my terasort jobs, I see occasional, random map task failures during the reduce phase. Usually there are only 1-4 task failures per job, and the job still completes successfully. On rare occasions, a tasktracker is blacklisted. Below are the usual error messages:

========================================
INFO mapred.JobClient: Task Id : attempt_201301151521_0002_m_005954_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 126.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

WARN mapred.JobClient: Error reading task outputhttp://datanode3:50060/tasklog?plaintext=true&attemptid=attempt_201301151521_0002_m_005954_0&filter=stdout
WARN mapred.JobClient: Error reading task outputhttp://datanode3:50060/tasklog?plaintext=true&attemptid=attempt_201301151521_0002_m_005954_0&filter=stderr
==========================================

Each tasktracker node is configured with 8 map and 7 reduce slots, for a total of 32 map slots and 28 reduce slots across the 4-datanode cluster. The problem never occurs during teragen jobs, and it only occurs after the reduce copies start. Cutting the number of slots in half reduces the frequency, but the problem still occurs.

Actions taken without any success:
- ulimit increases for nproc and nofile to 32768 and then 65536
- setting MALLOC_ARENA_MAX=4 in the hadoop-env.sh file per HADOOP-7154
- heapsize increases and reductions
- reduction of map and reduce slots as stated above
- various modifications of mapreduce and hdfs properties

I've done quite a bit of testing with CDH3 on the same hardware and have not encountered this problem, so I suspect there may be a bug fix or patch I'm missing. Any suggestions for further isolating the problem or for patches to apply would be much appreciated.

Thanks in advance!
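
One way to narrow this down further might be to check whether the failed attempts cluster on a particular tasktracker. The sketch below tallies distinct failed attempts per host from saved JobClient console output. It assumes the failures always produce "Error reading task output" warnings with a tasklog URL in the form quoted above; the function name `failures_per_host` is made up for illustration and is not part of Hadoop.

```python
import re
from collections import Counter

# Matches the tasklog URL that JobClient prints in its
# "Error reading task output" warnings (format assumed from the report).
TASKLOG_URL = re.compile(
    r"http://([^:/\s]+):\d+/tasklog\?\S*?attemptid=(attempt_\w+)"
)


def failures_per_host(lines):
    """Count distinct failed task attempts per tasktracker host."""
    seen = set()
    for line in lines:
        match = TASKLOG_URL.search(line)
        if match:
            # stdout and stderr warnings share one attempt id, so a set
            # keeps each (host, attempt) pair from being counted twice.
            seen.add((match.group(1), match.group(2)))
    return Counter(host for host, _attempt in seen)
```

Passing it the lines of a saved job log (for example an open file object) returns a `Counter` keyed by hostname. A heavily skewed count would point at one node, while an even spread would suggest a cluster-wide cause.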