[jira] [Commented] (MAPREDUCE-2846) approx 10% of all tasks fail with DefaultTaskController

Allen Wittenauer (JIRA) Thu, 18 Aug 2011 15:00:55 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087331#comment-13087331 ]


Allen Wittenauer commented on MAPREDUCE-2846:
---------------------------------------------

Some relevant properties:

  <property>
    <name>mapred.local.dir</name>
    <value>/grid/a/mapred/local,/grid/b/mapred/local,/grid/c/mapred/local,/grid/d/mapred/local,/grid/e/mapred/local,/grid/f/mapred/local</value>
  </property>

  <property>
    <name>hadoop.job.history.user.location</name>
    <value>none</value>
    <final>true</final>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/grid/a/mapred/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
    <final>true</final>
  </property>

The permissions on these dirs are 775.  User and group match the user we run 
the tasktracker as.  (So, with DefaultTaskController, this should work just 
fine.)
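
A minimal sketch of how the permission claim above could be re-checked on a 
node, assuming it is run as the tasktracker user.  The function name 
check_local_dirs and its parameters are hypothetical and not part of Hadoop; 
it only splits a comma-separated mapred.local.dir value and stats each entry.

```python
import os
import stat


def check_local_dirs(value, expected_mode=0o775, expected_uid=None):
    """Return a list of (path, problem) pairs for directories that fail a check.

    value         -- comma-separated directory list, as in mapred.local.dir
    expected_mode -- permission bits each directory should have (assumed 775)
    expected_uid  -- numeric owner to require, or None to skip the owner check
    """
    problems = []
    for path in (p.strip() for p in value.split(",")):
        if not path:
            continue
        try:
            st = os.stat(path)
        except OSError as exc:
            problems.append((path, "cannot stat: %s" % exc.strerror))
            continue
        if not stat.S_ISDIR(st.st_mode):
            problems.append((path, "not a directory"))
            continue
        mode = stat.S_IMODE(st.st_mode)
        if mode != expected_mode:
            problems.append((path, "mode is %o, expected %o" % (mode, expected_mode)))
        if expected_uid is not None and st.st_uid != expected_uid:
            problems.append((path, "owner uid is %d, expected %d" % (st.st_uid, expected_uid)))
    return problems
```

Calling check_local_dirs with the mapred.local.dir value shown above and 
expected_uid=os.getuid() returns an empty list when every directory is 775 
and owned by the calling user.  It does not check group ownership or parent 
directories.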

Some other questions I've been asked over IM:

* Nodes can show failures with one run, be perfectly clean the next, then show 
failures during a third run.  Some nodes will throw failures during all three.
* This problem is reflected in both map tasks and reduce tasks.
* The dir permissions really are the same across all dirs and all nodes. :)
* I have not tried LinuxTaskController (LTC) because my test grid is not 
configured to support it yet.
* I've been testing the Apache releases with no custom patches other than 
including the LZO bits.
* The number of failures per run is wildly inconsistent.
* Running 0.20.203 on the same gear with the same config shows zero failures.  
So this is clearly a result of something added in 0.20.204.
* Yes, enough tasks have failed during certain runs that tasktrackers are 
getting blacklisted from the job.

I'm currently testing with a debug jar from Owen to try to gather more 
information.  Part of the problem is that the logs clearly don't carry enough 
information on why tasks are failing.  The tasktracker logs show the symlink 
error (but see MAPREDUCE-2804 on that).  The child error stack trace:

{code}
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of -1.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
{code}

is equally unhelpful.
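
Since the trace carries nothing beyond the exit status, about the only thing 
to pull out of it is how often each status occurs.  A minimal sketch, 
assuming the log text has already been collected into a string; the names 
EXIT_RE and tally_exit_statuses are hypothetical, and the pattern is taken 
from the message text in the trace above.

```python
import re
from collections import Counter

# Matches the IOException message shown in the child error stack trace.
EXIT_RE = re.compile(r"Task process exit with nonzero status of (-?\d+)")


def tally_exit_statuses(log_text):
    """Count occurrences of each nonzero exit status found in log_text."""
    return Counter(int(m.group(1)) for m in EXIT_RE.finditer(log_text))
```

Anything other than -1 in the resulting counts would at least be a new data 
point.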

> approx 10% of all tasks fail with DefaultTaskController
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-2846
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2846
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task, task-controller, tasktracker
>    Affects Versions: 0.20.204.0
>            Reporter: Allen Wittenauer
>            Priority: Blocker
>
> After upgrading our test 0.20.203 grid to 0.20.204-rc2, we ran terasort to 
> verify operation.  While the job completed successfully, approx 10% of the 
> tasks failed with task runner execution errors and the inability to create 
> symlinks for attempt logs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
