[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Roelofs updated MAPREDUCE-2041:
------------------------------------

    Attachment: MR-2041.v1.trunk-hadoop-mapreduce.patch

Patch that improves TaskRunner's error-checking.  This makes the failure 
mechanism more obvious but does not address the nondeterministic behavior of 
TestTrackerBlacklistAcrossJobs.  (A minor tweak - removing the "throw ie;" line 
- _does_ fix the test.  However, I'm assuming we don't want to ignore the 
failure to create job-acl.xml in the general case.)
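
For illustration only, here is a minimal sketch of the kind of check involved. It is not the code in the attached patch: the class LogDirSetup and the method ensureLogDir() are hypothetical names, and it uses plain java.io.File calls rather than TaskRunner's own helpers. The idea is to treat a false return from mkdirs() as fatal only if the directory still does not exist afterwards, and to check the result of each permission call instead of discarding it.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper for illustration; not part of TaskRunner or the patch. */
final class LogDirSetup {

    /**
     * Ensures that dir exists as a directory and has its permissions set,
     * throwing instead of merely warning when either step fails.
     */
    static void ensureLogDir(File dir) throws IOException {
        // mkdirs() returns false both on a real failure and when the
        // directory already exists (for example, created concurrently by
        // someone else), so re-check before treating it as an error.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Failed to create log directory " + dir);
        }
        // The java.io.File permission setters report failure through their
        // return value, which must be checked explicitly.
        if (!dir.setReadable(true, false)) {
            throw new IOException("Failed to make " + dir + " readable");
        }
        if (!dir.setWritable(true, true)) {
            throw new IOException("Failed to make " + dir + " writable");
        }
        if (!dir.setExecutable(true, false)) {
            throw new IOException("Failed to make " + dir + " searchable");
        }
    }
}
```

With a check like this, the failure surfaces where the directory is prepared, with the directory name in the message, rather than later as a FileNotFoundException when job-acl.xml is opened.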

> TaskRunner logDir race condition leads to crash on job-acl.xml creation
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2041
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.22.0
>         Environment: Linux/x86-64, 32-bit Java, NFS source tree
>            Reporter: Greg Roelofs
>         Attachments: MR-2041.v1.trunk-hadoop-mapreduce.patch
>
>
> TaskRunner's prepareLogFiles() logs a warning when mkdirs() fails but 
> otherwise ignores the failure, and it does not check the return value of 
> setPermissions() at all.  Either call can fail (e.g., on NFS, where there 
> appears to be a TOCTOU-style race, except with C = "creation" rather than 
> "check"), in which case the subsequent creation of job-acl.xml in 
> writeJobACLs() also fails, killing the task:
> {noformat}
> 2010-08-26 20:18:10,334 INFO  mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/127.0.0.1:35112: java.lang.Throwable: Child Error
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
> Caused by: java.io.FileNotFoundException: /home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
>     at java.io.FileOutputStream.open(Native Method)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
>     at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
>     at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
> {noformat}
> This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically.  The 
> job-acl.xml failure always seems to affect host2, and to do so more quickly 
> than the intentional exception on host1, so the wrong host is job-blacklisted 
> and the test's assertion fails.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
