TaskRunner logDir race condition leads to crash on job-acl.xml creation
-----------------------------------------------------------------------

                 Key: MAPREDUCE-2041
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task
    Affects Versions: 0.22.0
         Environment: Linux/x86-64, 32-bit Java, NFS source tree
            Reporter: Greg Roelofs


TaskRunner's prepareLogFiles() logs a warning when mkdirs() fails but otherwise 
ignores the failure, and it does not check the return value of setPermissions() 
at all.  Either call can fail (e.g., on NFS, where there appears to be a 
TOCTOU-style race, except with C = "creation"), in which case the subsequent 
creation of job-acl.xml in writeJobACLs() also fails, killing the task:

{noformat}
2010-08-26 20:18:10,334 INFO  mapred.TaskInProgress 
(TaskInProgress.java:updateStatus(591)) - Error from 
attempt_20100826201758813_0001_m_000001_0 on 
tracker_host2.rack.com:rh45-64/127.0.0.1:35112: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
Caused by: java.io.FileNotFoundException: 
/home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml
 (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
    at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
    at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
{noformat}

This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically; the 
job-acl.xml failure always seems to affect host2 - and to do so more quickly 
than the intentional exception on host1 - which triggers an assertion failure 
due to the wrong host being job-blacklisted.
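
One possible way to surface the failure where the directory is created, rather 
than later in writeJobACLs(), is sketched below.  This is only an illustration 
and not a proposed patch: the LogDirs class and ensureLogDir() method are 
hypothetical names, and plain java.io.File calls stand in for whatever 
TaskRunner actually uses to set permissions.  It does not remove the NFS race 
itself; it only turns a silently ignored return value into an IOException.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class LogDirs {
    private LogDirs() {}

    /**
     * Creates dir (and any missing parents) and makes it accessible to its
     * owner. Throws instead of warning, so a caller never goes on to create
     * files inside a directory that does not exist.
     */
    static void ensureLogDir(File dir) throws IOException {
        // mkdirs() returns false both on a real failure and when another
        // thread or process created the directory first, so re-check
        // isDirectory() before treating false as an error.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Could not create log directory " + dir);
        }
        // Assumption: java.io.File setters stand in for the real permission
        // step. The point is that every return value is checked.
        boolean ok = dir.setReadable(true, true)
                && dir.setWritable(true, true)
                && dir.setExecutable(true, true);
        if (!ok) {
            throw new IOException("Could not set permissions on " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        // Use a temporary directory in place of the real userlogs tree.
        File base = Files.createTempDirectory("userlogs").toFile();
        File attemptDir = new File(base, "job_0001/attempt_0001_m_000001_0");
        ensureLogDir(attemptDir);
        // Only after ensureLogDir() returns is it safe to create files such
        // as job-acl.xml inside the attempt directory.
        File acl = new File(attemptDir, "job-acl.xml");
        System.out.println("created " + acl.getName() + ": " + acl.createNewFile());
    }
}
```

With a check like this, a directory that cannot be created is reported at the 
point of creation with the directory name in the message, instead of showing 
up later as a FileNotFoundException on job-acl.xml.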

