[ https://issues.apache.org/jira/browse/MAPREDUCE-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Greg Roelofs updated MAPREDUCE-2041:
------------------------------------
Attachment: MR-2041.v1.trunk-hadoop-mapreduce.patch
This patch improves TaskRunner's error checking. It makes the failure
mechanism more obvious but does not address the nondeterministic behavior of
TestTrackerBlacklistAcrossJobs. (A minor tweak, removing the "throw ie;" line,
_does_ fix the test. However, I'm assuming we don't want to ignore the
failure to create job-acl.xml in the general case.)
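For illustration only, the kind of check involved might look like the sketch below. It is not the code in the attached patch: the class and method names (LogDirCheck, ensureDir) are hypothetical, and it uses plain java.io.File calls as a stand-in for whatever TaskRunner's setPermissions() does internally. The point is that mkdirs() returning false is not necessarily an error (another thread or process may have created the directory first), so the directory is re-checked before failing, and the permission calls' return values are checked rather than dropped.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not taken from the patch or from Hadoop.
final class LogDirCheck {

    /** Creates dir (and parents) if needed and fails loudly if that is impossible. */
    static void ensureDir(File dir) throws IOException {
        // mkdirs() returns false both on real failure and when the directory
        // already exists (e.g. a concurrent creator won the race), so only
        // treat it as an error if the directory is still missing afterwards.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Could not create log directory " + dir);
        }

        // Grant the owner read/write/execute and check every return value
        // instead of ignoring it. Non-short-circuit '&' runs all three calls.
        boolean ok = dir.setReadable(true, true)
                   & dir.setWritable(true, true)
                   & dir.setExecutable(true, true);
        if (!ok) {
            throw new IOException("Could not set permissions on " + dir);
        }
    }
}
```

With a check of this shape, a missing log directory is reported at the point of creation rather than surfacing later as a FileNotFoundException when job-acl.xml is opened.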
> TaskRunner logDir race condition leads to crash on job-acl.xml creation
> -----------------------------------------------------------------------
>
> Key: MAPREDUCE-2041
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: task
> Affects Versions: 0.22.0
> Environment: Linux/x86-64, 32-bit Java, NFS source tree
> Reporter: Greg Roelofs
> Attachments: MR-2041.v1.trunk-hadoop-mapreduce.patch
>
>
> TaskRunner's prepareLogFiles() logs a warning when mkdirs() fails but
> otherwise ignores the failure. It also never checks the return value of
> setPermissions(). Either one can fail (e.g., on NFS, where there appears to
> be a TOCTOU-style race, except with C = "creation"), in which case the
> subsequent creation of job-acl.xml in writeJobACLs() will also fail, killing
> the task:
> {noformat}
> 2010-08-26 20:18:10,334 INFO mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/127.0.0.1:35112: java.lang.Throwable: Child Error
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
> Caused by: java.io.FileNotFoundException: /home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
>     at java.io.FileOutputStream.open(Native Method)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
>     at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
>     at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
> {noformat}
> This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically: the
> job-acl.xml failure always seems to affect host2, and to do so more quickly
> than the intentional exception on host1, which triggers an assertion failure
> because the wrong host gets job-blacklisted.