[ https://issues.apache.org/jira/browse/MAPREDUCE-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904508#action_12904508 ]
Greg Roelofs commented on MAPREDUCE-2041:
-----------------------------------------
Just FYI, I ran TestTrackerBlacklistAcrossJobs 16 times without any failures on
a local (ext3 on md) filesystem on the same node as above. The nondeterminism
definitely seems to be associated either with non-guaranteed filesystem
semantics (i.e., bad assumptions in the test and/or MR code) or with network
timing and asynchronous function calls (which I guess also devolves to bad
assumptions in the test and/or MR code). Given that NFS is normally relevant
only for development and personal runs of "ant test" (and frequently not even
there), this doesn't seem like a critical problem.
If anyone ever wants to track down other NFS-related failures, however, this
might be a useful test case to get started.
> TaskRunner logDir race condition leads to crash on job-acl.xml creation
> -----------------------------------------------------------------------
>
> Key: MAPREDUCE-2041
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: task
> Affects Versions: 0.22.0
> Environment: Linux/x86-64, 32-bit Java, NFS source tree
> Reporter: Greg Roelofs
> Attachments: MR-2041.v1.trunk-hadoop-mapreduce.patch
>
>
> TaskRunner's prepareLogFiles() warns on mkdirs() failures but ignores them.
> It also fails even to check the return value of setPermissions(). Either one
> can fail (e.g., on NFS, where there appears to be a TOCTOU-style race, except
> with C = "creation"), in which case the subsequent creation of job-acl.xml in
> writeJobACLs() will also fail, killing the task:
> {noformat}
> 2010-08-26 20:18:10,334 INFO mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/127.0.0.1:35112: java.lang.Throwable: Child Error
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
> Caused by: java.io.FileNotFoundException: /home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
>     at java.io.FileOutputStream.open(Native Method)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
>     at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
>     at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
> {noformat}
> This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically; the
> job-acl.xml failure always seems to affect host2 - and to do so more quickly
> than the intentional exception on host1 - which triggers an assertion failure
> due to the wrong host being job-blacklisted.
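
For illustration only, the failure mode described in the quoted issue (an ignored mkdirs() failure and an unchecked permissions call, followed by a FileNotFoundException when job-acl.xml is written) could be guarded against roughly as sketched below. This is not the attached patch and not the actual TaskRunner code: the class LogDirSketch and the methods ensureLogDir() and writeAclFile() are hypothetical names, and plain java.io.File calls stand in for whatever Hadoop uses internally to set permissions.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: fail fast when the log directory
// cannot be prepared, instead of warning and continuing.
class LogDirSketch {

    // Creates dir (and any missing parents) or throws IOException.
    static void ensureLogDir(File dir) throws IOException {
        // mkdirs() returns false both on a real failure and when the
        // directory already exists (e.g. another thread created it first),
        // so re-check isDirectory() before treating false as an error.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Could not create log directory " + dir);
        }
        // The java.io.File permission setters report failure via their
        // boolean return value, so that value has to be checked explicitly.
        boolean ok = dir.setReadable(true, true)
                && dir.setWritable(true, true)
                && dir.setExecutable(true, true);
        if (!ok) {
            throw new IOException("Could not set permissions on " + dir);
        }
    }

    // Writes job-acl.xml only after the directory is known to exist.
    static File writeAclFile(File dir, String contents) throws IOException {
        ensureLogDir(dir);
        File acl = new File(dir, "job-acl.xml");
        try (OutputStream out = new FileOutputStream(acl)) {
            out.write(contents.getBytes(StandardCharsets.UTF_8));
        }
        return acl;
    }
}
```

With a check like this, a directory that cannot be created surfaces as an IOException naming the directory rather than as a later FileNotFoundException on job-acl.xml. It would not by itself address any NFS visibility delay between creation and use; that would presumably need a retry or a different fix.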
--
This message is automatically generated by JIRA.