[ http://issues.apache.org/jira/browse/HADOOP-277?page=comments#action_12414901 ]
Owen O'Malley commented on HADOOP-277:
--------------------------------------

By the way, if there is a need for the sync block around the mkdirs, we should go ahead and change the function above it so that next week we don't get a third bug. *smile*

> Race condition in Configuration.getLocalPath()
> ----------------------------------------------
>
>          Key: HADOOP-277
>          URL: http://issues.apache.org/jira/browse/HADOOP-277
>      Project: Hadoop
>         Type: Bug
>  Environment: linux, 64 bit, dual core, 4x400GB disk, 4GB RAM
>     Reporter: paul sutter
>  Attachments: hadoop-277.patch, hadoop-task_1_r_9.log, mkdirs.patch
>
> (attached: a patch to fix the problem, and a logfile showing the problem
> occurring twice)
>
> There is a race condition in Configuration.java:
>
>     Path file = new Path(dirs[index], path);
>     Path dir = file.getParent();
>     if (fs.exists(dir) || fs.mkdirs(dir)) {
>       return file;
>
> If two threads simultaneously process this code with the same target
> directory, fs.exists() will return false for both, but only one of the
> two threads will get true back from fs.mkdirs(). From the Java documentation:
>
> "returns: true if and only if the directory was created, along with all
> necessary parent directories; false otherwise"
>
> That is, if the first thread successfully creates the directory, the second
> will not, and its mkdirs() call therefore returns false even though the
> directory exists.
>
> This was really happening. We use four temporary directories, and we had
> reducers failing all over the place with bizarre, seemingly impossible
> errors. I modified the ReduceTaskRunner to log the filename it creates in
> order to find the problem; the log output is below.
>
> Here you can see copies initiated simultaneously for two files that hash to
> the same temp directory. map_4.out is created in the correct directory
> (/data2...), but map_15.out is created in the next directory (/data3...)
> because of this race condition.
> Minutes later, when the appender tries to locate the file, the race
> condition does not occur (the directory already exists), so the appender
> looks for map_15.out in the correct directory, where it does not exist.
>
> 060605 142414 task_0001_r_000009_1 Copying task_0001_m_000004_0 output from rmr05.
> 060605 142414 task_0001_r_000009_1 Copying task_0001_m_000015_0 output from rmr04.
> ...
> 060605 142416 task_0001_r_000009_1 done copying task_0001_m_000004_0 output from rmr05 into /data2/tmp/mapred/local/task_0001_r_000009_1/map_4.out
> ...
> 060605 142418 task_0001_r_000009_1 done copying task_0001_m_000015_0 output from rmr04 into /data3/tmp/mapred/local/task_0001_r_000009_1/map_15.out
> ...
> 060605 142531 task_0001_r_000009_1 0.31808624% reduce > append > /data2/tmp/mapred/local/task_0001_r_000009_1/map_4.out
> ...
> 060605 142725 task_0001_r_000009_1 java.io.FileNotFoundException: /data2/tmp/mapred/local/task_0001_r_000009_1/map_15.out

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
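The bug in the report is a check-then-act race: exists() and mkdirs() are two separate steps, and a thread that loses the mkdirs() race reports failure even though the directory is now there. The following is a minimal sketch of two ways around it, not the contents of hadoop-277.patch or mkdirs.patch. It assumes plain java.io.File instead of Hadoop's FileSystem/Path API so that it runs without Hadoop. The class and method names (LocalDirRace, ensureDirBuggy, ensureDir, ensureDirSynchronized, countSuccesses) are hypothetical. ensureDir re-checks for the directory after a failed mkdirs(). ensureDirSynchronized takes the "sync block around the mkdirs" approach from the comment.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalDirRace {

    // Pattern from the report (with java.io.File standing in for Hadoop's fs):
    // a lost mkdirs() race is reported as failure although the dir now exists.
    static boolean ensureDirBuggy(File dir) {
        return dir.exists() || dir.mkdirs();
    }

    // Race-tolerant variant: if mkdirs() returns false, re-check whether the
    // directory exists, since another thread or process may have created it.
    static boolean ensureDir(File dir) {
        return dir.mkdirs() || dir.isDirectory();
    }

    // Alternative in the spirit of the "sync block" comment: serialize the
    // check-then-create step. This only protects threads inside one JVM.
    private static final Object MKDIRS_LOCK = new Object();

    static boolean ensureDirSynchronized(File dir) {
        synchronized (MKDIRS_LOCK) {
            return dir.exists() || dir.mkdirs();
        }
    }

    // Releases `threads` tasks at the same instant, all calling ensureDir()
    // on the same target, and returns how many of them reported success.
    static int countSuccesses(File target, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        List<Future<Boolean>> results = new ArrayList<>();
        try {
            for (int i = 0; i < threads; i++) {
                Callable<Boolean> task = () -> {
                    start.await(); // wait so all threads race together
                    return ensureDir(target);
                };
                results.add(pool.submit(task));
            }
            start.countDown();
            int ok = 0;
            for (Future<Boolean> f : results) {
                if (f.get()) {
                    ok++;
                }
            }
            return ok;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical local directory, loosely modeled on the paths in the log.
        File base = new File(System.getProperty("java.io.tmpdir"),
                "hadoop277-" + System.nanoTime());
        File target = new File(base, "task_0001_r_000009_1");
        int threads = 8;
        int ok = countSuccesses(target, threads);
        System.out.println(ok + "/" + threads + " threads saw the directory");
    }
}
```

With ensureDir, every thread should see the directory. If ensureDirBuggy is swapped into countSuccesses, it may intermittently reproduce the false result from the report, but because that depends on timing it is not a reliable test. The re-check approach also tolerates other processes creating the same directory. The synchronized variant only serializes threads within a single JVM.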
