[ http://issues.apache.org/jira/browse/HADOOP-723?page=all ]
Owen O'Malley updated HADOOP-723:
---------------------------------
Attachment: map-out-lost.patch
This patch:
1. Gives each map copier thread a unique id.
2. Uses the id to generate unique names for the fetched files.
3. Adds timeouts for the connect/read in getFile.
4. Adds checks for thread interruption in getFile.
5. Adds exception handlers inside the while loops of the map output copier
threads and the map output lease thread.
6. Adds synchronization around the renames of the temporary files so that
the renames cannot race with one another.
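
Items 1 to 4 and 6 above might look roughly like the following Java sketch.
It is not the code from map-out-lost.patch: CopierSketch, tempFileFor,
openWithTimeouts, RENAME_LOCK and the ".tmp" naming are hypothetical names
chosen for illustration. The only real APIs used are the JDK's
URLConnection.setConnectTimeout/setReadTimeout, Thread.isInterrupted and
File.renameTo.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names come from the actual patch.
class CopierSketch {
  private static final AtomicInteger NEXT_ID = new AtomicInteger(0);
  private static final Object RENAME_LOCK = new Object();

  // Item 1: every copier gets its own id.
  final int id = NEXT_ID.getAndIncrement();

  // Item 2: the id is part of the temp file name, so two copiers fetching
  // the same map output never write to (or delete) the same temp file.
  File tempFileFor(File finalFile) {
    return new File(finalFile.getParentFile(),
                    finalFile.getName() + "-" + id + ".tmp");
  }

  // Item 3: bound both the connect and each read so a copier cannot block
  // indefinitely. Timeouts are in milliseconds.
  static URLConnection openWithTimeouts(URL url, int connectMs, int readMs)
      throws IOException {
    URLConnection conn = url.openConnection();
    conn.setConnectTimeout(connectMs);
    conn.setReadTimeout(readMs);
    return conn;
  }

  // Items 4 and 6: check for interruption while copying, then rename the
  // finished temp file into place under a shared lock.
  long copy(InputStream in, File finalFile) throws IOException {
    File tmp = tempFileFor(finalFile);
    long total = 0;
    boolean done = false;
    try {
      OutputStream out = new FileOutputStream(tmp);
      try {
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
          if (Thread.currentThread().isInterrupted()) {
            throw new InterruptedIOException("copier " + id + " interrupted");
          }
          out.write(buf, 0, n);
          total += n;
        }
      } finally {
        out.close();
      }
      synchronized (RENAME_LOCK) {
        if (!tmp.renameTo(finalFile)) {
          throw new IOException("rename failed: " + tmp + " -> " + finalFile);
        }
      }
      done = true;
      return total;
    } finally {
      if (!done) {
        tmp.delete(); // only ever removes this copier's own temp file
      }
    }
  }
}
```

With this shape, an interrupted or timed-out copier can only clean up its own
temporary file; the final file name is touched solely inside the synchronized
rename.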
We might consider putting this one into a 0.8.1 release.
> Race condition exists in the method MapOutputLocation.getFile
> -------------------------------------------------------------
>
> Key: HADOOP-723
> URL: http://issues.apache.org/jira/browse/HADOOP-723
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.8.0
> Reporter: Devaraj Das
> Assigned To: Owen O'Malley
> Fix For: 0.9.0
>
> Attachments: map-out-lost.patch
>
>
> There seems to be a race condition in the way the Reduces copy the map output
> files from the Maps. Suppose a copier is blocked in the connect call (at the
> beginning of the method MapOutputLocation.getFile) to the Jetty server on a
> Map, and the MapCopyLeaseChecker detects that the copier has been idle for
> too long. The checker issues an interrupt (read 'kill') to this thread and
> creates a new Copier thread. However, the old copier, still blocked trying to
> connect to Jetty, doesn't actually get killed until the connect timeout
> expires. As soon as the connect returns (with an IOException), the old copier
> deletes the map output file, which could by then have been (successfully)
> created by the new Copier thread. This leads to the Sort phase for that
> reducer failing with a FileNotFoundException.
> One simple way to fix this is to not delete the file if the file was not
> created within this getFile method.
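
The simple fix suggested in the description (delete only what this call
created) could be sketched as below. This is an illustration under stated
assumptions, not the MapOutputLocation.getFile code: OwnedCleanupSketch and
the Source interface are hypothetical, with Source standing in for the HTTP
connect to the map's Jetty server. Returning early when the file already
exists is a simplification made for the sketch.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch; names are illustrative only.
class OwnedCleanupSketch {
  /** Stand-in for the connect step, which may fail long after an interrupt. */
  interface Source {
    InputStream open() throws IOException;
  }

  static void getFile(Source source, File target) throws IOException {
    boolean createdHere = false;
    try {
      InputStream in = source.open(); // a failure here leaves createdHere false
      try {
        // createNewFile() returns true only if this call created the file.
        createdHere = target.createNewFile();
        if (!createdHere) {
          return; // another copier produced the file; leave it alone
        }
        OutputStream out = new FileOutputStream(target);
        try {
          byte[] buf = new byte[4096];
          int n;
          while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
          }
        } finally {
          out.close();
        }
      } finally {
        in.close();
      }
    } catch (IOException e) {
      if (createdHere) {
        target.delete(); // clean up only a file this call created
      }
      throw e;
    }
  }
}
```

In the scenario described above, the stale copier fails in open(), so
createdHere is still false and the file written by the new copier survives.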