[
https://issues.apache.org/jira/browse/HADOOP-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072432#comment-14072432
]
Allen Wittenauer commented on HADOOP-6208:
------------------------------------------
Yes, an update would be good.
I suspect this is actually close-able now.
> Block loss in S3FS due to S3 inconsistency on file rename
> ---------------------------------------------------------
>
> Key: HADOOP-6208
> URL: https://issues.apache.org/jira/browse/HADOOP-6208
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 0.20.0, 0.20.1
> Environment: Ubuntu Linux 8.04 on EC2, Mac OS X 10.5, likely to
> affect any Hadoop environment
> Reporter: Bradley Buda
> Assignee: Bradley Buda
> Attachments: HADOOP-6208.patch, S3FSConsistencyPollingTest.java,
> S3FSConsistencyTest.java
>
>
> Under certain S3 consistency scenarios, Hadoop's S3FileSystem can 'truncate'
> files, especially when writing reduce outputs. We've noticed this at
> tracksimple, where we use the S3FS as the direct input and output of our
> MapReduce jobs. The symptom is a file whose length is an exact multiple of
> the FS block size - exactly 32MB, 64MB, 96MB, etc.
> The issue appears to be caused by renaming a recently written file and
> getting a stale INode read from S3. When a reducer writes job output to the
> S3FS, the normal series of S3 key writes for a 3-block file looks something
> like this (a rough code sketch of the same sequence follows the two lists
> below):
> Task Output:
> 1) Write the first block (block_99)
> 2) Write an INode
> (/myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz)
> containing [block_99]
> 3) Write the second block (block_81)
> 4) Rewrite the INode with new contents [block_99, block_81]
> 5) Write the last block (block_-101)
> 6) Rewrite the INode with the final contents [block_99, block_81, block_-101]
> Copy Output to Final Location (ReduceTask#copyOutput):
> 1) Read the INode contents from
> /myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz, which
> gives [block_99, block_81, block_-101]
> 2) Write the data from #1 to the final location, /myjob/part-00133.gz
> 3) Delete the old INode
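> The same sequence as a rough Java-style sketch (writeBlock, storeINode,
> retrieveINode and deleteINode below are hypothetical stand-ins for the
> Jets3tFileSystemStore calls, not the exact signatures):
>
>   // Task output: every flush rewrites the temporary INode key with the
>   // block list written so far.
>   List<Block> blocks = new ArrayList<Block>();
>   blocks.add(writeBlock(firstBlockData));       // block_99
>   storeINode(tempPath, new INode(blocks));      // [block_99]
>   blocks.add(writeBlock(secondBlockData));      // block_81
>   storeINode(tempPath, new INode(blocks));      // [block_99, block_81]
>   blocks.add(writeBlock(lastBlockData));        // block_-101
>   storeINode(tempPath, new INode(blocks));      // [block_99, block_81, block_-101]
>
>   // Copy to the final location: a stale read here is the failure point.
>   INode inode = retrieveINode(tempPath);        // may still return [block_99, block_81]
>   storeINode(finalPath, inode);                 // final file silently drops block_-101
>   deleteINode(tempPath);                        // last reference to block_-101 is gone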
> The output file is truncated if S3 serves a stale copy of the temporary
> INode. In copyOutput, step 1 above, it is possible for S3 to return a
> version of the temporary INode that contains just [block_99, block_81]. In
> this case, we write this new data to the final output location, and 'lose'
> block_-101 in the process. Since we then delete the temporary INode, we've
> lost all references to the final block of this file and it's orphaned in the
> S3 bucket.
> This type of consistency error is infrequent but not impossible. We've
> observed these failures about once a week for one of our large jobs, which
> runs daily and has 200 reduce outputs; that is roughly one failure per
> 7 x 200 = 1,400 reduce attempts, or an error rate of about 0.07% per reduce.
> These kinds of errors are generally difficult to handle in a system like S3.
> We have a few ideas about how to fix this:
> 1) HACK! Sleep during S3OutputStream#close or #flush to wait for S3 to catch
> up, making these failures less likely.
> 2) Poll for updated MD5 or INode data in Jets3tFileSystemStore#storeINode
> until S3 says the INode contents are the same as our local copy (see the
> sketch after this list). This could be a config option -
> "fs.s3.verifyInodeWrites" or something like that.
> 3) Cache INode contents in-process, so we don't have to go back to S3 to ask
> for the current version of an INode.
> 4) Only write INodes once, when the output stream is closed. This would
> basically make S3OutputStream#flush() a no-op.
> 5) Modify the S3FS to somehow version INodes (it's unclear how we would do
> this; some design work is needed).
> 6) Avoid using the S3FS for temporary task attempt files.
> 7) Avoid using the S3FS completely.
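> To make option 2 concrete, a rough sketch of a verified INode write
> (illustrative only - the retry constants, the simple length comparison and
> the surrounding store/conf fields are assumptions, not a patch):
>
>   // After writing an INode, optionally read it back until S3 returns the
>   // same block list we just wrote, or give up after a bounded number of
>   // tries.
>   private void storeINodeVerified(Path path, INode inode) throws IOException {
>     store.storeINode(path, inode);
>     if (!conf.getBoolean("fs.s3.verifyInodeWrites", false)) {
>       return;
>     }
>     for (int attempt = 0; attempt < MAX_VERIFY_RETRIES; attempt++) {
>       INode readBack = store.retrieveINode(path);
>       if (readBack != null
>           && readBack.getBlocks().length == inode.getBlocks().length) {
>         return;   // crude check; a real patch would compare block ids or MD5s
>       }
>       try {
>         Thread.sleep(VERIFY_POLL_MILLIS);
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         return;
>       }
>     }
>     throw new IOException("INode for " + path + " still stale after "
>         + MAX_VERIFY_RETRIES + " reads");
>   }
>
> Bounding the retries keeps a permanently stale read from hanging the task;
> hitting the bound turns a silent truncation into a visible failure.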
> We wanted to get some guidance from the community before we went down any of
> these paths. Has anyone seen this issue? Any other suggested workarounds?
> We at tracksimple are willing to invest some time in fixing this and (of
> course) contributing our fix back, but we wanted to get an 'ack' from others
> before we try anything crazy :-).
> I've attached a test app if anyone wants to try to reproduce this
> themselves. It takes a while to run (depending on the 'weather' in S3 right
> now), but it should eventually detect a consistency 'error' that manifests
> itself as a truncated file.
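> Not the attached S3FSConsistencyTest.java, but a minimal self-contained
> illustration of the same kind of check (the bucket URI and data sizes are
> placeholders, and AWS credentials are assumed to be configured): write a
> file that spans several blocks, rename it right away, and compare lengths.
> Running it in a loop gives the rare race a chance to show up.
>
>   import java.net.URI;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class RenameTruncationCheck {
>     public static void main(String[] args) throws Exception {
>       // s3:// resolves to the block-based S3FileSystem discussed here.
>       FileSystem fs = FileSystem.get(URI.create("s3://your-test-bucket/"),
>                                      new Configuration());
>       Path tmp = new Path("/consistency-check/_tmp-part");
>       Path dst = new Path("/consistency-check/part-00000");
>
>       // Write enough data to span several S3FS blocks, so the INode is
>       // rewritten several times before the stream is closed.
>       byte[] chunk = new byte[1 << 20];          // 1 MB per write
>       long expected = 0;
>       FSDataOutputStream out = fs.create(tmp, true);
>       for (int i = 0; i < 100; i++) {            // ~100 MB total
>         out.write(chunk);
>         expected += chunk.length;
>       }
>       out.close();
>
>       // Rename immediately; a stale INode read here is what truncates the file.
>       fs.rename(tmp, dst);
>       long actual = fs.getFileStatus(dst).getLen();
>       if (actual != expected) {
>         System.err.println("Truncated: expected " + expected + ", got " + actual);
>       } else {
>         System.out.println("No truncation this run; the race is rare.");
>       }
>       fs.delete(dst, true);
>     }
>   }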
--
This message was sent by Atlassian JIRA
(v6.2#6252)