[
https://issues.apache.org/jira/browse/HADOOP-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072432#comment-14072432
]
Allen Wittenauer commented on HADOOP-6208:
------------------------------------------
Yes, an update would be good.
I suspect this is actually close-able now.
> Block loss in S3FS due to S3 inconsistency on file rename
> ---------------------------------------------------------
>
> Key: HADOOP-6208
> URL: https://issues.apache.org/jira/browse/HADOOP-6208
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/s3
> Affects Versions: 0.20.0, 0.20.1
> Environment: Ubuntu Linux 8.04 on EC2, Mac OS X 10.5, likely to
> affect any Hadoop environment
> Reporter: Bradley Buda
> Assignee: Bradley Buda
> Attachments: HADOOP-6208.patch, S3FSConsistencyPollingTest.java,
> S3FSConsistencyTest.java
>
>
> Under certain S3 consistency scenarios, Hadoop's S3FileSystem can 'truncate'
> files, especially when writing reduce outputs. We've noticed this at
> tracksimple, where we use the S3FS as the direct input and output of our
> MapReduce jobs. The symptom is a file whose length is an exact multiple of
> the FS block size - exactly 32MB, 64MB, 96MB, etc.
> The issue appears to be caused by renaming a recently written file and
> getting a stale INode read from S3. When a reducer writes job output to the
> S3FS, the normal series of S3 key writes for a 3-block file looks something
> like this (a rough code sketch of the same sequence follows the two lists
> below):
> Task Output:
> 1) Write the first block (block_99)
> 2) Write an INode
> (/myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz)
> containing [block_99]
> 3) Write the second block (block_81)
> 4) Rewrite the INode with new contents [block_99, block_81]
> 5) Write the last block (block_-101)
> 6) Rewrite the INode with the final contents [block_99, block_81, block_-101]
> Copy Output to Final Location (ReduceTask#copyOutput):
> 1) Read the INode contents from
> /myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz, which
> gives [block_99, block_81, block_-101]
> 2) Write the data from #1 to the final location, /myjob/part-00133.gz
> 3) Delete the old INode
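> The same sequence as a rough Java-style sketch (writeBlock, storeINode,
> retrieveINode and deleteINode below are hypothetical stand-ins for the
> Jets3tFileSystemStore calls, not the exact signatures):
>
>   // Task output: every flush rewrites the temporary INode key with the
>   // block list written so far.
>   List<Block> blocks = new ArrayList<Block>();
>   blocks.add(writeBlock(firstBlockData));       // block_99
>   storeINode(tempPath, new INode(blocks));      // [block_99]
>   blocks.add(writeBlock(secondBlockData));      // block_81
>   storeINode(tempPath, new INode(blocks));      // [block_99, block_81]
>   blocks.add(writeBlock(lastBlockData));        // block_-101
>   storeINode(tempPath, new INode(blocks));      // [block_99, block_81, block_-101]
>
>   // Copy to the final location: a stale read here is the failure point.
>   INode inode = retrieveINode(tempPath);        // may still return [block_99, block_81]
>   storeINode(finalPath, inode);                 // final file silently drops block_-101
>   deleteINode(tempPath);                        // last reference to block_-101 is gone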
> The output file is truncated if S3 serves a stale copy of the temporary
> INode. In copyOutput, step 1 above, it is possible for S3 to return a
> version of the temporary INode that contains just [block_99, block_81]. In
> this case, we write this new data to the final output location, and 'lose'
> block_-101 in the process. Since we then delete the temporary INode, we've
> lost all references to the final block of this file and it's orphaned in the
> S3 bucket.
> This type of consistency error is infrequent but not impossible. We've
> observed these failures about once a week for one of our large jobs, which
> runs daily and has 200 reduce outputs; that is roughly one failure per
> 7 x 200 = 1,400 reduce attempts, or an error rate of about 0.07% per reduce.
> These kinds of errors are generally difficult to handle in a system like S3.
> We have a few ideas about how to fix this:
> 1) HACK! Sleep during S3OutputStream#close or #flush to wait for S3 to catch
> up, making these failures less likely.
> 2) Poll for updated MD5 or INode data in Jets3tFileSystemStore#storeINode
> until S3 says the INode contents are the same as our local copy (see the
> sketch after this list). This could be a config option -
> "fs.s3.verifyInodeWrites" or something like that.
> 3) Cache INode contents in-process, so we don't have to go back to S3 to ask
> for the current version of an INode.
> 4) Only write INodes once, when the output stream is closed. This would
> basically make S3OutputStream#flush() a no-op.
> 5) Modify the S3FS to somehow version INodes (it's unclear how we would do
> this; some design work is needed).
> 6) Avoid using the S3FS for temporary task attempt files.
> 7) Avoid using the S3FS completely.
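> To make option 2 concrete, a rough sketch of a verified INode write
> (illustrative only - the retry constants, the simple length comparison and
> the surrounding store/conf fields are assumptions, not a patch):
>
>   // After writing an INode, optionally read it back until S3 returns the
>   // same block list we just wrote, or give up after a bounded number of
>   // tries.
>   private void storeINodeVerified(Path path, INode inode) throws IOException {
>     store.storeINode(path, inode);
>     if (!conf.getBoolean("fs.s3.verifyInodeWrites", false)) {
>       return;
>     }
>     for (int attempt = 0; attempt < MAX_VERIFY_RETRIES; attempt++) {
>       INode readBack = store.retrieveINode(path);
>       if (readBack != null
>           && readBack.getBlocks().length == inode.getBlocks().length) {
>         return;   // crude check; a real patch would compare block ids or MD5s
>       }
>       try {
>         Thread.sleep(VERIFY_POLL_MILLIS);
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         return;
>       }
>     }
>     throw new IOException("INode for " + path + " still stale after "
>         + MAX_VERIFY_RETRIES + " reads");
>   }
>
> Bounding the retries keeps a permanently stale read from hanging the task;
> hitting the bound turns a silent truncation into a visible failure.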
> We wanted to get some guidance from the community before we went down any of
> these paths. Has anyone seen this issue? Any other suggested workarounds?
> We at tracksimple are willing to invest some time in fixing this and (of
> course) contributing our fix back, but we wanted to get an 'ack' from others
> before we try anything crazy :-).
> I've attached a test app if anyone wants to try to reproduce this
> themselves. It takes a while to run (depending on the 'weather' in S3 right
> now), but it should eventually detect a consistency 'error' that manifests
> itself as a truncated file.
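> Not the attached S3FSConsistencyTest.java, but a minimal self-contained
> illustration of the same kind of check (the bucket URI and data sizes are
> placeholders, and AWS credentials are assumed to be configured): write a
> file that spans several blocks, rename it right away, and compare lengths.
> Running it in a loop gives the rare race a chance to show up.
>
>   import java.net.URI;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class RenameTruncationCheck {
>     public static void main(String[] args) throws Exception {
>       // s3:// resolves to the block-based S3FileSystem discussed here.
>       FileSystem fs = FileSystem.get(URI.create("s3://your-test-bucket/"),
>                                      new Configuration());
>       Path tmp = new Path("/consistency-check/_tmp-part");
>       Path dst = new Path("/consistency-check/part-00000");
>
>       // Write enough data to span several S3FS blocks, so the INode is
>       // rewritten several times before the stream is closed.
>       byte[] chunk = new byte[1 << 20];          // 1 MB per write
>       long expected = 0;
>       FSDataOutputStream out = fs.create(tmp, true);
>       for (int i = 0; i < 100; i++) {            // ~100 MB total
>         out.write(chunk);
>         expected += chunk.length;
>       }
>       out.close();
>
>       // Rename immediately; a stale INode read here is what truncates the file.
>       fs.rename(tmp, dst);
>       long actual = fs.getFileStatus(dst).getLen();
>       if (actual != expected) {
>         System.err.println("Truncated: expected " + expected + ", got " + actual);
>       } else {
>         System.out.println("No truncation this run; the race is rare.");
>       }
>       fs.delete(dst, true);
>     }
>   }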
--
This message was sent by Atlassian JIRA
(v6.2#6252)