[ 
https://issues.apache.org/jira/browse/HADOOP-6208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747591#action_12747591
 ] 

Raghu Angadi commented on HADOOP-6208:
--------------------------------------

This is the first time I am looking at S3FS. It is essentially a thin client 
managed FS layer over S3.. it would pretty hard to avoid such inconsistencies 
especially S3 itself can return stale info. isn't there an option for S3 client 
to get the latest copy (thinking Dynamo-like key-value reads)?

Between options (1) to (5) : (2) is probably the best. It is a bit analogous to 
what HDFS does : when close() completes successfully, it implies all the blocks 
have at least one replica reported.

Regd (6)-(7) :  a loosely managed FS will always such issues. I am not sure how 
much is S3FS is used in production, I would be curious to know.

For those familiar with HDFS: S3FS is a broadly similar where all the metadata 
is manipulated by S3FS client rather than a central server (e.g. rename moves 
metadata for each file under the tree to a new location). While writing, each 
block is written locally and stored as an S3Object when full.

> Block loss in S3FS due to S3 inconsistency on file rename
> ---------------------------------------------------------
>
>                 Key: HADOOP-6208
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6208
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs/s3
>    Affects Versions: 0.20.0
>         Environment: Ubuntu Linux 8.04 on EC2, Mac OS X 10.5, likely to 
> affect any Hadoop environment
>            Reporter: Bradley Buda
>         Attachments: S3FSConsistencyTest.java
>
>
> Under certain S3 consistency scenarios, Hadoop's S3FileSystem can 'truncate' 
> files, especially when writing reduce outputs.  We've noticed this at 
> tracksimple where we use the S3FS as the direct input and output of our 
> MapReduce jobs.  The symptom of this problem is a file in the filesystem that 
> is an exact multiple of the FS block size - exactly 32MB, 64MB, 96MB, etc. in 
> length.
> The issue appears to be caused by renaming a file that has recently been 
> written, and getting a stale INode read from S3.  When a reducer is writing 
> job output to the S3FS, the normal series of S3 key writes for a 3-block file 
> looks something like this:
> Task Output:
> 1) Write the first block (block_99)
> 2) Write an INode 
> (/myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz) 
> containing [block_99]
> 3) Write the second block (block_81)
> 4) Rewrite the INode with new contents [block_99, block_81]
> 5) Write the last block (block_-101)
> 6) Rewrite the INode with the final contents [block_99, block_81, block_-101]
> Copy Output to Final Location (ReduceTask#copyOutput):
> 1) Read the INode contents from 
> /myjob/_temporary/_attempt_200907142159_0306_r_000133_0/part-00133.gz, which 
> gives [block_99, block_81, block_-101]
> 2) Write the data from #1 to the final location, /myjob/part-00133.gz
> 3) Delete the old INode 
> The output file is truncated if S3 serves a stale copy of the temporary 
> INode.  In copyOutput, step 1 above, it is possible for S3 to return a 
> version of the temporary INode that contains just [block_99, block_81].  In 
> this case, we write this new data to the final output location, and 'lose' 
> block_-101 in the process.  Since we then delete the temporary INode, we've 
> lost all references to the final block of this file and it's orphaned in the 
> S3 bucket.
> This type of consistency error is infrequent but not impossible. We've 
> observed these failures about once a week for one of our large jobs which 
> runs daily and has 200 reduce outputs; so we're seeing an error rate of 
> something like 0.07% per reduce.
> These kind of errors are generally difficult to handle in a system like S3.  
> We have a few ideas about how to fix this:
> 1) HACK! Sleep during S3OutputStream#close or #flush to wait for S3 to catch 
> up and make these less likely.
> 2) Poll for updated MD5 or INode data in Jets3tFileSystemStore#storeINode 
> until S3 says the INode contents are the same as our local copy.  This could 
> be a config option - "fs.s3.verifyInodeWrites" or something like that.
> 3) Cache INode contents in-process, so we don't have to go back to S3 to ask 
> for the current version of an INode.
> 4) Only write INodes once, when the output stream is closed.  This would 
> basically make S3OutputStream#flush() a no-op.
> 5) Modify the S3FS to somehow version INodes (unclear how we would do this, 
> need some design work).
> 6) Avoid using the S3FS for temporary task attempt files.
> 7) Avoid using the S3FS completely.
> We wanted to get some guidance from the community before we went down any of 
> these paths.  Has anyone seen this issue?  Any other suggested workarounds?  
> We at tracksimple are willing to invest some time in fixing this and (of 
> course) contributing our fix back, but we wanted to get an 'ack' from others 
> before we try anything crazy :-).
> I've attached a test app if anyone wants to try and reproduce this 
> themselves.  It takes a while to run (depending on the 'weather' in S3 right 
> now), but should eventually detect a consistency 'error' that manifests 
> itself as a truncated file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to