[ 
https://issues.apache.org/jira/browse/HADOOP-16085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820336#comment-16820336
 ] 

Ben Roling commented on HADOOP-16085:
-------------------------------------

bq. Have been discussing with Steve Loughran whether or not the FileStatus -> 
S3AFileStatus and schema changes should be separated out from the enforcement. 
I think the best argument for that is that it's a smaller change to get older 
clients to notify newer clients of changes whereas only the newer ones will 
enforce. The other factor mentioned is the desire for keeping S3Guard 
relatively storage-agnostic, but I honestly just don't see how we can do that 
and still have a robust solution. S3 is popular enough to warrant a custom 
solution that really does fix all the holes. Personally, I think we should just 
keep this change together.

I'm going to leave this one alone.  If you guys feel strongly about it let me 
know.

bq. I don't suppose there's an interface we can rely on to provide getETag() 
and getVersionId(), is there? This is where Go's duck-typing would be nice so 
we could eliminate 2 (or more) of the args to every constructor call.

I'm not aware of one.  I'll leave it alone unless you guys want me to create a 
class to package those two attributes (suggested names welcome if so).  There 
are definitely times I wish I were working in an environment with duck-typing.
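
For illustration only, a minimal sketch of what such a holder class might look 
like.  The name ObjectAttributes is hypothetical and is not an existing Hadoop 
type or part of the patch; the sketch assumes both values are plain strings 
that may be null.

```java
import java.util.Objects;

/** Hypothetical holder for the two S3 attributes discussed above. */
final class ObjectAttributes {
  private final String eTag;       // may be null if the eTag is unknown
  private final String versionId;  // may be null if bucket versioning is off

  ObjectAttributes(String eTag, String versionId) {
    this.eTag = eTag;
    this.versionId = versionId;
  }

  String getETag() {
    return eTag;
  }

  String getVersionId() {
    return versionId;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof ObjectAttributes)) {
      return false;
    }
    ObjectAttributes other = (ObjectAttributes) o;
    return Objects.equals(eTag, other.eTag)
        && Objects.equals(versionId, other.versionId);
  }

  @Override
  public int hashCode() {
    return Objects.hash(eTag, versionId);
  }

  @Override
  public String toString() {
    return "ObjectAttributes{eTag=" + eTag + ", versionId=" + versionId + "}";
  }
}
```

A constructor could then take one ObjectAttributes argument instead of a 
separate eTag and versionId.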

bq. I am also getting some unit test failures running this in CDH. Will do some 
more test runs on the upstream base and with various parameters to see if I can 
narrow it down. I assume you've been running the tests with no problems?

Which tests are failing, and how are you running them on CDH?  I've been 
running the test suite successfully on both my PR branch and an internal 
branch where I patched the changes back into CDH5.

> S3Guard: use object version or etags to protect against inconsistent read 
> after replace/overwrite
> -------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-16085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16085
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Ben Roling
>            Assignee: Ben Roling
>            Priority: Major
>         Attachments: HADOOP-16085-003.patch, HADOOP-16085_002.patch, 
> HADOOP-16085_3.2.0_001.patch
>
>
> Currently S3Guard doesn't track S3 object versions.  If a file is written in 
> S3A with S3Guard and then subsequently overwritten, there is no protection 
> against the next reader seeing the old version of the file instead of the new 
> one.
> It seems like the S3Guard metadata could track the S3 object version.  When a 
> file is created or updated, the object version could be written to the 
> S3Guard metadata.  When a file is read, the read out of S3 could be performed 
> by object version, ensuring the correct version is retrieved.
> I don't have a lot of direct experience with this yet, but this is my 
> impression from looking through the code.  My organization is looking to 
> shift some datasets stored in HDFS over to S3 and is concerned about this 
> potential issue as there are some cases in our codebase that would do an 
> overwrite.
> I imagine this idea may have been considered before but I couldn't quite 
> track down any JIRAs discussing it.  If there is one, feel free to close this 
> with a reference to it.
> Am I understanding things correctly?  Is this idea feasible?  Any feedback 
> that could be provided would be appreciated.  We may consider crafting a 
> patch.
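
The idea in the description above (record the object version on write, then 
read by that version) can be sketched roughly as follows.  This is a toy 
in-memory model under stated assumptions, not S3A or S3Guard code: 
ToyVersionedStore stands in for a versioned S3 bucket, the map inside 
VersionTrackingClient stands in for the S3Guard metadata table, and every 
class and method name here is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a versioned S3 bucket; this is not the AWS SDK. */
final class ToyVersionedStore {
  // key -> (versionId -> contents)
  private final Map<String, Map<String, String>> objects = new HashMap<>();
  private int nextVersion = 1;

  /** Stores the contents under a fresh version id and returns that id. */
  String put(String key, String contents) {
    String versionId = "v" + nextVersion++;
    objects.computeIfAbsent(key, k -> new HashMap<>()).put(versionId, contents);
    return versionId;
  }

  /** Returns the contents of one specific version, or null if absent. */
  String getVersion(String key, String versionId) {
    Map<String, String> versions = objects.get(key);
    return versions == null ? null : versions.get(versionId);
  }
}

/** Hypothetical client that records the version on write and pins reads to it. */
final class VersionTrackingClient {
  private final ToyVersionedStore store;
  // Stand-in for the S3Guard metadata table: path -> last written version id.
  private final Map<String, String> metadata = new HashMap<>();

  VersionTrackingClient(ToyVersionedStore store) {
    this.store = store;
  }

  /** Writes the object and records the version id the store assigned. */
  void write(String path, String contents) {
    String versionId = store.put(path, contents);
    metadata.put(path, versionId);
  }

  /** Reads exactly the version recorded in the metadata, never "latest". */
  String read(String path) {
    String versionId = metadata.get(path);
    if (versionId == null) {
      throw new IllegalStateException("No tracked version for " + path);
    }
    return store.getVersion(path, versionId);
  }
}
```

Because read() asks for the recorded version rather than whatever the store 
currently considers the latest, a reader that consults the metadata cannot be 
handed the pre-overwrite contents by mistake.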



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

