[ 
https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771359#comment-16771359
 ] 

Ben Roling commented on HADOOP-15625:
-------------------------------------

Thanks [[email protected]].  I'm not taking it personally.  I understood you 
likely had many other things to do and were probably busy.  I'll try to be 
respectful of your time.  Let me know if there are ways I can be more helpful.

I'll get a new patch uploaded once you can address my questions about the 
exception type (whether to really use a subclass of EOFException) and the 
larger-to-shorter test.

bq. One thing I've been wondering is how third party stores are going to 
handle that modified header, and what we could do here.

I don't have experience with any third party stores yet.  Is there any 
documentation for me to look at about interacting with third party stores via 
S3AFileSystem?  I need to do a little homework I guess.  Maybe I can figure out 
how to try one or two of them out.

I tend to think doing the etag check solely on the client side would be the 
simplest solution.  We don't need to bother trying to use the 
withMatchingETagConstraint() method, which the third party store may or may not 
support.  I suppose there is a possibility that GetObject itself will never 
return an eTag for some third party stores.  I propose we simply log a 
warning in this case.  Obviously, we won't be able to detect a 
read-during-overwrite inconsistency in this scenario.  If you prefer, we can 
even put in a config key to make it possible to disable this warning, although 
I wouldn't expect it to be seen so often as to be offensive to users of such 
third-party stores (assuming such stores actually exist).
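
To make the client-side idea concrete, here is a rough sketch of the check I 
have in mind.  It is not code from any of the attached patches.  The names 
ETagChangeDetector, RemoteFileChangedException and onResponse are hypothetical, 
and the exception simply extends IOException here since the final exception 
type is still an open question.  The counter is only a stand-in for the 
suggested metric.

```java
import java.io.IOException;
import java.util.logging.Logger;

/** Hypothetical exception type; the real type is still under discussion. */
class RemoteFileChangedException extends IOException {
    RemoteFileChangedException(String message) {
        super(message);
    }
}

/**
 * Hypothetical helper: remembers the first eTag seen for an object and
 * compares the eTag of every later response against it.
 */
class ETagChangeDetector {
    private static final Logger LOG =
        Logger.getLogger(ETagChangeDetector.class.getName());

    private final String path;
    private String expectedETag;  // eTag of the first response; null until seen
    private boolean warned;       // so the "no eTag" warning is logged only once
    private long mismatchCount;   // stand-in for the suggested metric

    ETagChangeDetector(String path) {
        this.path = path;
    }

    /** Call with the eTag of each HEAD/GET response (may be null). */
    void onResponse(String eTag) throws RemoteFileChangedException {
        if (eTag == null) {
            // Store returned no eTag: changes cannot be detected, so just warn.
            if (!warned) {
                LOG.warning("No eTag returned for " + path
                    + "; cannot detect changes during read");
                warned = true;
            }
            return;
        }
        if (expectedETag == null) {
            expectedETag = eTag;  // first eTag seen becomes the reference
            return;
        }
        if (!expectedETag.equals(eTag)) {
            mismatchCount++;
            throw new RemoteFileChangedException("eTag of " + path
                + " changed during read: expected " + expectedETag
                + ", got " + eTag);
        }
    }

    long getMismatchCount() {
        return mismatchCount;
    }
}
```

The input stream would call onResponse() with the eTag from each GetObject 
response.  Nothing here depends on withMatchingETagConstraint(), so a store 
that ignores that header is not a problem for this approach.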

We can have some general documentation about read-during-overwrite with 
S3AFileSystem and talk about the protection we have here as well as the 
limitations, calling out that it may not be supported for third-party stores 
and noting the warning message that will be seen in such cases.  We can even 
document that it is (or is not) supported on specific third-party stores if we 
actually know.

Adding the suggested metric is another good idea.  I'm happy to put that in my 
patch.

> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
>                 Key: HADOOP-15625
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15625
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Major
>         Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch, 
> HADOOP-15625-003.patch
>
>
> S3A input stream doesn't handle changing source files any better than the 
> other cloud store connectors. Specifically: it doesn't notice it has 
> changed, caches the length from startup, and whenever a seek triggers a new 
> GET, you may get one of: old data, new data, and even perhaps go from new 
> data to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with 
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

