[
https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771359#comment-16771359
]
Ben Roling commented on HADOOP-15625:
-------------------------------------
Thanks [[email protected]]. I'm not taking it personally. I understood you
likely had many other things to do and were probably busy. I'll try to be
respectful of your time. Let me know if there are ways I can be more helpful.
I'll get a new patch uploaded once you can address my questions about the
exception type (whether to really use a subclass of EOFException) and the
larger-to-shorter test.
bq. One thing I've been wondering is how third party stores are going to
handle that modified header, and what we could do here.
I don't have experience with any third party stores yet. Is there any
documentation for me to look at about interacting with third party stores via
S3AFileSystem? I need to do a little homework I guess. Maybe I can figure out
how to try one or two of them out.
I tend to think doing the etag check solely on the client side would be the
simplest solution. We don't need to bother trying to use the
withMatchingETagConstraint() method, which the third party store may or may not
support. I suppose there is a possibility that GetObject itself will never
return an eTag for some third party stores. I propose we'd simply log a
warning in this case. Obviously, we won't be able to detect a
read-during-overwrite inconsistency in this scenario. If you prefer, we can
even put in a config key to make it possible to disable this warning, although
I wouldn't expect it to be seen so often as to be offensive to users of such
third-party stores (assuming such stores actually exist).
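
To make the client-side idea concrete, here is a rough sketch of what I have in mind. It is only an illustration, not the patch: ETagChangeDetector, RemoteFileChangedException, onResponse() and mismatchCount() are hypothetical names I made up for this comment, the exception type is still an open question above, and the warning goes to System.err only to keep the sketch self-contained. The counter stands in for the suggested metric.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical exception type; the real type is still under discussion. */
class RemoteFileChangedException extends IOException {
  RemoteFileChangedException(String message) {
    super(message);
  }
}

/** Hypothetical helper, not an existing S3AFileSystem class. */
class ETagChangeDetector {
  private String expectedETag;      // eTag seen on the first response that had one
  private boolean warnedMissing;    // so the "no eTag" warning is logged once per stream
  private final AtomicLong mismatches = new AtomicLong();

  /** Call with the eTag of each GET response; eTag may be null. */
  void onResponse(String path, String eTag) throws RemoteFileChangedException {
    if (eTag == null) {
      // Store returned no eTag: change detection is not possible, so only warn.
      if (!warnedMissing) {
        System.err.println("WARN: no eTag returned for " + path
            + "; read-during-overwrite cannot be detected");
        warnedMissing = true;
      }
      return;
    }
    if (expectedETag == null) {
      expectedETag = eTag;          // first eTag seen becomes the reference
      return;
    }
    if (!expectedETag.equals(eTag)) {
      mismatches.incrementAndGet(); // stand-in for the suggested metric
      throw new RemoteFileChangedException("eTag of " + path + " changed from "
          + expectedETag + " to " + eTag + " during read");
    }
  }

  long mismatchCount() {
    return mismatches.get();
  }
}
```

The input stream would call onResponse() after every GET it issues, so the comparison happens purely on the client and does not depend on the store honoring any request constraint.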
We can have some general documentation about read-during-overwrite with
S3AFileSystem and talk about the protection we have here as well as the
limitations, calling out that it may not be supported for third-party stores
and noting the warning message that will be seen in such cases. We can even
document that it is (or is not) supported on specific third-party stores if we
actually know.
Adding the suggested metric is another good idea. I'm happy to put that in my
patch.
> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
> Key: HADOOP-15625
> URL: https://issues.apache.org/jira/browse/HADOOP-15625
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.2.0
> Reporter: Brahma Reddy Battula
> Assignee: Brahma Reddy Battula
> Priority: Major
> Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch,
> HADOOP-15625-003.patch
>
>
> S3A input stream doesn't handle changing source files any better than the
> other cloud store connectors. Specifically: it doesn't notice it has
> changed, caches the length from startup, and whenever a seek triggers a new
> GET, you may get one of: old data, new data, and even perhaps go from new
> data to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)