[
https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714130#comment-16714130
]
Steve Loughran commented on HADOOP-15625:
-----------------------------------------
-1 as is
Main issues
assumes status passed in to stream always as a tag
h3. Production
requires etag to always be non-null; this doesn't hold if s3guard is providing
listings.
Fix:
* handle null tag in stream constructor
* and only add the constraint if non null.
* collect and cache on the first read, (and handle null there, for third party
stores).
* after the get, split error checks from null response vs different etag tested
for on client
Other changes
* FileStatus is already passed in to InputStream in {{S3AReadOpContext}}; no
need to pass in again.
* Fix: move to that, changing S3AFileStatus as type in context & return value
from getFileStatus in the process
* add package-scoped {{S3AInputStream.getETag()}} for testing
* New Exception is a subclass of PathIOException. Put it in ex/ sub package as
too many exceptions are filling up the base package for its own good.
*
h3. Tests
* Test in wrong place for an integration test (i.e. a Test* class)
* didn't have the @test tag, so wasn't actually running
* couldn't now handle s3guard file status as the file update was happening
before any initial read: there would have been no etag.
Fix:
* move to ITestS3AChangesToStreams (which is a renaming of a test suite which
only checked for
* Also: change test to handle s3guard run where there was no initial etag
* tests play games with seek() to force close/reopen with a guarantee that
we've already got an etag
Also add checks for
* overwrite dest with exactly the same dataset. For AWS this doesn't change the
dataset; be interesting to see what others do (currently: caught and printed,
but we could consider escalating)
* take the existing delete of open stream test, move to lambda-expressions, add
second test to verify that it's not picked up on a read() of an already open
stream which had an initial read()...the GET has done it
This has got enough change on it that we're going to have to seek some extra
reviewers. Be nice to test with some other other object stores
> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
> Key: HADOOP-15625
> URL: https://issues.apache.org/jira/browse/HADOOP-15625
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.2.0
> Reporter: Brahma Reddy Battula
> Assignee: Brahma Reddy Battula
> Priority: Major
> Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch,
> HADOOP-15625-003.patch
>
>
> S3A input stream doesn't handle changing source files any better than the
> other cloud store connectors. Specifically: it doesn't noticed it has
> changed, caches the length from startup, and whenever a seek triggers a new
> GET, you may get one of: old data, new data, and even perhaps go from new
> data to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]