[ https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566024#comment-16566024 ]
Steve Loughran commented on HADOOP-15625:
-----------------------------------------

-1

You have getObjectMetadata() in open(), which adds another HTTP round trip: an extra one on top of the already slow open sequence without S3Guard, and an HTTP request where one doesn't exist with S3Guard.

# in S3AInputStream.reopen(), the first GET should return that etag
# which can be stored in the S3AInputStream field
# After that, subsequent reopens can validate the etag returned, or include that tag in an unless-modified header in the call. (BTW, some issues related to etag checking [may exist in the SDK|https://github.com/aws/aws-sdk-java/issues/1211].)

This won't pick up a change in the source file between open() returning and the first read/readFully, but it will ensure that later changes fail fast.

bq. Ran the tests after this changes,there were failures related to this ( seen some encryption errors).

# make sure you've got all the s3 tests running before trying to change things, and that includes the s3guard stuff
# state which specific S3 endpoint you've been playing with
# if there were failures, list the tests which failed, as it could be new stuff. For example, encryption & etags may be a source of surprises (I don't think it will, but still..)

> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
> Key: HADOOP-15625
> URL: https://issues.apache.org/jira/browse/HADOOP-15625
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.2.0
> Reporter: Brahma Reddy Battula
> Priority: Major
> Attachments: HADOOP-15625-001.patch
>
>
> S3A input stream doesn't handle changing source files any better than the
> other cloud store connectors. Specifically: it doesn't notice that the file has
> changed, caches the length from startup, and whenever a seek triggers a new
> GET, you may get one of: old data, new data, and even perhaps go from new
> data to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.
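
The approach described in this thread (take the etag from the first GET, keep it in a field, compare it on every later reopen, and raise an IOException on a mismatch) could be sketched roughly as below. This is only an illustration under stated assumptions: `EtagTracker` and its methods are hypothetical names, not part of S3AInputStream or the attached patch, and the etag is passed in as a plain string rather than read from a real S3 response.

```java
import java.io.IOException;

/**
 * Hypothetical helper illustrating etag-based change detection.
 * Not part of S3AInputStream or of HADOOP-15625-001.patch.
 */
final class EtagTracker {
    private final String path;
    // Etag seen on the first GET; null until that GET has happened.
    private String expectedEtag;

    public EtagTracker(String path) {
        this.path = path;
    }

    /**
     * Call with the etag of every GET response issued by a (re)open.
     * The first call records the etag; later calls compare against it.
     */
    public void onGetResponse(String etag) throws IOException {
        if (etag == null) {
            // Assumption: if the store returns no etag, skip the check.
            return;
        }
        if (expectedEtag == null) {
            // First GET: remember the etag, no extra HEAD request needed.
            expectedEtag = etag;
            return;
        }
        if (!expectedEtag.equals(etag)) {
            // Fail fast instead of silently mixing old and new data.
            throw new IOException("Remote file changed during read: " + path
                + " (expected etag " + expectedEtag + ", got " + etag + ")");
        }
    }

    public String getExpectedEtag() {
        return expectedEtag;
    }
}
```

In this sketch a stream would call `onGetResponse()` from its reopen path after each GET. It covers only the "validate the returned etag" option; sending the stored etag as a conditional header on the request is the alternative mentioned in the comment and is not shown here.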