[ https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778565#comment-16778565 ]

Ben Roling commented on HADOOP-15625:
-------------------------------------

[[email protected]] - something I noticed with your updates is that the first 
read() on an S3AInputStream that detects a change will throw 
RemoteFileChangedException, but subsequent read() calls on the same stream will 
not.

This seems unexpected to me, but it appears as though you _may_ have done that 
intentionally?

My thinking is that you should need to explicitly open a new stream 
(FileSystem.open()) to get past the RemoteFileChangedException condition.

This behavior change only occurs when mode=client and is evidenced by the 
failing ITestS3ARemoteFileChanged tests.  It occurs because ChangeTracker moves 
to the new revision 
[here|https://github.com/steveloughran/hadoop/commit/5cf5d79fc9c5a6e256fa231b21731bd3219079bf#diff-c97e625906bcf378a192c522739e67baR167].

I'm inferring that this may have been intentional due to 
TestStreamChangeTracker.testEtagCheckingWarn(), which asserts a second mismatch 
is not counted 
[here|https://github.com/steveloughran/hadoop/commit/5cf5d79fc9c5a6e256fa231b21731bd3219079bf#diff-ad8ffc56d9d28ed3972d9a8d5efa1814R90].
  Maybe I shouldn't read too much into that.  You may have wanted to log only 
once (when mode=warn), without realizing the change also causes the exception 
to be thrown only once.  I'm inclined to think it would be acceptable to warn 
multiple times though (as many times as you'd see RemoteFileChangedException in 
the client and server modes).  What do you think?
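To make the two possible behaviors concrete, here is a minimal standalone sketch of the mismatch policy in question. This is not the actual org.apache.hadoop.fs.s3a.impl.ChangeTracker API; the class and method names (EtagTracker, processResponse, updateOnMismatch) are illustrative assumptions only:

```java
import java.io.IOException;
import java.util.Objects;

class RemoteFileChangedException extends IOException {
    RemoteFileChangedException(String msg) { super(msg); }
}

/** Illustrative stand-in for the change-tracking logic, NOT the real class. */
class EtagTracker {
    private String revisionId;               // etag captured at open()
    private final boolean updateOnMismatch;  // the policy choice being discussed

    EtagTracker(String initialEtag, boolean updateOnMismatch) {
        this.revisionId = initialEtag;
        this.updateOnMismatch = updateOnMismatch;
    }

    /** Compare the etag of a GET response against the tracked revision. */
    void processResponse(String responseEtag) throws RemoteFileChangedException {
        if (!Objects.equals(revisionId, responseEtag)) {
            String expected = revisionId;
            if (updateOnMismatch) {
                // Adopting the new revision means only the FIRST read() after
                // a change fails; later reads silently accept the new object.
                revisionId = responseEtag;
            }
            throw new RemoteFileChangedException(
                "etag changed: expected " + expected + ", got " + responseEtag);
        }
    }
}
```

With updateOnMismatch=true, a second read of the changed object succeeds silently (the behavior observed above); with updateOnMismatch=false, every read keeps failing until the caller reopens the stream.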

> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
>                 Key: HADOOP-15625
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15625
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Major
>         Attachments: HADOOP--15625-006.patch, HADOOP-15625-001.patch, 
> HADOOP-15625-002.patch, HADOOP-15625-003.patch, HADOOP-15625-004.patch, 
> HADOOP-15625-005.patch, HADOOP-15625-006.patch
>
>
> S3A input stream doesn't handle changing source files any better than the 
> other cloud store connectors. Specifically: it doesn't notice the file has 
> changed, it caches the length from startup, and whenever a seek triggers a 
> new GET you may get old data, new data, or perhaps even go from new data 
> back to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with 
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.
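The three steps in the issue description can be sketched as follows. This is a hedged, self-contained illustration of the scheme, not the real S3AInputStream internals; ChangedSourceInputStream, reopenAfterSeek, and the Supplier stand-in for the S3 GET response are all hypothetical names:

```java
import java.io.IOException;
import java.util.Objects;
import java.util.function.Supplier;

/** Illustrative sketch of etag-based change detection on reopen. */
class ChangedSourceInputStream {
    private final Supplier<String> etagOfNextGet; // stand-in for the S3 GET response
    private final String openEtag;                // step 1: cache etag of first HEAD/GET

    ChangedSourceInputStream(Supplier<String> etagOfNextGet) {
        this.etagOfNextGet = etagOfNextGet;
        this.openEtag = etagOfNextGet.get();
    }

    /** Simulates a seek that forces a fresh GET request. */
    void reopenAfterSeek() throws IOException {
        String responseEtag = etagOfNextGet.get();
        // step 2: verify the etag of each subsequent GET response
        if (!Objects.equals(openEtag, responseEtag)) {
            // step 3: fail loudly rather than silently mixing old and new data
            throw new IOException("Remote file changed: etag was " + openEtag
                + ", now " + responseEtag);
        }
    }
}
```

Note that here the cached etag is final: a changed file keeps failing on every reopen until the caller explicitly opens a new stream, which is the behavior argued for in the comment above.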



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
