[ 
https://issues.apache.org/jira/browse/HADOOP-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771191#comment-16771191
 ] 

Steve Loughran commented on HADOOP-15625:
-----------------------------------------

Thanks, Ben, I've not had a chance to. I'm running regression tests related to 
HADOOP-15843 in one window; backporting HADOOP-15281 to Hadoop 3.1+, 
backporting a lot of ABFS changes to some other branch, oh, and, when I get a 
chance, doing my own coding (HADOOP-16068).

Please don't take this personally.

One thing I've been wondering is how third-party stores are going to handle 
that modified header, and what we could do here. Ignoring the "this adds even 
more tests and documentation" problem, I could imagine multiple options here 
for some fs.s3a.etag.checks setting:

* server: we do it server-side
* client: do it on the client, which fails on a returned value. Deal with 
stores which don't support the etag
* warn: simply downgrade to a warning
* off: don't check
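
To make the four options concrete, here is a rough sketch of how such a 
setting might be parsed and applied. Everything in it is hypothetical: the 
EtagCheckMode and EtagCheckPolicy names, the reading of "server" as "send a 
conditional GET and let the store reject it", and the choice to skip the 
check when a store returns no etag are all assumptions for illustration, not 
code from any patch.

```java
import java.io.IOException;
import java.util.Locale;

/** Hypothetical modes for the proposed fs.s3a.etag.checks option. */
enum EtagCheckMode {
  SERVER, CLIENT, WARN, OFF;

  /** Parse a configuration value such as "client" (case-insensitive). */
  static EtagCheckMode fromConfig(String value) {
    return valueOf(value.trim().toUpperCase(Locale.ROOT));
  }
}

/** Hypothetical handling of the etag returned with a GET response. */
final class EtagCheckPolicy {
  private final EtagCheckMode mode;
  private long warnings;

  EtagCheckPolicy(EtagCheckMode mode) {
    this.mode = mode;
  }

  /** Assumption: "server" means the request carries the etag as a condition. */
  boolean useConditionalGet() {
    return mode == EtagCheckMode.SERVER;
  }

  long getWarnings() {
    return warnings;
  }

  /**
   * Compare the expected etag with the one the store returned.
   * Only CLIENT mode fails the read; WARN logs and counts the mismatch;
   * SERVER and OFF do no client-side comparison.
   */
  void onResponse(String expected, String actual) throws IOException {
    // Assumption: a null etag means the store does not supply one, so skip.
    if (actual == null || actual.equals(expected)) {
      return;
    }
    switch (mode) {
      case CLIENT:
        throw new IOException(
            "etag changed: expected " + expected + " but got " + actual);
      case WARN:
        warnings++;
        System.err.println(
            "WARN: etag changed from " + expected + " to " + actual);
        break;
      default:
        break;
    }
  }
}
```

A caller would build the policy once per stream, e.g. 
new EtagCheckPolicy(EtagCheckMode.fromConfig("warn")), and pass every 
response's etag through onResponse().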

What do you think?

Oh, and we can add more metrics to the 
org.apache.hadoop.fs.s3a.S3AInstrumentation.InputStreamStatistics class to 
count the number of times an inconsistency was observed. This could help 
monitoring/debugging across an entire cluster.
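
As a sketch of the kind of counter meant here: the class below is a 
hypothetical stand-in, not the real InputStreamStatistics, and its name and 
methods are invented for illustration. It only shows a thread-safe count 
per stream that can be merged into an aggregate.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-stream counter of observed etag inconsistencies. */
final class EtagMismatchCounter {
  private final AtomicLong mismatches = new AtomicLong();

  /** Record one observed inconsistency; returns the new total. */
  long recordMismatch() {
    return mismatches.incrementAndGet();
  }

  long getMismatches() {
    return mismatches.get();
  }

  /** Fold another stream's count into this one, e.g. when a stream closes. */
  void merge(EtagMismatchCounter other) {
    mismatches.addAndGet(other.getMismatches());
  }
}
```

Merging the per-stream counts into one filesystem-level counter is what 
would make the value useful for cluster-wide monitoring.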


> S3A input stream to use etags to detect changed source files
> ------------------------------------------------------------
>
>                 Key: HADOOP-15625
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15625
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>            Priority: Major
>         Attachments: HADOOP-15625-001.patch, HADOOP-15625-002.patch, 
> HADOOP-15625-003.patch
>
>
> S3A input stream doesn't handle changing source files any better than the 
> other cloud store connectors. Specifically: it doesn't notice it has 
> changed, caches the length from startup, and whenever a seek triggers a new 
> GET, you may get one of: old data, new data, and even perhaps go from new 
> data to old data due to eventual consistency.
> We can't do anything to stop this, but we could detect changes by
> # caching the etag of the first HEAD/GET (we don't get that HEAD on open with 
> S3Guard, BTW)
> # on future GET requests, verify the etag of the response
> # raise an IOE if the remote file changed during the read.
> It's a more dramatic failure, but it stops changes silently corrupting things.
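
The three steps in the description can be sketched as below. This is a 
minimal illustration, not the S3A input stream: EtagTracker and 
processResponse() are hypothetical names, and treating a missing etag as 
"nothing to compare" is an assumption.

```java
import java.io.IOException;

/** Hypothetical tracker: caches the first etag seen and verifies later ones. */
final class EtagTracker {
  private final String path;
  private String firstEtag; // null until a response supplies an etag

  EtagTracker(String path) {
    this.path = path;
  }

  String getFirstEtag() {
    return firstEtag;
  }

  /** Call with the etag of every HEAD/GET response for this stream. */
  void processResponse(String etag) throws IOException {
    if (etag == null) {
      return; // assumption: the store returned no etag, so nothing to check
    }
    if (firstEtag == null) {
      firstEtag = etag; // step 1: cache the etag of the first HEAD/GET
      return;
    }
    if (!firstEtag.equals(etag)) { // step 2: verify on later GETs
      // step 3: fail loudly rather than return mixed old/new data
      throw new IOException("File " + path + " changed during read: etag was "
          + firstEtag + ", now " + etag);
    }
  }
}
```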



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
