[ 
https://issues.apache.org/jira/browse/HADOOP-15999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743062#comment-16743062
 ] 

Steve Loughran commented on HADOOP-15999:
-----------------------------------------

It's not quite ready yet. 

The S3 getFileStatus call can be a lot more minimal when DDB already records 
that the file exists and all we want is to check its newness. We don't need 
any check for directories here; a simple HEAD (getObjectMetadata) is enough. 
If there isn't a file, don't bother with the HEAD of path + "/" or the LIST, 
which could save a few hundred millis.
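
A rough sketch of the difference in request counts, not the actual S3A code: 
FakeStore, StatusProbes, fullProbe and fileOnlyProbe are all hypothetical names 
made up for illustration, and the "store" is just an in-memory map that logs 
each request it would have issued.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical in-memory stand-in for S3; logs every request it receives. */
class FakeStore {
  final Map<String, Long> objects = new HashMap<>(); // key -> modification time
  final List<String> requests = new ArrayList<>();

  /** Simulated HEAD: returns the modification time if the key exists. */
  Optional<Long> head(String key) {
    requests.add("HEAD " + key);
    return Optional.ofNullable(objects.get(key));
  }

  /** Simulated LIST: true if any object key starts with the prefix. */
  boolean listHasChildren(String prefix) {
    requests.add("LIST " + prefix);
    for (String k : objects.keySet()) {
      if (k.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }
}

class StatusProbes {
  /** General-purpose probe: file HEAD, then directory marker HEAD, then LIST. */
  static String fullProbe(FakeStore s, String key) {
    if (s.head(key).isPresent()) {
      return "file";
    }
    if (s.head(key + "/").isPresent()) {
      return "dir";
    }
    return s.listHasChildren(key + "/") ? "dir" : "missing";
  }

  /**
   * Minimal probe for when the metadata store already says "this is a file":
   * a single HEAD, with no directory marker check and no LIST on a miss.
   */
  static Optional<Long> fileOnlyProbe(FakeStore s, String key) {
    return s.head(key);
  }
}
```

For a missing key, fullProbe issues three requests in this sketch while 
fileOnlyProbe issues one; that gap is where the saving would come from.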

We also want the etag for HADOOP-15625...I'd like to get that patch in before 
this one, but I'd also like that one to be done after HADOOP-15229.

Now, thinking about file opening: when we open a file we don't need to do a 
HEAD of that file first, because the moment the GET (or the SELECT) is kicked 
off, that request will do the same check...and we already have an invoker in 
S3AInputStream which knows to retry. And with HADOOP-15625 that first GET will 
give the etag, which is required to stay constant on later reads.

So... we should distinguish "getFileStatus() as part of the public API" from 
internal file-existence checks where we know the immediately subsequent call 
will be doing the equivalent anyway. 
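
To make the open-without-HEAD idea concrete, here is a minimal sketch under 
stated assumptions. CountingS3, StoredObject and LazyOpenStream are 
hypothetical illustration classes, not S3A classes; retries through the 
invoker are not modelled, and the etag check is only my reading of what 
HADOOP-15625 is after.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stored object: an etag plus its bytes. */
class StoredObject {
  final String etag;
  final byte[] data;

  StoredObject(String etag, byte[] data) {
    this.etag = etag;
    this.data = data;
  }
}

/** Hypothetical in-memory store that counts HEAD and GET requests. */
class CountingS3 {
  final Map<String, StoredObject> objects = new HashMap<>();
  int headCount;
  int getCount;

  boolean head(String key) {
    headCount++;
    return objects.containsKey(key);
  }

  StoredObject get(String key) {
    getCount++;
    StoredObject o = objects.get(key);
    if (o == null) {
      // The GET itself reports a missing file, so no prior HEAD is needed.
      throw new UncheckedIOException(new FileNotFoundException(key));
    }
    return o;
  }
}

/** Stream that is opened without any existence check. */
class LazyOpenStream {
  private final CountingS3 s3;
  private final String key;
  private String etag; // unknown until the first GET completes

  LazyOpenStream(CountingS3 s3, String key) {
    this.s3 = s3; // deliberately no HEAD here
    this.key = key;
  }

  byte[] read() {
    StoredObject o = s3.get(key);
    if (etag == null) {
      etag = o.etag; // the first GET pins the etag
    } else if (!etag.equals(o.etag)) {
      throw new IllegalStateException("etag changed during read of " + key);
    }
    return o.data;
  }
}
```

The public getFileStatus() path would still do its own probing; only the 
internal caller, which knows a GET follows immediately, skips the HEAD.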

> [s3a] Better support for out-of-band operations
> -----------------------------------------------
>
>                 Key: HADOOP-15999
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15999
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.1.0
>            Reporter: Sean Mackrory
>            Assignee: Gabor Bota
>            Priority: Major
>         Attachments: HADOOP-15999.001.patch, out-of-band-operations.patch
>
>
> S3Guard was initially done on the premise that a new MetadataStore would be 
> the source of truth, and that it wouldn't provide guarantees if updates were 
> done without using S3Guard.
> I've been seeing increased demand for better support for scenarios where 
> operations are done on the data that can't reasonably be done with S3Guard 
> involved. For example:
> * A file is deleted using S3Guard, and replaced by some other tool. S3Guard 
> can't tell the difference between the new file and delete / list 
> inconsistency and continues to treat the file as deleted.
> * An S3Guard-ed file is overwritten by a longer file by some other tool. When 
> reading the file, only the length of the original file is read.
> We could possibly have smarter behavior here by querying both S3 and the 
> MetadataStore (even in cases where we may currently only query the 
> MetadataStore in getFileStatus) and use whichever one has the higher modified 
> time.
> This kills the performance boost we currently get in some workloads with the 
> short-circuited getFileStatus, but we could keep it with authoritative mode 
> which should give a larger performance boost. At least we'd get more 
> correctness without authoritative mode and a clear declaration of when we can 
> make the assumptions required to short-circuit the process. If we can't 
> consider S3Guard the source of truth, we need to defer to S3 more.
> We'd need to be extra sure of any locality / time zone issues if we start 
> relying on mod_time more directly, but currently we're tracking the 
> modification time as returned by S3 anyway.
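
The "use whichever one has the higher modified time" rule in the description 
above could be sketched roughly as below. This is an illustration only: 
SimpleStatus and NewestWins are hypothetical names, a null argument stands for 
"no entry found", and tombstones and clock skew are not handled.

```java
/** Hypothetical, simplified file status: where it came from, length, mod time. */
class SimpleStatus {
  final String source;
  final long length;
  final long modTime;

  SimpleStatus(String source, long length, long modTime) {
    this.source = source;
    this.length = length;
    this.modTime = modTime;
  }
}

class NewestWins {
  /**
   * Pick between the MetadataStore entry and the S3 entry.
   * S3 wins only when it is strictly newer; on a tie the MetadataStore
   * entry is kept. Either argument may be null if that source has no entry.
   */
  static SimpleStatus pick(SimpleStatus fromMetadataStore, SimpleStatus fromS3) {
    if (fromS3 == null) {
      return fromMetadataStore;
    }
    if (fromMetadataStore == null) {
      return fromS3;
    }
    return fromS3.modTime > fromMetadataStore.modTime ? fromS3 : fromMetadataStore;
  }
}
```

In the overwritten-by-a-longer-file example, the S3 entry would carry the later 
modification time and the longer length, so the reader would see the whole 
file rather than the original length.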



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
