[
https://issues.apache.org/jira/browse/HADOOP-15999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804150#comment-16804150
]
Hudson commented on HADOOP-15999:
---------------------------------
FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16299 (See
[https://builds.apache.org/job/Hadoop-trunk-Commit/16299/])
HADOOP-15999. S3Guard: Better support for out-of-band operations. (stevel: rev
b5db2383832881034d57d836a8135a07a2bd1cf4)
* (edit) hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
* (add) hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3GuardOutOfBandOperations.java
* (edit) hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
* (edit) hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/s3guard.md
* (edit) hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/S3Guard.java
* (edit) hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
* (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractContractGetFileStatusTest.java
> S3Guard: Better support for out-of-band operations
> --------------------------------------------------
>
> Key: HADOOP-15999
> URL: https://issues.apache.org/jira/browse/HADOOP-15999
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.1.0
> Reporter: Sean Mackrory
> Assignee: Gabor Bota
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: HADOOP-15999-007.patch, HADOOP-15999.001.patch,
> HADOOP-15999.002.patch, HADOOP-15999.003.patch, HADOOP-15999.004.patch,
> HADOOP-15999.005.patch, HADOOP-15999.006.patch, HADOOP-15999.008.patch,
> HADOOP-15999.009.patch, out-of-band-operations.patch
>
>
> S3Guard was initially designed on the premise that a new MetadataStore would
> be the source of truth, and that it wouldn't provide guarantees if updates
> were made without using S3Guard.
> I've been seeing increased demand for better support for scenarios where
> operations are done on the data that can't reasonably be done with S3Guard
> involved. For example:
> * A file is deleted using S3Guard, and replaced by some other tool. S3Guard
> can't distinguish the new file from delete / list inconsistency, so it
> continues to treat the file as deleted.
> * An S3Guard-ed file is overwritten with a longer file by some other tool.
> When the file is read, only the length of the original file is read.
> We could possibly have smarter behavior here by querying both S3 and the
> MetadataStore (even in cases where we may currently only query the
> MetadataStore in getFileStatus) and using whichever one has the later
> modification time.
> This kills the performance boost we currently get in some workloads from the
> short-circuited getFileStatus, but we could keep it in authoritative mode,
> which should give a larger performance boost. At least we'd get more
> correctness without authoritative mode and a clear declaration of when we can
> make the assumptions required to short-circuit the process. If we can't
> consider S3Guard the source of truth, we need to defer to S3 more.
> We'd need to be extra sure of any locality / time zone issues if we start
> relying on mod_time more directly, but currently we're tracking the
> modification time as returned by S3 anyway.
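
The "query both and use whichever is newer" idea in the description can be sketched roughly as below. This is only an illustration, not the code from the commit: the class NewestStatusSketch, the Status type, and pickNewest are hypothetical names, tombstones for deleted entries are not modelled, and the tie-break (keep the MetadataStore entry when the times are equal) is an assumption.

```java
import java.util.Optional;

// Hypothetical sketch; none of these names come from the Hadoop code base.
final class NewestStatusSketch {

    /** Minimal stand-in for a file status: only length and modification time. */
    static final class Status {
        final long length;
        final long modTime; // milliseconds since the epoch, as reported by S3

        Status(long length, long modTime) {
            this.length = length;
            this.modTime = modTime;
        }
    }

    /**
     * Pick the entry to trust when both sources have been queried.
     * If only one source has an entry, that entry is returned.
     * If both do, the one with the strictly later modTime wins;
     * on a tie the MetadataStore entry is kept (an assumption).
     */
    static Optional<Status> pickNewest(Optional<Status> fromStore,
                                       Optional<Status> fromS3) {
        if (!fromStore.isPresent()) {
            return fromS3;
        }
        if (!fromS3.isPresent()) {
            return fromStore;
        }
        return fromS3.get().modTime > fromStore.get().modTime ? fromS3 : fromStore;
    }

    private NewestStatusSketch() {
    }
}
```

With this shape, the "overwritten by a longer file" case resolves to the S3 entry, because the out-of-band write carries a later modification time than the entry recorded in the MetadataStore.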