[
https://issues.apache.org/jira/browse/HADOOP-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927058#comment-15927058
]
Steve Loughran edited comment on HADOOP-13786 at 3/15/17 9:55 PM:
------------------------------------------------------------------
Patch 013.
This is becoming ready for review: the merge is done as far as core
functionality through a direct S3 connection is concerned. I've migrated
Ryan's code and fixed his tests and mine.
# Unified the commit logic between the staging committers; there's an explicit
{{preCommit(context, pending)}} in {{StagingS3GuardCommitter}} for subclasses
to override (base: no-op); if the precommit fails, pending ops are rolled back
(existing code in {{PartitionStagingCommitter}}).
# added subdirectory support to the {{DirectoryStagingCommitter}}
# added an option to create {{_SUCCESS}} markers in these committers; the
default is true, as per {{FileOutputCommitter}}
# some more tuning in the conflict resolution to trim the number of S3 calls.
Every {{getFileStatus}}/{{exists}} call is sacred
# all tests are passing
# all integration tests are passing
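The precommit hook in #1 can be sketched roughly as below. This is a minimal,
simplified illustration of the pattern (a no-op base hook, a subclass check,
and rollback of pending ops on failure); the class and method names here are
hypothetical stand-ins, not the actual hadoop-aws classes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the staging committer base class.
abstract class StagingCommitterSketch {
    // Pending (uncommitted) upload destinations, e.g. partition paths.
    final List<String> pending = new ArrayList<>();

    /** Subclass hook, invoked before the pending uploads are committed.
     *  The base implementation is a no-op. */
    protected void preCommit(List<String> pending) { }

    /** Run the precommit check; on failure, roll back the pending ops. */
    final String commitJob() {
        try {
            preCommit(pending);
        } catch (RuntimeException e) {
            pending.clear();  // abort/roll back the pending uploads
            return "aborted: " + e.getMessage();
        }
        String result = "committed " + pending.size() + " uploads";
        pending.clear();
        return result;
    }
}

// Stand-in for the partitioned variant: its preCommit fails if two
// pending ops target the same destination, triggering the rollback.
class PartitionCommitterSketch extends StagingCommitterSketch {
    @Override
    protected void preCommit(List<String> pending) {
        if (pending.stream().distinct().count() != pending.size()) {
            throw new RuntimeException("conflicting destinations");
        }
    }
}
```

The point of the design is that the rollback path lives once in the base
class; subclasses only supply the validation.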
TODO
* More tests needed, obviously. I'm thinking of some scale ones with many files
* some metrics; it'd be good to know the number of files and number of bytes
uploaded in commits. This is implicitly measured, but not called out. Knowing
the bytes uploaded in a commit will show the impact of the commit; compared
with {{files_copied_bytes}} (done in S3), you can start to estimate the cost
of the operations.
* switch to {{FileCommitActions}} and the S3A methods for the S3 ops. This
will require us to mock all that stuff too; I'm still thinking of the best way.
* have the MPU commits update S3Guard
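The metrics item above could be as simple as a pair of counters incremented as
each pending upload completes in the commit. A hedged sketch, assuming
hypothetical names (the real S3A instrumentation has its own statistics
classes; this only illustrates the shape of the data to collect):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical commit-time upload metrics: files and bytes committed.
// Thread-safe, since commits may complete uploads in parallel.
class CommitMetricsSketch {
    final AtomicLong filesCommitted = new AtomicLong();
    final AtomicLong bytesCommitted = new AtomicLong();

    /** Record one completed multipart upload of the given size. */
    void recordUpload(long bytes) {
        filesCommitted.incrementAndGet();
        bytesCommitted.addAndGet(bytes);
    }
}
```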
As an aside, because MPUs have request IDs and can be cancelled explicitly, it
could be possible for a future committer to actually write directly to S3,
saving the pending data to the committer directory in HDFS. This would blur
the magic committer into the staging one: just rely on HDFS to implement the
consistent commit logic through renames, and so use S3 as a destination with
or without consistency there. I think you'd still need the {{_magic}}
directory though, so that the working dir would trigger the special output
operation.
> Add S3Guard committer for zero-rename commits to consistent S3 endpoints
> ------------------------------------------------------------------------
>
> Key: HADOOP-13786
> URL: https://issues.apache.org/jira/browse/HADOOP-13786
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs/s3
> Affects Versions: HADOOP-13345
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: HADOOP-13786-HADOOP-13345-001.patch,
> HADOOP-13786-HADOOP-13345-002.patch, HADOOP-13786-HADOOP-13345-003.patch,
> HADOOP-13786-HADOOP-13345-004.patch, HADOOP-13786-HADOOP-13345-005.patch,
> HADOOP-13786-HADOOP-13345-006.patch, HADOOP-13786-HADOOP-13345-006.patch,
> HADOOP-13786-HADOOP-13345-007.patch, HADOOP-13786-HADOOP-13345-009.patch,
> HADOOP-13786-HADOOP-13345-010.patch, HADOOP-13786-HADOOP-13345-011.patch,
> HADOOP-13786-HADOOP-13345-012.patch, HADOOP-13786-HADOOP-13345-013.patch,
> s3committer-master.zip
>
>
> A goal of this code is "support O(1) commits to S3 repositories in the
> presence of failures". Implement it, including whatever is needed to
> demonstrate the correctness of the algorithm. (that is, assuming that s3guard
> provides a consistent view of the presence/absence of blobs, show that we can
> commit directly).
> I consider ourselves free to expose the blobstore-ness of the s3 output
> streams (ie. not visible until the close()), if we need to use that to allow
> us to abort commit operations.