steveloughran opened a new pull request #843: HADOOP-15183 S3Guard store 
becomes inconsistent after partial failure of rename
URL: https://github.com/apache/hadoop/pull/843
 
 
   This is a major patch which has grown to become a "fix up the final issues 
with Auth mode" patch, built according to my proposed [refactoring 
S3A](https://github.com/steveloughran/engineering-proposals/blob/master/refactoring-s3a.md)
 design.
   
   * New stuff goes into the `o.a.h.fs.s3a.impl` package to make clear it's for 
implementation
   
   ## Multi-object delete:
   *  `o.a.h.fs.s3a.impl.MultiObjectDeleteSupport` handles partial delete 
failures by parsing the response and updating the metastore with those deleted 
entries
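
A minimal sketch of the partial-failure bookkeeping, assuming the delete response lists the keys which failed and that everything else in the request was deleted. `PartialDeleteSketch` and `deletedKeys` are hypothetical names for illustration, not the `MultiObjectDeleteSupport` API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: works out which keys of a bulk delete went through. */
final class PartialDeleteSketch {

  /**
   * Assumption: any requested key not reported as a failure was deleted,
   * so its metastore entry can be removed.
   */
  static List<String> deletedKeys(List<String> requested, List<String> failed) {
    Set<String> failures = new HashSet<>(failed);
    List<String> deleted = new ArrayList<>();
    for (String key : requested) {
      if (!failures.contains(key)) {
        deleted.add(key);
      }
    }
    return deleted;
  }
}
```

The returned keys are the ones whose metastore entries would be removed; the failed ones are left alone so the store still matches what is in S3.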
   
   ## rename (i.e. copy + delete), plus scale and perf of commits
   
   Each metastore has the notion of
   * `BulkOperationState`; this can be used by a store to track entries which 
have been added/found during a set of operations (sequential or in parallel)
   * and a `RenameTracker` which implements the algorithm for updating the 
store on rename. The `DelayedUpdateTracker` implements the classic "do it at 
the end" algorithm, while the `ProgressiveRenameTracker` does it incrementally. 
   * and their own `addAncestors()` call which can update the bulk state based 
on examining the store and the current state of the `BulkOperationState`. This 
has been pushed down from `S3Guard.addAncestors()` to allow them to do their 
own thing with the bulk state.
   
   `S3AFileSystem.innerRename` now copies in parallel; the parallelism is hard 
coded at 10 for now (making it a factor of the delete page size makes sense). 
Each parallel copy operation notifies the current `RenameTracker` of its 
completion, letting the tracker choose what to do (`DelayedUpdateTracker`: 
saves the change; `ProgressiveRenameTracker`: updates the store immediately). 
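
The shape of the parallel copy can be sketched as below. This is an illustration only: `ParallelCopySketch`, `copyOne` and `onCopied` are hypothetical names, and the real code runs inside `innerRename` with its own executor and error handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Hypothetical sketch: bounded parallel copy with a completion callback. */
final class ParallelCopySketch {

  /** Matches the hard-coded limit described in the text. */
  static final int PARALLEL_COPIES = 10;

  static void copyAll(List<String> sources,
      Consumer<String> copyOne,
      Consumer<String> onCopied) {
    ExecutorService pool = Executors.newFixedThreadPool(PARALLEL_COPIES);
    try {
      List<CompletableFuture<Void>> pending = new ArrayList<>();
      for (String source : sources) {
        pending.add(CompletableFuture.runAsync(() -> {
          copyOne.accept(source);
          // tell the tracker as soon as this copy has finished
          onCopied.accept(source);
        }, pool));
      }
      // block until every copy has completed; rethrows the first failure
      CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    } finally {
      pool.shutdown();
    }
  }
}
```

The callback is invoked from the worker threads, so whatever receives it has to be thread safe.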
   
   The `RenameTracker` is also responsible for updating the store state in the 
case of a partial rename failure. `DelayedUpdateTracker`: update the metastore; 
`ProgressiveRenameTracker`: no-op.
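
The difference between the two strategies can be sketched as follows. All names here (`RenameTrackerSketch`, `DelayedSketch`, `ProgressiveSketch`) are hypothetical, and the "store" is just a list of strings; the real trackers work with metastore entries and also deal with source deletion and directories.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface; not the real RenameTracker signature. */
interface RenameTrackerSketch {
  void fileCopied(String source, String dest);
  void renameFailed();
  void completeRename();
}

/** "Do it at the end": queue changes, write them on completion or failure. */
final class DelayedSketch implements RenameTrackerSketch {
  private final List<String> store;
  private final List<String> pending = new ArrayList<>();

  DelayedSketch(List<String> store) { this.store = store; }

  public synchronized void fileCopied(String source, String dest) {
    pending.add(source + "->" + dest);
  }

  // on a partial failure, record what was actually copied
  public synchronized void renameFailed() { flush(); }

  public synchronized void completeRename() { flush(); }

  private void flush() {
    store.addAll(pending);
    pending.clear();
  }
}

/** Incremental: the store is updated as each copy completes. */
final class ProgressiveSketch implements RenameTrackerSketch {
  private final List<String> store;

  ProgressiveSketch(List<String> store) { this.store = store; }

  public synchronized void fileCopied(String source, String dest) {
    store.add(source + "->" + dest);
  }

  // the store already reflects every completed copy: nothing to do
  public void renameFailed() { }

  // nothing left to flush in this sketch
  public void completeRename() { }
}
```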
   
   Local and DDB metastores now only use the `ProgressiveRenameTracker`; this 
was done after I'd moved the classic design into its own tracker (i.e. out of 
`S3AFileSystem`). Oh, and there's a `NullRenameTracker` for the null metastore.
   
   The BulkOperationState ends up being passed all the way to the committers; 
to avoid them explicitly having to understand this, the `CommitOperations` 
class now has an inner (non-static) `CommitContext` class which contains this 
and exports the commit/abort/revert operations for the committers to call. 
Internally it then invokes WriteOperationHelper calls with the context, which 
then gets all the way through to `S3AFileSystem.finishedWrite()` for the 
updates there to decide what entries to put into the store.
   
   With this design
   * Renames are progressive and parallelized, so they are faster as well as 
keeping the metastore consistent.
   * Bulk delete failures are handled
   * Commit operations don't generate excessive load on the store: if you are 
committing 500 files, then each parent dir will be checked ~once. (I say ~once 
as the get() is done out of a sync block, so multiple parallel entries may 
duplicate the work, but I'd rather that than lock during a remote RPC call).
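
The "each parent dir is checked ~once" behaviour can be sketched like this. `AncestorStateSketch` is a hypothetical stand-in for the bulk state, and `existsInStore` stands in for the remote metastore lookup; neither is the actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Hypothetical sketch of bulk state remembering already-checked ancestors. */
final class AncestorStateSketch {
  private final Set<String> known = ConcurrentHashMap.newKeySet();
  private final AtomicInteger lookups = new AtomicInteger();

  /** Returns the ancestors which need an entry created in the store. */
  List<String> addAncestors(String path, Predicate<String> existsInStore) {
    List<String> toCreate = new ArrayList<>();
    for (String p = parentOf(path);
        p != null && !known.contains(p);
        p = parentOf(p)) {
      // No lock is held across the lookup, so two threads may both
      // check the same directory: duplicated work, but no blocking.
      lookups.incrementAndGet();
      if (!existsInStore.test(p)) {
        toCreate.add(p);
      }
      known.add(p);
    }
    return toCreate;
  }

  static String parentOf(String path) {
    int slash = path.lastIndexOf('/');
    return slash <= 0 ? null : path.substring(0, slash);
  }

  int lookups() { return lookups.get(); }
}
```

Committing many files under the same directory then walks up the tree only for the first file; later files stop at the first ancestor already seen.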
   
   Note that for both the parallelized commit and copy ops, we can also shuffle 
the source data so that we aren't doing it on the same lowest-level directory 
(what I'd expect today). That should reduce load on the S3 Store. With a move 
to parallel copy operations that will make a difference: the list comes in 
sequentially so the initial set of 10 renames will be adjacent. Actually I 
could be even cleverer there and sort by size: that ensures that a large 
file doesn't become the straggler in the copy.
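
One way the size ordering could look, as a sketch only. It assumes "sort by size" means largest first, so that the big copies start early; `CopyOrderSketch` and `Entry` are hypothetical names, and nothing like this is in the patch yet.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: order a listing so the largest files are copied first. */
final class CopyOrderSketch {

  /** Minimal stand-in for a listing entry: object key plus length in bytes. */
  static final class Entry {
    final String key;
    final long size;

    Entry(String key, long size) {
      this.key = key;
      this.size = size;
    }
  }

  /** Returns a new list sorted by descending size; the input is untouched. */
  static List<Entry> largestFirst(List<Entry> listing) {
    List<Entry> ordered = new ArrayList<>(listing);
    ordered.sort(Comparator.comparingLong((Entry e) -> e.size).reversed());
    return ordered;
  }
}
```

Sorting needs the complete listing up front, which is a trade-off against starting the first copies while the listing is still coming in.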
   
