steveloughran opened a new pull request #843: HADOOP-15183 S3Guard store becomes inconsistent after partial failure of rename URL: https://github.com/apache/hadoop/pull/843

This is a major patch which has grown into a "fix up the final issues with Auth mode" patch, built according to my proposed [refactoring S3A](https://github.com/steveloughran/engineering-proposals/blob/master/refactoring-s3a.md) design.

* New code goes into an `o.a.h.fs.s3a.impl` package, to make clear it's implementation-only.

## Multi-object delete

* `o.a.h.fs.s3a.impl.MultiObjectDeleteSupport` handles partial delete failures by parsing the response and updating the metastore with only those entries which were actually deleted.

## Rename (i.e. copy + delete), plus scale and performance of commits

Each metastore now has the notion of:

* a `BulkOperationState`, which a store can use to track entries which have been added/found during a set of operations (sequential or parallel);
* a `RenameTracker`, which implements the algorithm for updating the store on rename. The `DelayedUpdateTracker` implements the classic "do it at the end" algorithm, while the `ProgressiveRenameTracker` does it incrementally;
* its own `addAncestors()` call, which can update the bulk state based on examining both the store and the current state of the `BulkOperationState`. This has been pushed down from `S3Guard.addAncestors()` so that each store can do its own thing with the bulk state.

`S3AFileSystem.innerRename` now copies in parallel; the parallelism is hard-coded at 10 for now (making it a factor of the delete page size would make sense). Each parallel copy operation notifies the current `RenameTracker` of its completion, letting the tracker choose what to do: `DelayedUpdateTracker` saves the change for later; `ProgressiveRenameTracker` updates the store immediately. The `RenameTracker` also has the homework of updating the store state in the case of a partial rename failure: `DelayedUpdateTracker` updates the metastore; `ProgressiveRenameTracker` needs to do nothing.
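To illustrate the multi-object delete handling described above, here is a minimal sketch (not the actual S3A classes; all names are hypothetical) of the core idea: given the keys requested for deletion and the keys the store reported as failed, only the confirmed deletions are removed from the metastore:

```java
import java.util.*;

// Hypothetical sketch of partial-delete handling: work out which keys the
// store actually deleted, and remove only those from the metastore, so a
// partial failure leaves the metastore consistent with S3.
public class PartialDeleteSketch {

    /** Keys successfully deleted = requested keys minus reported failures. */
    static List<String> successfullyDeleted(List<String> requested,
                                            Set<String> failed) {
        List<String> deleted = new ArrayList<>();
        for (String key : requested) {
            if (!failed.contains(key)) {
                deleted.add(key);
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> requested = Arrays.asList("a/1", "a/2", "a/3");
        Set<String> failed = new HashSet<>(Collections.singleton("a/2"));

        // Metastore stub: path -> exists.
        Map<String, Boolean> metastore = new HashMap<>();
        requested.forEach(k -> metastore.put(k, true));

        // Only remove the entries the store confirmed as deleted;
        // the failed key ("a/2") stays in the metastore.
        for (String key : successfullyDeleted(requested, failed)) {
            metastore.remove(key);
        }
        System.out.println(metastore.keySet());
    }
}
```

In the real patch the failure information comes from parsing the multi-object delete response; the sketch just shows the partitioning step.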
Local and DDB metastores now use only the `ProgressiveRenameTracker`; this was done after I'd moved the classic design into its own tracker (i.e. out of `S3AFileSystem`). Oh, and there's a `NullRenameTracker` for the null metastore.

The `BulkOperationState` ends up being passed all the way to the committers. To avoid them having to understand it explicitly, the `CommitOperations` class now has an inner (non-static) `CommitContext` class which holds the state and exports the commit/abort/revert operations for the committers to call. Internally it invokes `WriteOperationHelper` calls with the context, which gets all the way through to `S3AFileSystem.finishedWrite()` for the updates there to decide which entries to put into the store.

With this design:

* Renames are progressive and parallelized, so they are faster as well as keeping the metastore consistent.
* Bulk delete failures are handled.
* Commit operations don't generate excessive load on the store: if you are committing 500 files, each parent dir will be checked ~once. (I say ~once as the `get()` is done out of a sync block, so multiple parallel entries may duplicate the work, but I'd rather that than hold a lock during a remote RPC call.)

Note that for both the parallelized commit and copy operations, we could also shuffle the source data so that we aren't working on the same lowest-level directory (which is what I'd expect today). That should reduce load on the S3 store. With the move to parallel copy operations this will make a difference: the listing comes in sequentially, so the initial set of 10 renames will be adjacent. Actually, I could be even cleverer there and sort by size. Why? It ensures that a large file doesn't become the straggler in the copy.
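The "sort by size" idea at the end can be sketched as follows (a hypothetical illustration, not code from the patch): submit the largest copies first to a small fixed pool, so a big object starts early rather than ending up as the last task running alone.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of largest-first scheduling for parallel copies.
// Ordering submissions by descending size means the biggest object overlaps
// with the small ones, instead of becoming the straggler at the end.
public class LargestFirstCopies {

    /** Order copy sources largest-first before submission. */
    static List<Map.Entry<String, Long>> largestFirst(Map<String, Long> sizes) {
        List<Map.Entry<String, Long>> order = new ArrayList<>(sizes.entrySet());
        order.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return order;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative object names and sizes.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("part-0000", 10L);
        sizes.put("part-0001", 5_000L);   // the large object: submitted first
        sizes.put("part-0002", 20L);

        // A small fixed pool stands in for the hard-coded parallelism of 10.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (Map.Entry<String, Long> e : largestFirst(sizes)) {
            pool.submit(() -> System.out.println(
                "copying " + e.getKey() + " (" + e.getValue() + " bytes)"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```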
