steveloughran commented on issue #654: HADOOP-15183 S3Guard store becomes inconsistent after partial failure of rename
URL: https://github.com/apache/hadoop/pull/654#issuecomment-490218551

This patch is now in a state where it is ready for review.

* It is going to have to be changed to keep up with the S3Guard versioning patches, so I'm hoping to nurture those in. The incompatibilities are related to the type of FileStatus passed around and general git merge problems, rather than functional conflict.

There's one production-side improvement I'd like to add.

This new patch does the move incrementally: whenever you add a file we call `s3guard.move(null, dest-file-status)` to add the destination (and ancestors); on a bulk delete we update the deletes.

But: that `move(List, List)` call creates all the parent paths, relying on a hash table to avoid duplicates. Once you move to single-file additions, both that and `metastore.put()` create too many entries, due to their need to meet the goal of "no duplicates". I want to restore the original behavior by passing in to the metastore the map being built up in the rename tracker, so it knows what already exists. (Note: this all needs to be done in a thread-safe way, to cope with more than one copy completing at the same time, and I don't want the locks for that to also block other updates to the metastore.)

This isn't a functionality change; it's a performance and cost improvement, one designed to keep those DDB write IOPs down.

## Please take a look at the code as it stands.

The architecture is based on my [refactoring S3A](https://github.com/steveloughran/engineering-proposals/blob/master/refactoring-s3a.md) doc: the new classes are designed to work with the new `StoreContext` class; the metastore moves with this.
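
To make the proposed improvement concrete, here is a minimal sketch of the idea of sharing a record of already-written entries across single-file additions. It is not code from the patch: the class `AncestorTracker` and its method `entriesToWrite` are hypothetical names, paths are plain strings rather than Hadoop `Path` objects, and a concurrent set stands in for the map built up in the rename tracker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch, not the S3A/S3Guard API: remembers which paths a
 * rename has already recorded, so per-file updates skip known ancestors.
 */
final class AncestorTracker {

  /** Paths already claimed during this rename; safe for concurrent use. */
  private final Set<String> known = ConcurrentHashMap.newKeySet();

  /**
   * Returns the entries still to be written for a newly copied file:
   * the file itself plus any ancestor directories not yet recorded.
   * The walk goes upwards and stops at the first path already known,
   * because that path's own ancestors were claimed along with it.
   * The root "/" is never returned.
   */
  List<String> entriesToWrite(String filePath) {
    List<String> toWrite = new ArrayList<>();
    String path = filePath;
    while (path != null && !path.isEmpty() && !path.equals("/")) {
      // Set.add is atomic here: exactly one caller wins each path,
      // and no lock is held while the caller does its store update.
      if (!known.add(path)) {
        break;
      }
      toWrite.add(path);
      int slash = path.lastIndexOf('/');
      path = slash <= 0 ? "/" : path.substring(0, slash);
    }
    return toWrite;
  }
}
```

With one tracker per rename, a first call for `/a/b/c/file1` yields the file and its three ancestors, while a later call for `/a/b/c/file2` yields only the file. The sketch only prevents duplicate writes; it does not order a child's write after its parents' writes when copies complete concurrently, which a real implementation would have to consider.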