steveloughran commented on issue #654: HADOOP-15183 S3Guard store becomes 
inconsistent after partial failure of rename
URL: https://github.com/apache/hadoop/pull/654#issuecomment-490218551
 
 
   This patch is now in a state where it is ready for review.
   
   * It's going to have to be changed to keep up with the S3Guard versioning 
patches, so I'm hoping to nurture those in. The incompatibilities are 
related to the type of FileStatus passed around and general git merge problems, 
rather than functional conflicts.
   
   There's one production side improvement I'd like to add.
   
   This new patch does the move incrementally: whenever you add a file we call 
`s3guard.move(null, dest-file-status)` to add the destination (and ancestors); 
on a bulk delete we update the deletes.
   
   But: that `move(List, List)` call creates all the parent paths, relying on a 
hash table to avoid duplicates. Once you move to single-file additions, 
both that and `metastore.put()` create too many entries, as the goal of 
"no duplicates" is only met within a single call. I want to restore the 
original behavior by passing in to the metastore the map being built up in the 
rename tracker, so it knows what already exists. (Note: this all needs to be 
done in a thread-safe way, for when more than one copy completes at the same 
time. I don't want the locks for that to also block other updates to the 
metastore.)
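   
   As a rough illustration of the idea, not the code in the patch: the sketch below keeps a concurrent set of the paths already recorded during one rename, and for each completed copy returns only the destination and ancestors that still need writing. The class name `AncestorTracker`, the method `newEntries`, and the use of plain string paths are all hypothetical, chosen only for this example; the real rename tracker and metastore types differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; not a class from the patch. */
final class AncestorTracker {

  // Paths already recorded during this rename. A concurrent set means
  // no lock is held that could block other metastore updates.
  private final Set<String> known = ConcurrentHashMap.newKeySet();

  /**
   * Returns the destination path plus any ancestors not yet recorded,
   * child first. Assumes absolute, '/'-separated paths; root is skipped.
   */
  List<String> newEntries(String destFile) {
    List<String> toWrite = new ArrayList<>();
    String path = destFile;
    while (path != null && !path.isEmpty() && !path.equals("/")) {
      if (!known.add(path)) {
        // Another call already claimed this path; that call is also
        // responsible for everything above it, so stop walking up.
        break;
      }
      toWrite.add(path);
      int slash = path.lastIndexOf('/');
      path = slash <= 0 ? null : path.substring(0, slash);
    }
    return toWrite;
  }
}
```

   With this shape, the first file copied under a directory pays for the parent entries and later files under the same directory add only themselves, which is where the saving in writes would come from.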
   
   This isn't a functionality change; it's a performance and cost improvement, 
one designed to keep those DDB write IOPs down.
   
   ## Please take a look at the code as it stands.
   
   The architecture is based on my [refactoring 
S3A](https://github.com/steveloughran/engineering-proposals/blob/master/refactoring-s3a.md)
 doc. The new classes are designed to work with the new `StoreContext` class; 
the metastore moves with this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

