[
https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441521#comment-16441521
]
Andrew Purtell commented on HBASE-20431:
----------------------------------------
Description subject to change as this idea is thought through.
> Store commit transaction for filesystems that do not support an atomic rename
> -----------------------------------------------------------------------------
>
> Key: HBASE-20431
> URL: https://issues.apache.org/jira/browse/HBASE-20431
> Project: HBase
> Issue Type: Sub-task
> Reporter: Andrew Purtell
> Priority: Major
>
> HBase expects the Hadoop filesystem implementation to support an atomic
> rename() operation. HDFS does. The S3 backed filesystems do not. The
> fundamental issue is the non-atomic and eventually consistent nature of the
> S3 service. An S3 bucket is not a filesystem, and S3 does not always provide
> immediate read-your-writes consistency. Object metadata can be temporarily
> inconsistent just after new objects are stored, so there can be a settling
> period to ride over. Renaming/moving objects from one path to another is
> implemented as copy operations, O(number of files) in count and O(data) in
> time, followed by a series of deletes, again O(number of files). A failure
> at any point prior to completion leaves the operation in an inconsistent
> state. The missing atomic rename semantic opens opportunities for corruption
> and data loss, which may or may not be repairable with HBCK.
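> A minimal sketch (not S3A's actual implementation) of what a rename amounts
> to over an object store, written against the generic Hadoop FileSystem API,
> shows where the failure windows are:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FileUtil;
> import org.apache.hadoop.fs.Path;
>
> public class NonAtomicRenameSketch {
>   // Emulate rename(src, dst) as copy-then-delete. A crash after some copies
>   // but before the deletes leaves both paths partially populated.
>   public static void renameByCopy(FileSystem fs, Path src, Path dst,
>       Configuration conf) throws IOException {
>     for (FileStatus stat : fs.listStatus(src)) {        // O(files) listing
>       Path target = new Path(dst, stat.getPath().getName());
>       // O(data) copy of each object; no atomicity across objects.
>       FileUtil.copy(fs, stat.getPath(), fs, target, false, conf);
>     }
>     fs.delete(src, true);                               // O(files) deletes
>   }
> }
> {code}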
> Handling this at the HBase level could be done with a new multi-step
> filesystem transaction framework. Call it StoreCommitTransaction.
> SplitTransaction and MergeTransaction are well-established cases where, even
> on HDFS, we make non-atomic filesystem changes; they are the implementation
> template for the new work. In this new StoreCommitTransaction we would be
> moving flush and compaction temporaries out of the temporary directory into
> the region store directory. On HDFS the implementation would be easy: we can
> rely on the filesystem's atomic rename semantics. On S3 it would take real
> work: first build the list of objects to move, then copy each object to the
> destination, and finally delete all objects at the original path. We must
> handle transient errors with retry strategies appropriate to the action at
> hand. We must handle serious errors that do not require aborting the RS by
> rolling back and cleaning everything up. Finally, we must handle permanent
> errors where the RS must be aborted, with the rollback performed during
> region open/recovery. Note that once all objects have been copied and we are
> deleting the obsolete source objects, we must roll forward, not back. To
> support recovery after an abort we must use the WAL to track transaction
> progress: write markers for StoreCommitTransaction start and completion
> states, with details of the store file(s) involved, so the transaction can
> be rolled back during region recovery at open. This will be significant work
> in HFile, HStore, the flusher, the compactor, and HRegion. Wherever we use
> HDFS's rename today we would substitute this new multi-step filesystem
> transaction.
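> A rough skeleton, hypothetical and modeled loosely on the
> prepare/execute/rollback shape of SplitTransaction, of how
> StoreCommitTransaction could sequence these steps and journal its progress;
> the WAL marker method is a placeholder, not an existing API:
> {code:java}
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.FileUtil;
> import org.apache.hadoop.fs.Path;
>
> public class StoreCommitTransactionSketch {
>   enum Phase { STARTED, COPIED, COMPLETE }
>
>   private final FileSystem fs;
>   private final Path tmpDir;    // flush/compaction temporaries
>   private final Path storeDir;  // final region store directory
>   private final Configuration conf;
>   private final List<Path> copied = new ArrayList<>();
>   private Phase phase = Phase.STARTED;
>
>   StoreCommitTransactionSketch(FileSystem fs, Path tmpDir, Path storeDir,
>       Configuration conf) {
>     this.fs = fs;
>     this.tmpDir = tmpDir;
>     this.storeDir = storeDir;
>     this.conf = conf;
>   }
>
>   void execute() throws IOException {
>     writeWalMarker("STORE_COMMIT_START");   // journal start + file list
>     for (FileStatus stat : fs.listStatus(tmpDir)) {
>       Path target = new Path(storeDir, stat.getPath().getName());
>       FileUtil.copy(fs, stat.getPath(), fs, target, false, conf);
>       copied.add(target);
>     }
>     // Point of no return: all copies exist, so recovery must roll forward.
>     phase = Phase.COPIED;
>     fs.delete(tmpDir, true);                // delete obsolete sources
>     writeWalMarker("STORE_COMMIT_COMPLETE");
>     phase = Phase.COMPLETE;
>   }
>
>   void rollback() throws IOException {
>     // Only legal before all copies have landed; afterwards, roll forward.
>     if (phase == Phase.STARTED) {
>       for (Path p : copied) {
>         fs.delete(p, false);
>       }
>     }
>   }
>
>   // Placeholder: a real implementation would append a marker edit to the
>   // region's WAL recording the transaction state and store files involved.
>   private void writeWalMarker(String state) throws IOException {
>     // no-op in this sketch
>   }
> }
> {code}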
> We need to determine this for certain, but I believe that on S3 the PUT or
> multipart upload of an object must complete before the object becomes
> visible, so in normal operation we don't have to worry about an object being
> visible before it is fully uploaded. An individual object copy will either
> complete, at which point the target becomes visible, or it won't, and the
> target won't exist.
> S3 has an optimization, PUT COPY
> (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html), which
> the Amazon S3 client embedded in S3A uses for moves. When designing the
> StoreCommitTransaction, be sure to allow for filesystem implementations that
> leverage a server side copy operation; doing a get-then-put should be
> optional. (I'm not sure Hadoop has an interface that advertises this
> capability yet; we can add one if not.)
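> One way to expose that choice, purely as a hypothetical sketch of the kind
> of interface floated above (nothing like it is assumed to exist in Hadoop
> today), would be a capability mixin that object store filesystems could
> implement, letting StoreCommitTransaction prefer a server side copy and fall
> back to get-then-put otherwise:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.fs.Path;
>
> /**
>  * Hypothetical capability interface: a FileSystem that can copy an object
>  * without moving the data through the client would implement this, and
>  * StoreCommitTransaction would test for it with instanceof before falling
>  * back to a get-then-put stream copy.
>  */
> public interface ServerSideCopyCapable {
>   /** True if src can be copied to dst entirely on the server side. */
>   boolean canCopyServerSide(Path src, Path dst);
>
>   /** Perform the server side copy, e.g. via S3 PUT COPY. */
>   void copyServerSide(Path src, Path dst) throws IOException;
> }
> {code}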
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)