Andrew Purtell created HBASE-20431:
--------------------------------------

             Summary: Store commit transaction for filesystems that do not 
support an atomic rename
                 Key: HBASE-20431
                 URL: https://issues.apache.org/jira/browse/HBASE-20431
             Project: HBase
          Issue Type: Sub-task
            Reporter: Andrew Purtell


HBase expects the Hadoop filesystem implementation to support an atomic 
rename() operation. HDFS does. The S3 backed filesystems do not. The 
fundamental issue is the non-atomic and eventually consistent nature of the S3 
service. An S3 bucket is not a filesystem. S3 does not always provide immediate 
read-your-writes consistency. Object metadata can be temporarily inconsistent 
just after new objects are stored. There can be a settling period to ride over. 
Renaming/moving objects from one path to another is a copy operation with 
O(files) complexity and O(data) time, followed by a series of deletes with 
O(files) complexity. A failure at any point prior to completion will leave the 
operation in an inconsistent state. The missing atomic rename semantic opens 
opportunities for corruption and data loss, which may or may not be repairable 
with HBCK.
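
For illustration, a rename on an object store degenerates into something like 
the following (a minimal sketch against the Hadoop FileSystem API; the real S3A 
client drives the AWS SDK directly, so treat this as an approximation of the 
failure window rather than the actual implementation):

{code:java}
// Sketch: what "rename" costs on an object store with no atomic rename.
// Every byte of every object is copied, then every source object is deleted.
// A crash between the copy loop and the delete loop leaves both paths
// populated; a crash mid-copy leaves a partial destination.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class NonAtomicRenameSketch {
  static void renameByCopy(FileSystem fs, Path src, Path dst, Configuration conf)
      throws java.io.IOException {
    for (FileStatus stat : fs.listStatus(src)) {           // O(files) listing
      Path target = new Path(dst, stat.getPath().getName());
      FileUtil.copy(fs, stat.getPath(), fs, target,         // O(data) time per object
          false /* don't delete source yet */, conf);
    }
    // Only after every copy succeeded do we delete the sources: O(files) deletes.
    fs.delete(src, true);
  }
}
{code}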

Handling this at the HBase level could be done with a new multi-step filesystem 
transaction framework. Call it StoreCommitTransaction. SplitTransaction and 
MergeTransaction are well established cases where, even on HDFS, we have 
non-atomic filesystem changes, and they are our implementation template for the 
new work. In this new StoreCommitTransaction we'd be moving flush and 
compaction temporaries out of the temporary directory into the region store 
directory. On HDFS the implementation would be easy: we can rely on the 
filesystem's atomic rename semantics. On S3 it would be more involved: first we 
would build the list of objects to move, then copy each object into the 
destination, and finally delete all objects at the original path. We must 
handle transient errors with retry strategies appropriate for the action at 
hand. We must handle serious or permanent errors that do not require aborting 
the RS with a rollback that cleans everything up. Finally, we must handle 
permanent errors where the RS must be aborted, with a rollback during region 
open/recovery. Note that after all objects have been copied and we are deleting 
obsolete source objects we must roll forward, not back. To support recovery 
after an abort we must utilize the WAL to track transaction progress: put 
markers in for StoreCommitTransaction start and completion state, with details 
of the store file(s) involved, so the transaction can be rolled back during 
region recovery at open. This will be significant work in HFile, HStore, the 
flusher, the compactor, and HRegion. Wherever we use HDFS's rename now we would 
substitute the running of this new multi-step filesystem transaction.
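
A rough sketch of the shape this could take (all class and method names below 
are hypothetical; the WAL marker and retry helpers are placeholders for 
whatever we actually design):

{code:java}
// Hypothetical sketch of StoreCommitTransaction for object-store backed
// filesystems. Not an actual HBase API: markWalStart/markWalComplete,
// copyWithRetry, and the phase bookkeeping are placeholders.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public abstract class StoreCommitTransaction {
  protected final FileSystem fs;
  protected final Path tmpDir;    // flush/compaction temporary output
  protected final Path storeDir;  // final region store directory
  private final List<Path> copied = new ArrayList<>();

  protected StoreCommitTransaction(FileSystem fs, Path tmpDir, Path storeDir) {
    this.fs = fs;
    this.tmpDir = tmpDir;
    this.storeDir = storeDir;
  }

  public void execute() throws IOException {
    List<FileStatus> sources = listSources();
    markWalStart(sources);                 // record intent + store file details in the WAL
    try {
      for (FileStatus src : sources) {
        Path dst = new Path(storeDir, src.getPath().getName());
        copyWithRetry(src.getPath(), dst); // transient errors retried here
        copied.add(dst);
      }
    } catch (IOException e) {
      rollback();                          // before the delete phase we can still roll back
      throw e;
    }
    // Point of no return: all destinations exist, now remove the sources.
    // Failures past this point must roll forward, not back.
    for (FileStatus src : sources) {
      fs.delete(src.getPath(), false);
    }
    markWalComplete();
  }

  private void rollback() throws IOException {
    for (Path dst : copied) {
      fs.delete(dst, false);               // remove partially committed destinations
    }
  }

  private List<FileStatus> listSources() throws IOException {
    List<FileStatus> result = new ArrayList<>();
    for (FileStatus s : fs.listStatus(tmpDir)) {
      result.add(s);
    }
    return result;
  }

  // Pieces the real design would have to provide.
  protected abstract void markWalStart(List<FileStatus> sources) throws IOException;
  protected abstract void markWalComplete() throws IOException;
  protected abstract void copyWithRetry(Path src, Path dst) throws IOException;
}
{code}

On HDFS the copy-then-delete phases would collapse to a single rename. On S3, 
region open would consult the WAL markers to decide whether an in-flight 
transaction should be rolled back (no sources deleted yet) or rolled forward 
(copies complete, sources pending delete).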

We need to determine this for certain, but I believe the PUT or multipart 
upload of an object must complete before the object is visible, so we don't 
have to worry about the case where an object is visible before it is fully 
uploaded as part of normal operations. So an individual object copy will either 
happen entirely, and the target will then become visible, or it won't, and the 
target won't exist.

S3 has an optimization, PUT COPY 
(https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html), which 
the Amazon S3 client embedded in S3A utilizes for moves. When designing the 
StoreCommitTransaction be sure to allow for filesystem implementations that 
leverage a server side copy operation; doing a get-then-put should be optional. 
(Not sure Hadoop has an interface that advertises this capability yet; we can 
add one if not.)
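
For example, a capability interface along these lines could let the transaction 
prefer a server side copy when the filesystem offers one (purely illustrative; 
this interface does not exist in Hadoop and the names are made up):

{code:java}
// Hypothetical capability interface; Hadoop does not provide this today.
// A FileSystem that can perform server side copies (e.g. S3A via PUT COPY)
// would implement it, and StoreCommitTransaction would probe for it with
// instanceof before falling back to a get-then-put stream copy.
import java.io.IOException;
import org.apache.hadoop.fs.Path;

public interface ServerSideCopyCapable {
  /** Copy src to dst without streaming the data through the client. */
  void serverSideCopy(Path src, Path dst) throws IOException;
}
{code}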


