[ https://issues.apache.org/jira/browse/HBASE-20431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450384#comment-16450384 ]

Steve Loughran commented on HBASE-20431:
----------------------------------------

S3Guard is only needed when you want consistency on S3A; Amazon have their own 
(consistent EMRFS), and other people (WDC) sell products which are consistent 
out of the box. If Ceph is consistent, all is good and you don't need anything 
else. Trying to work with an inconsistent S3 is dangerous unless you explicitly 
put long delays in. For example, in a recovery, always wait a minute or more 
before listing.

bq.  in testing I noticed some times we'd get back (paraphrased) "200 Internal 
Error, please retry"

Not seen that; assume it's handled in the AWS client. We do have retries on some 
throttles and transient errors, especially that final POST of an MPU, but 200 
isn't considered an error code. 503 is the throttle, I believe; see 
S3AUtils.translateException() for our understanding there.
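
To make that concrete, here is a minimal sketch (not S3A's actual retry logic) 
of the kind of backoff loop an application could wrap around its own S3 calls 
when it can't lean on the SDK/S3A policies; it uses the AWS Java SDK v1 
AmazonServiceException and treats only 503 as retryable. The class name, 
attempt cap and sleep values are made up for the example.

{code:java}
import com.amazonaws.AmazonServiceException;

import java.util.concurrent.Callable;

/** Sketch only: retry a unit of S3 work on 503 (throttle) with exponential backoff. */
public final class ThrottleRetry {

  public static <T> T withRetries(Callable<T> action) throws Exception {
    int attempts = 0;
    long sleepMs = 500;          // arbitrary starting backoff
    final int maxAttempts = 5;   // arbitrary cap
    while (true) {
      try {
        return action.call();
      } catch (AmazonServiceException e) {
        // 503 = SlowDown/throttle; anything else is rethrown untouched.
        if (e.getStatusCode() != 503 || ++attempts >= maxAttempts) {
          throw e;
        }
        Thread.sleep(sleepMs);
        sleepMs *= 2;            // back off harder on each retry
      }
    }
  }
}
{code}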

bq.  We also have in our design scope running against Ceph's radosgw so I don't 
know if we can rely on it totally, but we can take advantage of it if we detect 
we are running against S3 proper.

Raw AWS S3 *absolutely* keeps the output of an MPU invisible until the final 
POST of the ordered list of checksums of the uploaded parts. You get billed for 
all that data, so it's good to have code to list & purge it (the hadoop s3guard 
CLI does). Provided the other stores you work with have the same MPU visibility 
semantics, all will be well.
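
As an illustration of the "list & purge" idea (a sketch, not the s3guard CLI's 
code), something like this against the AWS Java SDK v1 would do it; the class 
name is invented and result pagination is skipped for brevity.

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUpload;

/** Sketch only: abort every pending (uncommitted) multipart upload under a prefix. */
public final class PurgePendingUploads {

  public static void purge(AmazonS3 s3, String bucket, String prefix) {
    ListMultipartUploadsRequest req =
        new ListMultipartUploadsRequest(bucket).withPrefix(prefix);
    // A real tool would loop while the listing is truncated; omitted here.
    for (MultipartUpload upload : s3.listMultipartUploads(req).getMultipartUploads()) {
      s3.abortMultipartUpload(
          new AbortMultipartUploadRequest(bucket, upload.getKey(), upload.getUploadId()));
    }
  }
}
{code}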

Who to ask about Ceph? 

# Maybe [~stevewatt] has a suggestion? It's good to ask the developers to see 
what they think their system should do...
# [~iyonger] has been testing S3A and Ceph.
# And I think now we should make sure there is an explicit test for S3A which 
verifies that uncommitted MPUs aren't visible; see the sketch after this list. 
I'm sure that's covered implicitly, but having it drawn out into a single 
method is easier to look at when there are failures.
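
A rough sketch of what that test could look like if written straight against 
the AWS Java SDK v1 rather than the S3A contract-test framework; the client and 
bucket wiring is assumed to come from the test harness, and the names are 
invented for the sketch.

{code:java}
import static org.junit.Assert.assertFalse;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.UploadPartRequest;
import java.io.ByteArrayInputStream;
import org.junit.Test;

/** Sketch only: a part uploaded but never committed must not be visible as an object. */
public class TestUncommittedMPUInvisible {

  private AmazonS3 s3;      // assumed to be wired up by the harness
  private String bucket;    // assumed test bucket

  @Test
  public void uncommittedUploadIsInvisible() throws Exception {
    String key = "test/uncommitted-mpu";
    String uploadId = s3.initiateMultipartUpload(
        new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
    try {
      byte[] data = new byte[5 * 1024 * 1024];   // 5 MB of zeros as the part payload
      s3.uploadPart(new UploadPartRequest()
          .withBucketName(bucket).withKey(key)
          .withUploadId(uploadId).withPartNumber(1)
          .withInputStream(new ByteArrayInputStream(data))
          .withPartSize(data.length));
      // No CompleteMultipartUpload has been POSTed, so the object must not exist yet.
      assertFalse(s3.doesObjectExist(bucket, key));
    } finally {
      s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId));
    }
  }
}
{code}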

bq. I would not expect you to volunteer code, no worries! (That would be 
obnoxious... (smile))

Thanks. I'd volunteer Ewan and Thomas but (a) they don't listen to me and (b) 
they're going to do the API you need with a goal of having it work with other 
stores too.

FYI [~fabbri]

> Store commit transaction for filesystems that do not support an atomic rename
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-20431
>                 URL: https://issues.apache.org/jira/browse/HBASE-20431
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Purtell
>            Priority: Major
>
> HBase expects the Hadoop filesystem implementation to support an atomic 
> rename() operation. HDFS does. The S3-backed filesystems do not. The 
> fundamental issue is the non-atomic and eventually consistent nature of the 
> S3 service. An S3 bucket is not a filesystem. S3 is not always immediately 
> read-your-writes. Object metadata can be temporarily inconsistent just after 
> new objects are stored. There can be a settling period to ride over. 
> Renaming/moving objects from one path to another are copy operations with 
> O(file) complexity and O(data) time followed by a series of deletes with 
> O(file) complexity. Failures at any point prior to completion will leave the 
> operation in an inconsistent state. The missing atomic rename semantic opens 
> opportunities for corruption and data loss, which may or may not be 
> repairable with HBCK.
> Handling this at the HBase level could be done with a new multi-step 
> filesystem transaction framework. Call it StoreCommitTransaction. 
> SplitTransaction and MergeTransaction are well-established cases where even 
> on HDFS we have non-atomic filesystem changes; they are our implementation 
> template for the new work. In this new StoreCommitTransaction we'd be moving 
> flush and compaction temporaries out of the temporary directory into the 
> region store directory. On HDFS the implementation would be easy. We can rely 
> on the filesystem's atomic rename semantics. On S3 it would be work: First we 
> would build the list of objects to move, then copy each object into the 
> destination, and then finally delete all objects at the original path. We 
> must handle transient errors with retry strategies appropriate for the action 
> at hand. We must handle serious or permanent errors where the RS doesn't need 
> to be aborted with a rollback that cleans it all up. Finally, we must handle 
> permanent errors where the RS must be aborted with a rollback during region 
> open/recovery. Note that after all objects have been copied and we are 
> deleting obsolete source objects we must roll forward, not back. To support 
> recovery after an abort we must utilize the WAL to track transaction 
> progress. Put markers in for StoreCommitTransaction start and completion 
> state, with details of the store file(s) involved, so it can be rolled back 
> during region recovery at open. This will be significant work in HFile, 
> HStore, flusher, compactor, and HRegion. Wherever we use HDFS's rename now we 
> would substitute the running of this new multi-step filesystem transaction.
> We need to determine this for certain, but I believe on S3 the PUT or 
> multipart upload of an object must complete before the object is visible, so 
> we don't have to worry about the case where an object is visible before fully 
> uploaded as part of normal operations. So an individual object copy will 
> either happen entirely and the target will then become visible, or it won't 
> and the target won't exist.
> S3 has an optimization, PUT COPY 
> (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html), which 
> the AmazonClient embedded in S3A utilizes for moves. When designing the 
> StoreCommitTransaction be sure to allow for filesystem implementations that 
> leverage a server side copy operation. Doing a get-then-put should be 
> optional. (Not sure Hadoop has an interface that advertises this capability 
> yet; we can add one if not.)
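
To illustrate the copy-then-delete move the description above outlines (this is 
not HBase's design, and it ignores the multipart-copy path needed for objects 
over 5 GB), a minimal sketch using the AWS Java SDK v1's server-side copy:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CopyObjectRequest;

/** Sketch only: a single-object "move" as a server-side copy followed by a delete. */
public final class S3Move {

  public static void move(AmazonS3 s3, String bucket, String srcKey, String dstKey) {
    // PUT COPY: the copy runs inside S3; the client never streams the data itself.
    s3.copyObject(new CopyObjectRequest(bucket, srcKey, bucket, dstKey));
    // Only after the copy succeeds is the source removed; a failure before this
    // point leaves the source intact, which is what makes rollback possible.
    s3.deleteObject(bucket, srcKey);
  }
}
{code}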



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
