[
https://issues.apache.org/jira/browse/HBASE-27841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719467#comment-17719467
]
Andrew Kyle Purtell commented on HBASE-27841:
---------------------------------------------
[~wchevreuil] in this design we also take care of HBASE-27825. The archiving of
a store file would only require the coordinated update of two manifest files,
which could be implemented as a procedure. The extremely inefficient (in
comparison) server side file moves of the hfiles would no longer be necessary.
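To make the idea concrete, here is a minimal sketch (not HBase code; all names such as {{archive_store_file}} and the dict-as-object-store are assumptions for illustration) of archiving as a coordinated update of two manifests, with the HFile object itself never moving:

```python
# Hypothetical sketch: archive a store file by updating two manifest
# files rather than copying/moving the HFile bytes. A plain dict stands
# in for the object store; a manifest is a JSON list of HFile names.
import json

def load_manifest(fs, path):
    """Read a manifest: a JSON list of HFile names, or [] if absent."""
    raw = fs.get(path)
    return json.loads(raw) if raw is not None else []

def save_manifest(fs, path, entries):
    """Write the manifest back; on S3 a single PUT replaces the object."""
    fs[path] = json.dumps(sorted(entries))

def archive_store_file(fs, live_manifest_path, archive_manifest_path, hfile):
    """Procedure body: drop the HFile from the live manifest and record
    it in the archive manifest. The HFile data is never rewritten."""
    live = load_manifest(fs, live_manifest_path)
    if hfile not in live:
        raise ValueError("unknown store file: %s" % hfile)
    live.remove(hfile)
    archived = load_manifest(fs, archive_manifest_path)
    archived.append(hfile)
    # In a real procedure these two writes would be fenced and retried
    # so that a crash between them is recoverable.
    save_manifest(fs, live_manifest_path, live)
    save_manifest(fs, archive_manifest_path, archived)
```

In a real implementation the two writes would run inside a master procedure so the update survives failover; the sketch only shows the bookkeeping shape.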
> Improving SFT store file layout for S3 and S3-alike object stores
> -----------------------------------------------------------------
>
> Key: HBASE-27841
> URL: https://issues.apache.org/jira/browse/HBASE-27841
> Project: HBase
> Issue Type: New Feature
> Affects Versions: 2.5.4
> Reporter: Andrew Kyle Purtell
> Priority: Major
>
> Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it
> does in HDFS, and this is going to be a problem. S3 throttles IO to buckets
> based on prefix. See
> https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
> {quote}
> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second
> per prefix in an Amazon S3 bucket. There are no limits to the number of
> prefixes that you can have in your bucket. LIST and GET objects don’t share
> the same limit. The performance of LIST calls depend on the number of Deleted
> markers present at the top of an object version for a given prefix.
> {quote}
> Today this looks like:
> {noformat}
> /hbase/data/<namespace>/<table>/<region>/<store>/hfile1
> /hbase/data/<namespace>/<table>/<region>/<store>/hfile2
> ...
> {noformat}
> Unfortunately by-prefix partitioning is performed by S3 in a black box manner
> with no API provided to hint it. Customary file separator characters like '/'
> are not specially considered. Depending on where partitions form one hot
> region could throttle all readers or writers on the whole cluster! Even if
> partitions form on path prefixes at the table or namespace level, although
> not as bad, one hot region could still block readers or writers for a whole
> table or namespace.
> The entities for which we desire parallel access should have their own path
> prefixes. As much as possible we want S3 to stay out of the way when we
> access HFiles in stores in parallel. Therefore we must ensure that HFiles
> in each store can be accessed under different path prefixes. Or, more
> exactly, we must avoid placing the HFiles of various stores into a bucket
> in a way where the paths to any given store’s HFiles share a common path
> prefix with those of another.
> We can continue to represent metadata in a hierarchical manner. Metadata is
> infrequently accessed compared to data because it is cached, or can be made
> cacheable: the size of metadata is a tiny fraction of the size of all data.
> So a resulting layout might look like:
> {noformat}
> /hbase/data/<namespace>/<table>/<region>/file.list
> /<store>/hfile1
> /<store>/hfile2
> ...
> {noformat}
> where {{file.list}} is our current manifest based HFile tracking, managed by
> FileBasedStoreFileTracker. This is simply a relocation of stores to a
> different path construction while maintaining all other housekeeping as is.
> The manifest allows us to make this change easily, and iteratively,
> supporting in-place migration. It seems straightforward to implement as a
> new version of FileBasedStoreFileTracker with an automatic migration path.
> Adapting the HBCK2 support for rebuilding the store file list should also be
> straightforward if we can version the FileBasedStoreFileTracker and teach it
> about the different versions.
> Bucket layouts for the HFile archive should also take the same approach.
> Snapshots are based on archiving so tackling one takes care of the other.
> {noformat}
> /hbase/archive/data/<namespace>/<table>/<region>/file.list
> /<store>/hfile1
> /<store>/hfile2
> ...
> {noformat}
> e.g.
> {noformat}
> /f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
> /f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
> {noformat}
> This is still not entirely ideal, because we cannot know where in the path
> prefixes formed by store hashes S3 will decide to partition. It could be
> that all hashes beginning with {{/f572d396}} fall into one partition for
> throttling, or {{/f572}}, or even {{/f}}. However, given the expected
> Gaussian distribution of load over a keyspace, the use of cryptographic
> hashes as prefixes provides the best possible mitigation of load-based
> throttling on by-prefix partitions of the keyspace.
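The prefix scheme described above can be sketched as follows. This is a hypothetical illustration, not proposed code: SHA-1 is assumed only because the example hashes are 40 hex characters, and the function names are invented here.

```python
# Hedged sketch: derive flat, hash-based key prefixes for stores so that
# no two stores share a meaningful common path prefix in the bucket.
import hashlib

def store_prefix(namespace, table, region, family):
    """Hash the logical store coordinates into a single flat prefix."""
    logical = "/".join((namespace, table, region, family))
    return hashlib.sha1(logical.encode("utf-8")).hexdigest()

def hfile_key(namespace, table, region, family, hfile_name):
    """Full object key: <store-hash>/<hfile>.hfile, sharing no common
    prefix with the keys of any other store's HFiles."""
    prefix = store_prefix(namespace, table, region, family)
    return "%s/%s.hfile" % (prefix, hfile_name)
```

Two stores of the same region hash to unrelated prefixes, so a hot store cannot drag a sibling store into the same S3 throttling partition by prefix alone.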
--
This message was sent by Atlassian Jira
(v8.20.10#820010)