[ 
https://issues.apache.org/jira/browse/HBASE-27841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27841:
----------------------------------------
    Summary: Improving SFT store file layout for S3 and S3-alike object stores  
(was: SFT support for alternative store file layouts for S3 buckets)

> Improving SFT store file layout for S3 and S3-alike object stores
> -----------------------------------------------------------------
>
>                 Key: HBASE-27841
>                 URL: https://issues.apache.org/jira/browse/HBASE-27841
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 2.5.4
>            Reporter: Andrew Kyle Purtell
>            Priority: Major
>
> Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it 
> does in HDFS, and this is going to be a problem.  S3 throttles IO to buckets 
> based on prefix. See 
> https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
> {quote}
> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second 
> per prefix in an Amazon S3 bucket. There are no limits to the number of 
> prefixes that you can have in your bucket. LIST and GET objects don’t share 
> the same limit. The performance of LIST calls depend on the number of Deleted 
> markers present at the top of an object version for a given prefix.
> {quote}
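> To put rough numbers on this, if S3 partitions on our per-store prefixes then 
> aggregate request capacity scales linearly with the number of stores. A 
> back-of-envelope sketch, assuming a hypothetical cluster of 1,000 stores, 
> each under its own partitioned prefix:
> {code:java}
> public class S3PrefixCapacity {
>   public static void main(String[] args) {
>     // Published S3 per-prefix request limits (requests per second).
>     final long GET_HEAD_PER_PREFIX = 5_500;
>     final long PUT_DELETE_PER_PREFIX = 3_500;
>     // Assumption: one distinct, actually-partitioned prefix per store.
>     final long STORE_PREFIXES = 1_000;
>     System.out.println("Aggregate GET/HEAD: "
>         + GET_HEAD_PER_PREFIX * STORE_PREFIXES + " req/s");
>     System.out.println("Aggregate PUT/COPY/POST/DELETE: "
>         + PUT_DELETE_PER_PREFIX * STORE_PREFIXES + " req/s");
>   }
> }
> {code}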
> HBase lays out files in S3 just as it does in HDFS, which today looks like:
> {noformat}
> /hbase/data/<namespace>/<table>/<region>/<store>/hfile1
> /hbase/data/<namespace>/<table>/<region>/<store>/hfile2
> ...
> {noformat}
> This is an anti-pattern for laying out data in an S3 bucket. Prefix 
> partitioning is performed by S3 in a black-box manner, and customary file 
> separator characters like '/' are given no special consideration. One hot 
> region could throttle all readers or writers on the whole cluster! Or S3 
> might decide to throttle on a shorter path prefix, at the table or namespace 
> level, and although that is not as bad, one hot region could still block 
> readers or writers for a whole table or namespace. 
> The entities for which we desire parallel access should have their own path 
> prefixes. As much as possible we do not want S3 to get in the way of 
> accessing HFiles in different stores in parallel. Therefore we must ensure 
> that the HFiles in each store can be accessed by different path prefixes. Or, 
> more exactly, we must avoid placing the HFiles for various stores into a 
> bucket in a way where the paths to any given store’s HFiles share a common 
> path prefix with those of another. One way to derive such prefixes is 
> sketched below.
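> For illustration, a minimal sketch of deriving a per-store prefix by hashing 
> the logical store coordinates to a bucket-root SHA-1 hex string, matching the 
> shape of the example further below. The helper and the exact inputs hashed 
> are assumptions, not a committed design:
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
>
> public class StorePrefix {
>   // Hypothetical: derive a bucket-root prefix from the logical store
>   // coordinates, so that no two stores share a common path prefix.
>   static String storePrefix(String namespace, String table, String region,
>       String family) throws Exception {
>     String logical = namespace + "/" + table + "/" + region + "/" + family;
>     byte[] digest = MessageDigest.getInstance("SHA-1")
>         .digest(logical.getBytes(StandardCharsets.UTF_8));
>     StringBuilder hex = new StringBuilder();
>     for (byte b : digest) {
>       hex.append(String.format("%02x", b));
>     }
>     // e.g. /f572d396fae9206628714fb2ce00f72e94f2258f
>     return "/" + hex;
>   }
>
>   public static void main(String[] args) throws Exception {
>     System.out.println(storePrefix("default", "t1", "region1", "cf"));
>   }
> }
> {code}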
> We can continue to represent metadata in a hierarchical manner. Metadata is 
> accessed infrequently compared to data because it is cached, or can be made 
> cacheable, since metadata amounts to a tiny fraction of the size of all data. 
> So a resulting layout might look like:
> {noformat}
> /hbase/data/<namespace>/<table>/<region>/file.list
> /<store>/hfile1
> /<store>/hfile2
> ...
> {noformat}
> where {{file.list}} is our current manifest-based HFile tracking, managed by 
> FileBasedStoreFileTracker. This is simply a relocation of stores to a 
> different path construction while keeping all other housekeeping as is, and 
> the manifest allows us to make this change easily and iteratively, supporting 
> in-place migration. It seems straightforward to implement as a new version of 
> FileBasedStoreFileTracker with an automatic migration path. Adapting the 
> HBCK2 support for rebuilding the store file list should also be 
> straightforward if we can version the FileBasedStoreFileTracker and teach it 
> about the different versions. 
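> A rough sketch of what version-aware path resolution could look like. The 
> class and method names here are illustrative stand-ins, not the actual 
> FileBasedStoreFileTracker API:
> {code:java}
> // Hypothetical sketch: version-aware resolution of manifest entries to paths.
> // V1 keeps today's hierarchical layout; V2 places each store under its own
> // bucket-root hash prefix.
> public class VersionedFileListResolver {
>   enum LayoutVersion { V1_HIERARCHICAL, V2_HASHED_PREFIX }
>
>   private final LayoutVersion version;
>   private final String regionDir;   // e.g. /hbase/data/<ns>/<table>/<region>
>   private final String storePrefix; // bucket-root hash prefix (V2 only)
>   private final String family;
>
>   VersionedFileListResolver(LayoutVersion version, String regionDir,
>       String storePrefix, String family) {
>     this.version = version;
>     this.regionDir = regionDir;
>     this.storePrefix = storePrefix;
>     this.family = family;
>   }
>
>   // Resolve a file name from the file.list manifest to a full path.
>   String resolve(String hfileName) {
>     switch (version) {
>       case V1_HIERARCHICAL:
>         return regionDir + "/" + family + "/" + hfileName;
>       case V2_HASHED_PREFIX:
>         return storePrefix + "/" + hfileName;
>       default:
>         throw new IllegalStateException("unknown layout version");
>     }
>   }
> }
> {code}
> Migration could then amount to rewriting the manifest with new-layout entries 
> and relocating the HFiles, since readers consult only the manifest.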
> Bucket layouts for the HFile archive should also take the same approach. 
> Snapshots are based on archiving, so tackling one takes care of the other.
> {noformat}
> /hbase/archive/data/<namespace>/<table>/<region>/file.list
> /<store>/hfile1
> /<store>/hfile2
> ...
> {noformat}
> e.g.
> {noformat}
> /f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
> /f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
> {noformat}
> This is still not entirely ideal, because we cannot say where within those 
> store hashes S3 will decide to partition. It could be that all hashes 
> beginning with {{f572d396}} fall into one partition for throttling, or 
> {{f572}}, or even {{f}}. However, given the expected Gaussian distribution of 
> load over a keyspace, the use of hashes as prefixes provides the best 
> possible distribution.
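> As a quick illustration of that argument, one can hash a set of synthetic 
> store names and count how evenly they spread over candidate partition 
> prefixes of different lengths. A sketch, assuming only that S3 partitions on 
> some leading substring of the key:
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
> import java.util.HashMap;
> import java.util.Map;
>
> public class PrefixSpread {
>   public static void main(String[] args) throws Exception {
>     MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
>     // Candidate partition cut points, e.g. f / f572 / f572d396.
>     int[] prefixLengths = { 1, 4, 8 };
>     for (int len : prefixLengths) {
>       Map<String, Integer> buckets = new HashMap<>();
>       for (int i = 0; i < 10_000; i++) {
>         byte[] d = sha1.digest(("store-" + i).getBytes(StandardCharsets.UTF_8));
>         StringBuilder hex = new StringBuilder();
>         for (int j = 0; j < (len + 1) / 2; j++) {
>           hex.append(String.format("%02x", d[j]));
>         }
>         buckets.merge(hex.substring(0, len), 1, Integer::sum);
>       }
>       int max = buckets.values().stream().mapToInt(Integer::intValue).max().getAsInt();
>       // SHA-1 output is effectively uniform, so the load per candidate
>       // partition stays close to 10,000 / buckets.size() at every cut point.
>       System.out.println(len + "-char prefixes: " + buckets.size()
>           + " buckets, max bucket " + max);
>     }
>   }
> }
> {code}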


