[
https://issues.apache.org/jira/browse/HBASE-27841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell updated HBASE-27841:
----------------------------------------
Summary: Improving SFT store file layout for S3 and S3-alike object stores
(was: SFT support for alternative store file layouts for S3 buckets)
> Improving SFT store file layout for S3 and S3-alike object stores
> -----------------------------------------------------------------
>
> Key: HBASE-27841
> URL: https://issues.apache.org/jira/browse/HBASE-27841
> Project: HBase
> Issue Type: New Feature
> Affects Versions: 2.5.4
> Reporter: Andrew Kyle Purtell
> Priority: Major
>
> Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it
> does in HDFS, and this is going to be a problem. S3 throttles IO to buckets
> based on prefix. See
> https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
> {quote}
> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second
> per prefix in an Amazon S3 bucket. There are no limits to the number of
> prefixes that you can have in your bucket. LIST and GET objects don’t share
> the same limit. The performance of LIST calls depend on the number of Deleted
> markers present at the top of an object version for a given prefix.
> {quote}
> HBase lays out files in S3 just as it does in HDFS, which today looks like:
> {noformat}
> /hbase/data/<namespace>/<table>/<region>/<store>/hfile1
> /hbase/data/<namespace>/<table>/<region>/<store>/hfile2
> ...
> {noformat}
> This is an anti-pattern for laying out data in an S3 bucket. Prefix
> partitioning is performed by S3 in a black box manner. Customary file
> separator characters like '/' receive no special consideration. One hot
> region could throttle all readers or writers on the whole cluster! Or S3
> might decide to throttle on a shorter path prefix, at the table or namespace
> level; although not as bad, one hot region could still block readers or
> writers for a whole table or namespace.
> The entities for which we desire parallel access should have their own path
> prefixes. As much as possible we want S3 to stay out of the way when we
> access HFiles in different stores in parallel. Therefore we must ensure that
> the HFiles of each store can be accessed under a distinct path prefix. Or,
> more exactly, we must avoid placing the HFiles of various stores into a
> bucket in a way where the paths to any given store’s HFiles share a common
> path prefix with those of another.
> We can continue to represent metadata in a hierarchical manner. Metadata is
> infrequently accessed compared to data because it is cached, or can be made
> cacheable, since metadata is a tiny fraction of the size of all data. So a
> resulting layout might look like:
> {noformat}
> /hbase/data/<namespace>/<table>/<region>/file.list
> /<store>/hfile1
> /<store>/hfile2
> ...
> {noformat}
> where {{file.list}} is our current manifest based HFile tracking, managed by
> FileBasedStoreFileTracker. This is simply a relocation of stores to a
> different path construction while keeping all of the other housekeeping as
> is, and the manifest allows us to make this change easily, and iteratively,
> supporting in-place migration. It seems straightforward to implement as a
> new version of FileBasedStoreFileTracker with an automatic migration path.
> Adapting the HBCK2 support for rebuilding the store file list should also be
> straightforward if we can version the FileBasedStoreFileTracker and teach it
> about the different versions.
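> As a rough illustration (a standalone sketch, not the actual HBase API;
> class, constant, and method names below are hypothetical), a versioned
> manifest could dispatch between the current hierarchical layout and the
> hashed flat layout during in-place migration:
> {code:java}
> // Sketch only; real integration would live inside FileBasedStoreFileTracker.
> final class VersionedLayoutSketch {
>   static final int V1_HIERARCHICAL = 1; // .../<region>/<store>/<hfile>
>   static final int V2_HASHED = 2;       // /<store-hash>/<hfile>
>
>   static String hfilePath(int manifestVersion, String regionDir,
>       String family, String storeHash, String hfileName) {
>     switch (manifestVersion) {
>       case V1_HIERARCHICAL:
>         // Current layout: HFiles nest under the region directory.
>         return regionDir + "/" + family + "/" + hfileName;
>       case V2_HASHED:
>         // Proposed layout: file.list stays under the region, but HFiles
>         // relocate to a flat, hash-named top-level prefix.
>         return "/" + storeHash + "/" + hfileName;
>       default:
>         throw new IllegalArgumentException(
>             "Unknown manifest version: " + manifestVersion);
>     }
>   }
> }
> {code}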
> Bucket layouts for the HFile archive should also take the same approach.
> Snapshots are based on archiving, so tackling one takes care of the other.
> {noformat}
> /hbase/archive/data/<namespace>/<table>/<region>/file.list
> /<store>/hfile1
> /<store>/hfile2
> ...
> {noformat}
> e.g.
> {noformat}
> /f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
> /f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
> {noformat}
> This is still not entirely ideal, because who can say where in those store
> hashes S3 will decide to partition. It could be that all hashes beginning
> with {{f572d396}}, or {{f572}}, or even just {{f}}, fall into one partition
> for throttling. However, given the expected Gaussian distribution of load
> over a keyspace, the use of hashes as prefixes provides the best possible
> distribution.
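> For concreteness, here is a minimal sketch of how such a store prefix might
> be derived, assuming (this issue does not prescribe it) a SHA-1 hex digest
> of the store’s logical path; {{HashedStorePrefixSketch}} and {{storePrefix}}
> are illustrative names, not existing HBase code:
> {code:java}
> import java.nio.charset.StandardCharsets;
> import java.security.MessageDigest;
>
> public final class HashedStorePrefixSketch {
>   /** Derives a flat bucket prefix such as "f572d396..." for one store. */
>   static String storePrefix(String namespace, String table, String region,
>       String family) throws Exception {
>     String logicalPath = namespace + "/" + table + "/" + region + "/" + family;
>     byte[] digest = MessageDigest.getInstance("SHA-1")
>         .digest(logicalPath.getBytes(StandardCharsets.UTF_8));
>     StringBuilder hex = new StringBuilder(2 * digest.length);
>     for (byte b : digest) {
>       hex.append(String.format("%02x", b));
>     }
>     return hex.toString();
>   }
>
>   public static void main(String[] args) throws Exception {
>     // Hypothetical store coordinates; hashing decorrelates the bucket
>     // keyspace from the table keyspace, so hot regions spread out.
>     System.out.println("/" + storePrefix("default", "usertable",
>         "0123456789abcdef", "cf") + "/hfile1");
>   }
> }
> {code}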
--
This message was sent by Atlassian Jira
(v8.20.10#820010)