[
https://issues.apache.org/jira/browse/HBASE-27841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719467#comment-17719467
]
Andrew Kyle Purtell commented on HBASE-27841:
---------------------------------------------
[~wchevreuil] in this design we also take care of HBASE-27825. The archiving of
a store file would only require the coordinated update of two manifest files,
which could be implemented as a procedure. The extremely inefficient (in
comparison) server side file moves of the hfiles would no longer be necessary.
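To make the idea concrete, here is a minimal sketch (not HBase code; all names such as {{archive_store_file}} and the dict-as-object-store are assumptions for illustration) of archiving as a coordinated update of two manifests, with the HFile object itself never moving:

```python
# Hypothetical sketch: archive a store file by updating two manifest
# files rather than copying/moving the HFile bytes. A plain dict stands
# in for the object store; a manifest is a JSON list of HFile names.
import json

def load_manifest(fs, path):
    """Read a manifest: a JSON list of HFile names, or [] if absent."""
    raw = fs.get(path)
    return json.loads(raw) if raw is not None else []

def save_manifest(fs, path, entries):
    """Write the manifest back; on S3 a single PUT replaces the object."""
    fs[path] = json.dumps(sorted(entries))

def archive_store_file(fs, live_manifest_path, archive_manifest_path, hfile):
    """Procedure body: drop the HFile from the live manifest and record
    it in the archive manifest. The HFile data is never rewritten."""
    live = load_manifest(fs, live_manifest_path)
    if hfile not in live:
        raise ValueError("unknown store file: %s" % hfile)
    live.remove(hfile)
    archived = load_manifest(fs, archive_manifest_path)
    archived.append(hfile)
    # In a real procedure these two writes would be fenced and retried
    # so that a crash between them is recoverable.
    save_manifest(fs, live_manifest_path, live)
    save_manifest(fs, archive_manifest_path, archived)
```

In a real implementation the two writes would run inside a master procedure so the update survives failover; the sketch only shows the bookkeeping shape.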
> Improving SFT store file layout for S3 and S3-alike object stores
> -----------------------------------------------------------------
>
> Key: HBASE-27841
> URL: https://issues.apache.org/jira/browse/HBASE-27841
> Project: HBase
> Issue Type: New Feature
> Affects Versions: 2.5.4
> Reporter: Andrew Kyle Purtell
> Priority: Major
>
> Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it
> does in HDFS, and this is going to be a problem. S3 throttles IO to buckets
> based on prefix. See
> https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
> {quote}
> You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second
> per prefix in an Amazon S3 bucket. There are no limits to the number of
> prefixes that you can have in your bucket. LIST and GET objects don’t share
> the same limit. The performance of LIST calls depend on the number of Deleted
> markers present at the top of an object version for a given prefix.
> {quote}
> Today this looks like:
> {noformat}
> /hbase/data/<namespace>/<table>/<region>/<store>/hfile1
> /hbase/data/<namespace>/<table>/<region>/<store>/hfile2
> ...
> {noformat}
> Unfortunately by-prefix partitioning is performed by S3 in a black box manner
> with no API provided to hint it. Customary file separator characters like '/'
> are not specially considered. Depending on where partitions form one hot
> region could throttle all readers or writers on the whole cluster! Even if
> partitions form on path prefixes at the table or namespace level, although
> not as bad, one hot region could still block readers or writers for a whole
> table or namespace.
> The entities for which we desire parallel access should have their own path
> prefixes. As much as possible we want S3 to stay out of the way when we
> access HFiles in stores in parallel. Therefore we must ensure that HFiles
> in each store can be accessed under different path prefixes. Or, more
> exactly, we must avoid placing the HFiles of various stores into a bucket
> in a way where the paths to any given store’s HFiles share a common path
> prefix with those of another.
> We can continue to represent metadata in a hierarchical manner. Metadata is
> infrequently accessed compared to data because it is cached, or can be made
> cacheable: the size of metadata is a tiny fraction of the size of all data.
> So a resulting layout might look like:
> {noformat}
> /hbase/data/<namespace>/<table>/<region>/file.list
> /<store>/hfile1
> /<store>/hfile2
> ...
> {noformat}
> where {{file.list}} is our current manifest based HFile tracking, managed by
> FileBasedStoreFileTracker. This is simply a relocation of stores to a
> different path construction while maintaining all other housekeeping as is.
> The manifest allows us to make this change easily, and iteratively,
> supporting in-place migration. It seems straightforward to implement as a
> new version of FileBasedStoreFileTracker with an automatic migration path.
> Adapting the HBCK2 support for rebuilding the store file list should also be
> straightforward if we can version the FileBasedStoreFileTracker and teach it
> about the different versions.
> Bucket layouts for the HFile archive should also take the same approach.
> Snapshots are based on archiving so tackling one takes care of the other.
> {noformat}
> /hbase/archive/data/<namespace>/<table>/<region>/file.list
> /<store>/hfile1
> /<store>/hfile2
> ...
> {noformat}
> e.g.
> {noformat}
> /f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
> /f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
> {noformat}
> This is still not entirely ideal, because we cannot know where in the path
> prefixes formed by store hashes S3 will decide to partition. It could be
> that all hashes beginning with {{/f572d396}} fall into one partition for
> throttling, or {{/f572}}, or even {{/f}}. However, given the expected
> Gaussian distribution of load over a keyspace, the use of cryptographic
> hashes as prefixes provides the best possible mitigation of load-based
> throttling on by-prefix partitions of the keyspace.
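The prefix scheme described above can be sketched as follows. This is a hypothetical illustration, not proposed code: SHA-1 is assumed only because the example hashes are 40 hex characters, and the function names are invented here.

```python
# Hedged sketch: derive flat, hash-based key prefixes for stores so that
# no two stores share a meaningful common path prefix in the bucket.
import hashlib

def store_prefix(namespace, table, region, family):
    """Hash the logical store coordinates into a single flat prefix."""
    logical = "/".join((namespace, table, region, family))
    return hashlib.sha1(logical.encode("utf-8")).hexdigest()

def hfile_key(namespace, table, region, family, hfile_name):
    """Full object key: <store-hash>/<hfile>.hfile, sharing no common
    prefix with the keys of any other store's HFiles."""
    prefix = store_prefix(namespace, table, region, family)
    return "%s/%s.hfile" % (prefix, hfile_name)
```

Two stores of the same region hash to unrelated prefixes, so a hot store cannot drag a sibling store into the same S3 throttling partition by prefix alone.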
--
This message was sent by Atlassian Jira
(v8.20.10#820010)