[
https://issues.apache.org/jira/browse/HBASE-27841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell updated HBASE-27841:
----------------------------------------
Description:
Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it
does in HDFS, and this is going to be a problem. S3 throttles IO to buckets
based on prefix. See
https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
{quote}
You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second
per prefix in an Amazon S3 bucket. There are no limits to the number of
prefixes that you can have in your bucket. LIST and GET objects don’t share the
same limit. The performance of LIST calls depend on the number of Deleted
markers present at the top of an object version for a given prefix.
{quote}
Today this looks like:
{noformat}
/hbase/data/<namespace>/<table>/<region>/<store>/hfile1
/hbase/data/<namespace>/<table>/<region>/<store>/hfile2
...
{noformat}
Unfortunately by-prefix partitioning is performed by S3 in a black box manner
with no API provided to hint it. Customary file separator characters like '/'
are not specially considered. Depending on where partitions form, one hot region
could throttle all readers or writers on the whole cluster! Even if partitions
form on path prefixes at the table or namespace level, which is not as bad, one
hot region could still block readers or writers for a whole table or namespace.
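To make the stakes concrete, a bit of illustrative arithmetic using the limits quoted above (not HBase code; just the scaling argument):

```python
# Per-prefix S3 request limits, from the AWS guidance quoted above.
GET_PER_PREFIX = 5500   # GET/HEAD requests per second per prefix
PUT_PER_PREFIX = 3500   # PUT/COPY/POST/DELETE requests per second per prefix

def aggregate_read_capacity(independent_prefixes: int) -> int:
    """Aggregate read capacity scales linearly with distinct prefixes."""
    return independent_prefixes * GET_PER_PREFIX

def aggregate_write_capacity(independent_prefixes: int) -> int:
    """Same linear scaling for writes."""
    return independent_prefixes * PUT_PER_PREFIX
```

If all stores collapse into one partitioned prefix we are capped at 5,500 reads/sec for the whole cluster; with one independent prefix per store the cap multiplies by the store count.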
The entities for which we desire parallel access should have their own path
prefixes. As much as possible we want S3 not to get in the way of accessing
HFiles in stores in parallel. Therefore we must ensure that the HFiles in each
store are reachable under distinct path prefixes. Or, more exactly, we must
avoid placing the HFiles for various stores into a bucket in a way where the
paths to any given store’s HFiles share a common path prefix with those of
another store.
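For illustration, a minimal sketch of how a store’s physical prefix could be decorrelated from its hierarchical location by hashing the logical store path. This is a hypothetical helper, not existing HBase code; it assumes SHA-1-style 40-character hex hashes as in the example layout further down:

```python
import hashlib

def store_prefix(namespace: str, table: str, region: str, family: str) -> str:
    """Derive a bucket-root prefix for a store by hashing its logical path.

    Hashing decorrelates the physical prefix from table/region locality,
    so no two stores share a meaningful common prefix on which S3 could
    partition (and throttle) them together.
    """
    logical = f"{namespace}/{table}/{region}/{family}"
    return hashlib.sha1(logical.encode("utf-8")).hexdigest()

# Two stores of the same region land under unrelated prefixes:
a = store_prefix("default", "usertable", "r1", "cf1")
b = store_prefix("default", "usertable", "r1", "cf2")
```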
We can continue to represent metadata in a hierarchical manner. Metadata is
accessed infrequently compared to data because it is cached, or can be made
cacheable, since metadata is a tiny fraction of the size of all data. So a
resulting layout might look like:
{noformat}
/hbase/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
{noformat}
where {{file.list}} is our current manifest-based HFile tracking, managed by
FileBasedStoreFileTracker. This is simply a relocation of stores to a different
path construction while keeping all of the other housekeeping as is, and the
manifest allows us to make this change easily, and iteratively, supporting
in-place migration. It seems straightforward to implement as a new version of
FileBasedStoreFileTracker with an automatic migration path. Adapting the
HBCK2 support for rebuilding the store file list should also be straightforward
if we can version the FileBasedStoreFileTracker and teach it about the
different versions.
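A rough sketch of what versioned path resolution could look like. The names here are illustrative only, not the actual FileBasedStoreFileTracker API:

```python
import hashlib

def hfile_path(layout_version: int, namespace: str, table: str,
               region: str, family: str, hfile: str) -> str:
    """Resolve an HFile's physical path under a given layout version."""
    store = f"{namespace}/{table}/{region}/{family}"
    if layout_version == 1:
        # v1: today's hierarchical layout.
        return f"/hbase/data/{store}/{hfile}"
    # v2: hashed store prefix at the bucket root; the file.list manifest
    # kept under /hbase/data/... records which prefix each store uses.
    prefix = hashlib.sha1(store.encode("utf-8")).hexdigest()
    return f"/{prefix}/{hfile}"
```

An in-place migration would rewrite each manifest from v1 to v2 paths and relocate the objects; because readers consult the manifest rather than listing directories, both layouts can coexist during the transition.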
Bucket layouts for the HFile archive should also take the same approach.
Snapshots are based on archiving so tackling one takes care of the other.
{noformat}
/hbase/archive/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
{noformat}
e.g.
{noformat}
/f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
/f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
{noformat}
This is still not entirely ideal, because we cannot say where within those
store hashes S3 will decide to partition. It could be that all hashes beginning
with {{f572d396}} fall into one partition for throttling, or {{f572}}, or even
{{f}}. However, given the expected Gaussian distribution of load over a
keyspace, the use of hashes as prefixes provides the best possible distribution.
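A quick illustrative check of that distribution claim (a sketch, not HBase code, assuming SHA-1-style store hashes as in the example above): even when store names are highly correlated, the leading characters of their hashes spread out evenly, so wherever S3 chooses its partition point within the hash, the load stays distributed.

```python
import hashlib
from collections import Counter

# Highly correlated store names: same namespace, table, and column
# family, sequential region names.
stores = [f"default/usertable/region{i:04d}/cf1" for i in range(1600)]

# Count how often each hex digit appears as the first prefix character.
leading = Counter(hashlib.sha1(s.encode("utf-8")).hexdigest()[0]
                  for s in stores)
```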
was:
Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it
does in HDFS, and this is going to be a problem. S3 throttles IO to buckets
based on prefix. See
https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
{quote}
You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second
per prefix in an Amazon S3 bucket. There are no limits to the number of
prefixes that you can have in your bucket. LIST and GET objects don’t share the
same limit. The performance of LIST calls depend on the number of Deleted
markers present at the top of an object version for a given prefix.
{quote}
HBase lays out files in S3 just like it does in HDFS, which is today like:
{noformat}
/hbase/data/<namespace>/<table>/<region>/<store>/hfile1
/hbase/data/<namespace>/<table>/<region>/<store>/hfile2
...
{noformat}
This is an anti-pattern for laying out data in an S3 bucket. Prefix partitioning
is performed by S3 in a black box manner. Customary file separator characters
like '/' are not specially considered. One hot region could throttle all
readers or writers on the whole cluster! Or S3 might decide to throttle on a
path prefix lower down, at the table or namespace level; although not as bad,
one hot region could still block readers or writers for a whole table or
namespace.
The entities for which we desire parallel access should have their own path
prefixes. As much as possible we want S3 to not get in the way of us accessing
HFiles in stores in parallel. Therefore we must ensure that HFiles in each
store can be accessed by different path-prefixes. Or, more exactly, we must
avoid placing the HFiles for various stores into a bucket in a way where the
paths to any given store’s HFiles share a common path prefix with those of
another.
We can continue to represent metadata in a hierarchical manner. Metadata is
accessed infrequently compared to data because it is cached, or can be made
cacheable, since metadata is a tiny fraction of the size of all data. So a
resulting layout might look like:
{noformat}
/hbase/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
{noformat}
where {{file.list}} is our current manifest-based HFile tracking, managed by
FileBasedStoreFileTracker. This is simply a relocation of stores to a different
path construction while keeping all of the other housekeeping as is, and the
manifest allows us to make this change easily, and iteratively, supporting
in-place migration. It seems straightforward to implement as a new version of
FileBasedStoreFileTracker with an automatic migration path. Adapting the
HBCK2 support for rebuilding the store file list should also be straightforward
if we can version the FileBasedStoreFileTracker and teach it about the
different versions.
Bucket layouts for the HFile archive should also take the same approach.
Snapshots are based on archiving so tackling one takes care of the other.
{noformat}
/hbase/archive/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
{noformat}
e.g.
{noformat}
/f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
/f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
{noformat}
This is still not entirely ideal, because we cannot say where within those
store hashes S3 will decide to partition. It could be that all hashes beginning
with {{f572d396}} fall into one partition for throttling, or {{f572}}, or even
{{f}}. However, given the expected Gaussian distribution of load over a
keyspace, the use of hashes as prefixes provides the best possible distribution.
> Improving SFT store file layout for S3 and S3-alike object stores
> -----------------------------------------------------------------
>
> Key: HBASE-27841
> URL: https://issues.apache.org/jira/browse/HBASE-27841
> Project: HBase
> Issue Type: New Feature
> Affects Versions: 2.5.4
> Reporter: Andrew Kyle Purtell
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)