[ 
https://issues.apache.org/jira/browse/HBASE-27841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27841:
----------------------------------------
    Description: 
Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it 
does in HDFS, and this is going to be a problem. S3 throttles IO to buckets 
based on prefix. See 
[https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/]
{quote}You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per 
second per prefix in an Amazon S3 bucket. There are no limits to the number of 
prefixes that you can have in your bucket. LIST and GET objects don’t share the 
same limit. The performance of LIST calls depend on the number of Deleted 
markers present at the top of an object version for a given prefix.
{quote}
Today this looks like:
{noformat}
/hbase/data/<namespace>/<table>/<region>/<store>/hfile1
/hbase/data/<namespace>/<table>/<region>/<store>/hfile2
...
{noformat}
Unfortunately, by-prefix partitioning is performed by S3 as a black box, with 
no API provided to hint it. Customary file separator characters like '/' are 
given no special consideration. 

The situation we want to avoid is where the load accounted to one or more hot 
stores aggregates up to an inopportune choke point where S3 may have auto 
partitioned at the region, table, or namespace level. Paths to any given store 
should avoid sharing a common path prefix with those of another.

We can continue to represent metadata in a hierarchical manner. Metadata is 
infrequently accessed compared to data: it is cached, or can be made cacheable, 
because metadata is a tiny fraction of the size of all data. So a resulting 
layout might look like:
{noformat}
/hbase/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
{noformat}
where {{file.list}} is our current manifest based HFile tracking, managed by 
FileBasedStoreFileTracker. This is simply a relocation of stores to a different 
path construction while maintaining all of the other housekeeping as-is. The 
file based store file tracker manifest allows us to make this change easily, 
and iteratively, supporting in-place migration. It seems straightforward to 
implement as a new version of FileBasedStoreFileTracker with an automatic path to 
migration. Adapting the HBCK2 support for rebuilding the store file list should 
also be straightforward if we can version the FileBasedStoreFileTracker and 
teach it about the different versions.
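
The relocation above is essentially a path-mapping change. A minimal sketch in 
Java of how such a mapping might look, assuming a SHA-1 hash over the store's 
identifying metadata; class and method names here are hypothetical 
illustrations, not actual HBase or FileBasedStoreFileTracker API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the proposed relocation: the manifest keeps its
// hierarchical path, while each store's HFiles move under a flat,
// hash-derived top-level prefix. Names are illustrative, not HBase API.
public final class HashedStoreLayout {

  // Derive a bucket-level prefix from the store's identifying metadata.
  static String storePrefix(String namespace, String table, String region,
      String family) {
    return sha1Hex(namespace + "/" + table + "/" + region + "/" + family);
  }

  // file.list stays in the hierarchical, infrequently accessed location.
  static String manifestPath(String namespace, String table, String region) {
    return "/hbase/data/" + namespace + "/" + table + "/" + region
        + "/file.list";
  }

  // HFiles land directly under the hashed prefix at the bucket root.
  static String hfilePath(String storePrefix, String hfileName) {
    return "/" + storePrefix + "/" + hfileName;
  }

  static String sha1Hex(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-1")
          .digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b & 0xff));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new AssertionError(e); // SHA-1 is always present in the JDK
    }
  }
}
```

The manifest is the only piece that needs to know both path constructions, 
which is what makes an iterative, in-place migration plausible.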

Bucket layouts for the HFile archive should also take the same approach. 
Snapshots are based on archiving so tackling one takes care of the other.
{noformat}
/hbase/archive/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
{noformat}
e.g.
{noformat}
/f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
/f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
{noformat}
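
Prefixes like the example above should spread stores roughly uniformly over 
whatever prefix depth S3 chooses to partition at. A quick sketch to illustrate; 
the store identifiers and class name are hypothetical, for illustration only:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Illustration only: SHA-1 prefixes spread synthetic store identifiers
// nearly evenly across the 16 possible leading hex characters, so an S3
// partition split at any prefix depth would see a near-uniform share of
// stores.
public final class PrefixSpread {

  // Count stores by the first hex character of their SHA-1 prefix.
  public static Map<Character, Integer> spread(int stores) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      Map<Character, Integer> counts = new HashMap<>();
      for (int i = 0; i < stores; i++) {
        byte[] d = md.digest(
            ("ns/table/region-" + i + "/cf").getBytes(StandardCharsets.UTF_8));
        char first = "0123456789abcdef".charAt((d[0] >> 4) & 0xf);
        counts.merge(first, 1, Integer::sum);
      }
      return counts;
    } catch (NoSuchAlgorithmException e) {
      throw new AssertionError(e); // SHA-1 is always present in the JDK
    }
  }
}
```

With 16,000 synthetic stores, each leading hex character ends up with close to 
1,000 stores, and the same evenness holds at deeper prefix depths.
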
This is still not entirely ideal, but is the best we can do. Using 
cryptographic hashes of store metadata as prefixes distributes the placement of 
any given store into any potential S3 partition randomly. The probability of a 
client accessing any particular point in the HBase keyspace has a similar 
random distribution. These will not be the _same_ distribution, but should be a 
reasonable approximation. What should happen is hotspots will only impact 
clients accessing the specific store and region that is hotspotting, like 
today, and the blast radius is not significantly wider.

  was:
Today’s HFile on S3 support lays out "files" in the S3 bucket exactly as it 
does in HDFS, and this is going to be a problem.  S3 throttles IO to buckets 
based on prefix. See 
https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/
{quote}
You can send 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second 
per prefix in an Amazon S3 bucket. There are no limits to the number of 
prefixes that you can have in your bucket. LIST and GET objects don’t share the 
same limit. The performance of LIST calls depend on the number of Deleted 
markers present at the top of an object version for a given prefix.
{quote}

Today this looks like:
{noformat}
/hbase/data/<namespace>/<table>/<region>/<store>/hfile1
/hbase/data/<namespace>/<table>/<region>/<store>/hfile2
...
{noformat}

Unfortunately by-prefix partitioning is performed by S3 in a black box manner 
with no API provided to hint it. Customary file separator characters like '/' 
are not specially considered. Depending on where partitions form one hot region 
could throttle all readers or writers on the whole cluster! Even if partitions 
form on path prefixes at the table or namespace level, although not as bad, one 
hot region could still block readers or writers for a whole table or namespace. 

The entities for which we desire parallel access should have their own path 
prefixes. As much as possible we want S3 to not get in the way of us accessing 
HFiles in stores in parallel. Therefore we must ensure that HFiles in each 
store can be accessed by different path-prefixes. Or, more exactly, we must 
avoid placing the HFiles for various stores into a bucket in a way where the 
paths to any given store’s HFiles share a common path prefix with those of 
another.

We can continue to represent metadata in a hierarchical manner. Metadata is 
infrequently accessed compared to data because it is cached, or can be made to 
be cached, because the size of metadata is a tiny fraction of the size of all 
data. So a resulting layout might look like:
{noformat}
/hbase/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
{noformat}

where {{file.list}} is our current manifest based HFile tracking, managed by 
FileBasedStoreFileTracker. This is simply a relocation of stores to a different 
path construction while maintaining all of the other housekeeping as is and the 
manifest allows us to make this change easily, and iteratively, supporting 
in-place migration. It seems straightforward to implement as new version of 
FileBasedStoreFileTracker with an automatic path to migration. Adapting the 
HBCK2 support for rebuilding the store file list should also be straightforward 
if we can version the FileBasedStoreFileTracker and teach it about the 
different versions. 

Bucket layouts for the HFile archive should also take the same approach. 
Snapshots are based on archiving so tackling one takes care of the other.

{noformat}
/hbase/archive/data/<namespace>/<table>/<region>/file.list
/<store>/hfile1
/<store>/hfile2
...
{noformat}

e.g.

{noformat}
/f572d396fae9206628714fb2ce00f72e94f2258f/7f900f36ebc78d125c773ac0e3a000ad355b8ba1.hfile
/f572d396fae9206628714fb2ce00f72e94f2258f/988881adc9fc3655077dc2d4d757d480b5ea0e11.hfile
{noformat}

This is still not entirely ideal, because who can say where in the path 
prefixes formed by store hashes S3 will decide to partition. It could be that 
all hashes beginning with {{/f572d396}} fall into one partition for throttling 
or {{/f572}} or even {{/f}}. However, given the expected Gaussian distribution 
of load over a keyspace, the use of cryptographic hashes as prefixes provides 
the best possible mitigation of load based throttling on by-prefix partitions 
of the keyspace.


> Improving SFT store file layout for S3 and S3-alike object stores
> -----------------------------------------------------------------
>
>                 Key: HBASE-27841
>                 URL: https://issues.apache.org/jira/browse/HBASE-27841
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 2.5.4
>            Reporter: Andrew Kyle Purtell
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
