[
https://issues.apache.org/jira/browse/HBASE-29727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wellington Chevreuil updated HBASE-29727:
-----------------------------------------
Description:
For every block added to BucketCache, we create and keep a BlockCacheKey object
with a String attribute for the name of the file the block belongs to, plus the
Path containing the entire path of that file. HFiles normally contain many
blocks, and for all blocks from the same file these attributes have the very
same value, yet we create separate instances for each block.
When using a file based bucket cache, where the cache size is in the TB
magnitude, the total block count in the cache can grow very large, and so does
the heap used by the BucketCache object, due to the high count of BlockCacheKey
instances it has to keep.
For a few years now, the reference architecture at my employer for hbase
clusters on the cloud has been to deploy the hbase root dir on cloud storage,
then use the ephemeral SSD disks shipped with the RS node VMs for a file based
BucketCache. At the moment, the standard VM profile used allows for as much as
1.6TB of BucketCache capacity. For a cache of such size, with the default block
size of 64KB, we see on average 30M blocks, with a minimal heap usage around
12GB.
With cloud providers now offering different VM profiles with more ephemeral SSD
disk capacity, we are looking for alternatives to optimise the heap usage by
BucketCache. The approach proposed here is to define a "string pool" for
mapping the String attributes in the BlockCacheKey class to integer ids, so
that we can save some bytes for blocks from the same file.
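As a rough illustration of the string pool idea, the sketch below maps each
distinct string to an int id and lets a key object store the id instead of a
String reference. All names here (StringPoolSketch, PooledKey, encode, decode)
are hypothetical and are not taken from the actual patch. The sketch also
leaves out releasing ids when a file's blocks are evicted, which a real
implementation would presumably need.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; not the HBASE-29727 implementation. */
final class StringPoolSketch {
  // One entry per distinct string, shared by every key that refers to it.
  private final Map<String, Integer> idsByValue = new HashMap<>();
  private final List<String> valuesById = new ArrayList<>();

  /** Returns the id for the given value, registering it on first use. */
  synchronized int encode(String value) {
    Integer id = idsByValue.get(value);
    if (id == null) {
      id = valuesById.size();
      valuesById.add(value);
      idsByValue.put(value, id);
    }
    return id;
  }

  /** Resolves an id back to the pooled string. */
  synchronized String decode(int id) {
    return valuesById.get(id);
  }

  synchronized int size() {
    return valuesById.size();
  }

  /** Hypothetical cache key that holds an int id rather than a String. */
  static final class PooledKey {
    private final int fileNameId;
    private final long offset;

    PooledKey(int fileNameId, long offset) {
      this.fileNameId = fileNameId;
      this.offset = offset;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof PooledKey)) {
        return false;
      }
      PooledKey other = (PooledKey) o;
      return fileNameId == other.fileNameId && offset == other.offset;
    }

    @Override
    public int hashCode() {
      return 31 * fileNameId + Long.hashCode(offset);
    }
  }

  public static void main(String[] args) {
    StringPoolSketch pool = new StringPoolSketch();
    // Two blocks of the same (made-up) file share one pooled string.
    PooledKey first = new PooledKey(pool.encode("hfile-a"), 0L);
    PooledKey second = new PooledKey(pool.encode("hfile-a"), 65536L);
    System.out.println("pooled strings: " + pool.size());
    System.out.println("file of first key: " + pool.decode(first.fileNameId));
    System.out.println("keys equal: " + first.equals(second));
  }
}
```

With this layout, the per-key cost of the file name is a 4 byte int, and the
string itself is stored once per file rather than once per block.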
For a 1.6TB fully used cache with block size averaging around 60KB, we get
about 30M blocks. The heap savings in this case were 2GB, falling from 10GB to
8GB usage.
With the same 1.6TB fully used cache, but with block size adjusted to produce
double the number of blocks (60M), the observed heap footprint without this
change was 25GB, falling to 21GB with this change.
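As a back-of-the-envelope check on the figures above, the snippet below
derives the approximate heap saved per block from the reported totals. It
assumes that "GB" means GiB and that the block counts are exactly 30M and 60M.
Both are rounded values, so the result is only an estimate. The class and
method names are made up for this illustration.

```java
/** Hypothetical helper; the inputs are the rounded figures quoted above. */
final class HeapSavingsEstimate {
  private static final double BYTES_PER_GIB = 1024.0 * 1024.0 * 1024.0;

  /** Heap bytes saved per cached block, given heap before/after in GiB. */
  static double bytesSavedPerBlock(double beforeGib, double afterGib, long blocks) {
    return (beforeGib - afterGib) * BYTES_PER_GIB / blocks;
  }

  /** Fraction of the original heap footprint that was saved. */
  static double relativeSaving(double beforeGib, double afterGib) {
    return (beforeGib - afterGib) / beforeGib;
  }

  public static void main(String[] args) {
    // 30M blocks: 10GB before, 8GB after.
    System.out.printf("30M blocks: %.1f bytes/block, %.0f%% saved%n",
        bytesSavedPerBlock(10, 8, 30_000_000L), 100 * relativeSaving(10, 8));
    // 60M blocks: 25GB before, 21GB after.
    System.out.printf("60M blocks: %.1f bytes/block, %.0f%% saved%n",
        bytesSavedPerBlock(25, 21, 60_000_000L), 100 * relativeSaving(25, 21));
  }
}
```

Under these assumptions both scenarios work out to roughly 70 bytes saved per
block, that is 20% of the heap in the first case and 16% in the second.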
> Introduce a String pool for repeating filename, region and cf string fields
> in BlockCacheKey
> --------------------------------------------------------------------------------------------
>
> Key: HBASE-29727
> URL: https://issues.apache.org/jira/browse/HBASE-29727
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 3.0.0-beta-1, 2.7.0, 2.6.4
> Reporter: Wellington Chevreuil
> Assignee: Wellington Chevreuil
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5
>