[jira] [Updated] (HBASE-29727) Introduce a String pool for repeating filename, region and cf string fields in BlockCacheKey

Wellington Chevreuil (Jira) Thu, 11 Jun 2026 06:27:13 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-29727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wellington Chevreuil updated HBASE-29727:
-----------------------------------------
    Description: 
For every block added to BucketCache, we create and keep a BlockCacheKey object 
with a String attribute for the name of the file the block belongs to, plus the 
Path containing the entire path for that file. HFiles normally contain many 
blocks, and for all blocks from the same file these attributes have the very 
same value, yet we create separate instances for each block. When using a file 
based bucket cache, where the cache size is in the TB range, the total block 
count in the cache can grow very large, and so does the heap used by the 
BucketCache object, due to the high count of BlockCacheKey instances it has to 
keep.

For a few years now, my employer's reference architecture for hbase clusters on 
the cloud has been to deploy the hbase root dir on cloud storage, then use the 
ephemeral SSD disks shipped with the RS node VMs for a file based BucketCache. 
At the moment, the standard VM profile used allows for as much as 1.6TB of 
BucketCache capacity. For a cache of such size, with the default block size of 
64KB, we see on average 30M blocks, with a minimal heap usage around 12GB.

With cloud providers now offering VM profiles with more ephemeral SSD disk 
capacity, we are looking for ways to optimise the heap usage of BucketCache. 
The approach proposed here is to define a "string pool" that maps the String 
attributes in the BlockCacheKey class to integer ids, so that we can save some 
bytes for blocks from the same file.
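
As a rough illustration of the idea only (this is not the code from the pull 
request), a pool that maps repeated strings to integer ids could be sketched in 
Java as below. The class names StringPoolSketch and PooledKeySketch, and all of 
their methods, are hypothetical and are not part of HBase; releasing ids when a 
file's blocks are evicted is left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: maps repeated strings to small integer ids. */
final class StringPoolSketch {
  private final Map<String, Integer> idsByValue = new HashMap<>();
  private final List<String> valuesById = new ArrayList<>();

  /** Returns the id for value, assigning the next free id the first time it is seen. */
  synchronized int encode(String value) {
    Integer id = idsByValue.get(value);
    if (id == null) {
      id = valuesById.size();
      idsByValue.put(value, id);
      valuesById.add(value);
    }
    return id;
  }

  /** Returns the string previously registered under id. */
  synchronized String decode(int id) {
    return valuesById.get(id);
  }

  /** Number of distinct strings held by the pool. */
  synchronized int size() {
    return valuesById.size();
  }
}

/** Hypothetical key shape: holds an int id instead of a String reference per block. */
final class PooledKeySketch {
  private final int fileNameId;
  private final long offset;

  PooledKeySketch(StringPoolSketch pool, String fileName, long offset) {
    this.fileNameId = pool.encode(fileName);
    this.offset = offset;
  }

  /** Resolves the file name back from the pool when it is actually needed. */
  String fileName(StringPoolSketch pool) {
    return pool.decode(fileNameId);
  }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof PooledKeySketch)) {
      return false;
    }
    PooledKeySketch that = (PooledKeySketch) other;
    return fileNameId == that.fileNameId && offset == that.offset;
  }

  @Override
  public int hashCode() {
    return 31 * fileNameId + Long.hashCode(offset);
  }
}
```

In this sketch, any number of keys created for the same file name share a 
single pooled String, and each key carries only an int and a long.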

 

For a fully used 1.6TB cache with block size averaging around 60KB, we get 
about 30M blocks. The heap savings in this case were 2GB, with usage falling 
from 10GB to 8GB.

With the same fully used 1.6TB cache, but with block size adjusted to produce 
double the number of blocks (60M), the observed heap footprint without this 
change was 25GB, falling to 21GB with this change.
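
As a back-of-the-envelope check on the two measurements above, the saving per 
cached block can be worked out as sketched below. The class and method names 
are hypothetical, and the calculation assumes decimal units (1GB = 10^9 bytes) 
and the rounded figures quoted in this description, so the result is only 
approximate.

```java
/** Hypothetical helper: rough heap saving per cached block, from rounded figures. */
final class HeapSavingsSketch {
  /** Assumes 1GB = 1e9 bytes; inputs are the rounded numbers quoted in the description. */
  static double bytesSavedPerBlock(double heapBeforeGb, double heapAfterGb, long blockCount) {
    return (heapBeforeGb - heapAfterGb) * 1e9 / blockCount;
  }
}
```

Under those assumptions, bytesSavedPerBlock(10, 8, 30_000_000L) and 
bytesSavedPerBlock(25, 21, 60_000_000L) both come out at roughly 67 bytes per 
block, so the two measurements are consistent with each other.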

  was:
For every block added to BucketCache, we create and keep a BlockCacheKey object 
with a String attribute for the file name the blocks belong to, plus the Path 
containing the entire path for the given file. HFiles will normally contain 
many blocks, and for all blocks from a same file, these attributes will have 
the very same value, yet, we create different instances for each of the blocks. 
When using file based bucket cache, where the bucket cache size is in the TB 
magnitude, the total block count in the cache can grow very large, and so is 
the heap used by the BucketCache object, due to the high count of BlockCacheKey 
instances it has to keep.

For a few years now, the reference architecture with my employer for hbase 
clusters on the cloud  has been to deploy hbase root dir on cloud storage, then 
use ephemeral SSD disks shipped within the RSes node VMs to for a file based 
BucketCache. At the moment, the standard VM profile used allows for as much as 
1.6TB of BucketCache capacity. For a cache of such size, with the default block 
size of 64KB, we see on average, 30M blocks, with a minimal heap usage around 
12GB.

With cloud providers now offering different VM profiles with more ephemeral SSD 
disks capacity, we are looking for alternatives to optimise the heap usage by 
BucketCache. The approach proposed here, is to define a "string pool" for 
mapping the String attributes in the BlockCacheKey class to integer ids, so 
that we can save some bytes for blocks from same file. 




> Introduce a String pool for repeating filename, region and cf string fields 
> in BlockCacheKey
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29727
>                 URL: https://issues.apache.org/jira/browse/HBASE-29727
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-beta-1, 2.7.0, 2.6.4
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5
>
>
> For every block added to BucketCache, we create and keep a BlockCacheKey 
> object with a String attribute for the name of the file the block belongs to, 
> plus the Path containing the entire path for that file. HFiles normally 
> contain many blocks, and for all blocks from the same file these attributes 
> have the very same value, yet we create separate instances for each block. 
> When using a file based bucket cache, where the cache size is in the TB 
> range, the total block count in the cache can grow very large, and so does 
> the heap used by the BucketCache object, due to the high count of 
> BlockCacheKey instances it has to keep.
> For a few years now, my employer's reference architecture for hbase clusters 
> on the cloud has been to deploy the hbase root dir on cloud storage, then use 
> the ephemeral SSD disks shipped with the RS node VMs for a file based 
> BucketCache. At the moment, the standard VM profile used allows for as much 
> as 1.6TB of BucketCache capacity. For a cache of such size, with the default 
> block size of 64KB, we see on average 30M blocks, with a minimal heap usage 
> around 12GB.
> With cloud providers now offering VM profiles with more ephemeral SSD disk 
> capacity, we are looking for ways to optimise the heap usage of BucketCache. 
> The approach proposed here is to define a "string pool" that maps the String 
> attributes in the BlockCacheKey class to integer ids, so that we can save 
> some bytes for blocks from the same file.
>  
> For a fully used 1.6TB cache with block size averaging around 60KB, we get 
> about 30M blocks. The heap savings in this case were 2GB, with usage falling 
> from 10GB to 8GB.
> With the same fully used 1.6TB cache, but with block size adjusted to produce 
> double the number of blocks (60M), the observed heap footprint without this 
> change was 25GB, falling to 21GB with this change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
