Re: [DISCUSS] Merge HBASE-28463 (Time Based Priority for BucketCache) into release branches

Wellington Chevreuil Fri, 08 Aug 2025 05:14:52 -0700

Thanks for your reply, Stephen. I have rebased the feature branch onto master
and opened the PR below to merge it into master:


https://github.com/apache/hbase/pull/7192

Further reviews are welcome.

Regards,
Wellington

On 2025/08/04 17:45:18 "Tak Lon (Stephen) Wu" wrote:
> +1, the new feature looks good with table-level setup, especially when the
> cache is exhausted.
> 
> 
> On Wed, Jul 30, 2025 at 1:41 PM Wellington Chevreuil <
> [email protected]> wrote:
> 
> > Attaching the YCSB runs mentioned previously:
> >
> >
> > https://docs.google.com/document/d/1juZoND9ju4daOUkHnlcYzU3Nr7wUQGKEORRHll3KPRQ/edit?tab=t.0
> >
> > On Wed, Jul 30, 2025 at 5:38 PM Wellington Chevreuil <
> > [email protected]> wrote:
> >
> > > Greetings everyone! As previously shared in this email
> > > <https://lists.apache.org/thread/jr1cljrdct01xtqsrgp4fpb301j9h72k>, we
> > > have been working on this functionality at Cloudera for some time, and as
> > > we prepare to make it GA for our broader customer base, we thought it
> > > could be a nice addition to the Apache HBase distribution too.
> > >
> > > The most relevant use case for this functionality is deploying the HBase
> > > root dir on cloud object storage, such as S3, and relying on a file-based
> > > bucket cache for optimal performance. For datasets whose records carry a
> > > date and whose access pattern follows that date, i.e., the most accessed
> > > data are those with the most recent date values, time-based priority can
> > > be configured so that only this recent data needs to be kept in the
> > > cache.
> > >
> > > The current Time Based Priority for BucketCache implementation allows for
> > > defining an "age" threshold for blocks to be kept in the BucketCache.
> > > Blocks "older" than this threshold bypass the BucketCache when read
> > > (even when cacheOnRead is enabled), and already cached blocks that age
> > > past the threshold are picked first by eviction runs.
> > >
> > > It has been developed in two stages:
> > > 1) Time Based Priority for BucketCache: the initial framework for
> > > extracting block age and the block priority logic in BucketCache. This
> > > relies on the builtin cell timestamps for determining the block age, and
> > > the existing DateTieredCompaction for grouping blocks of similar age
> > > within the same file. The related design doc
> > > <https://docs.google.com/document/d/1Qd3kvZodBDxHTFCIRtoePgMbvyuUSxeydi2SEWQFQro/edit?tab=t.0#heading=h.gjdgxs>
> > > has been shared in the parent jira and in the discussion email mentioned
> > > above.
> > > 2) Custom Time Based BucketCache Priority: an enhancement over the
> > > initial development, it extends DateTieredCompaction to allow custom
> > > values to be used for grouping cells into separate files. Custom value
> > > providers can be plugged into the framework, so that user schema specific
> > > values can now be used for defining cache priority. The original cell
> > > timestamp based priority has been wrapped into a builtin provider
> > > implementation, and a qualifier based provider has also been defined.
> > > This second phase design doc
> > > <https://docs.google.com/document/d/1uBGIO9IQ-FbSrE5dnUMRtQS23NbCbAmRVDkAOADcU_E/edit?tab=t.0#heading=h.jxvnkznuj997>
> > > has also been shared in the related jira.
> > >
> > > The feature requires a global flag (disabled by default) to be turned on
> > > in order to even perform age checks. It also requires extra configuration
> > > on individual column families, as only blocks for the configured column
> > > families would have the age checked. Blocks from column families not
> > > defining any time based priority settings would simply be treated as high
> > > priority ones and have preference to be cached.
> > >
> > > Our suggestion is to merge this into master, branch-3 and branch-2. We
> > > executed some YCSB runs to compare different setups for the feature (all
> > > using S3 as the root dir storage), plus a binary version not containing
> > > this code as a baseline on the same hardware. We see relevant impacts in
> > > the scenarios where the dataset doesn't fit into the cache capacity, and
> > > little deviation otherwise.
> > >
> > > Best Regards,
> > > Wellington
> > >
> >
>
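
For readers skimming the quoted proposal, the caching decision it describes (global flag, per-column-family age threshold, unconfigured families treated as high priority) can be sketched roughly as below. This is only an illustration under stated assumptions, not the HBASE-28463 code: the class, field and method names are all hypothetical, and a single `blockTimestampMs` stands in for however the real implementation derives a block's age. Eviction ordering is not covered.

```java
import java.time.Duration;
import java.util.Map;

// Illustration only: every class, field and method name here is hypothetical
// and does not mirror the actual HBASE-28463 implementation.
final class TimeBasedPrioritySketch {

  // Stand-in for the global flag, which the proposal says is disabled by default.
  private final boolean featureEnabled;
  // Stand-in for the per-column-family configuration: maximum age of "hot" data.
  private final Map<String, Duration> hotAgeByFamily;

  TimeBasedPrioritySketch(boolean featureEnabled, Map<String, Duration> hotAgeByFamily) {
    this.featureEnabled = featureEnabled;
    this.hotAgeByFamily = Map.copyOf(hotAgeByFamily);
  }

  /** Returns true when a block should be cached on read and kept at high priority. */
  boolean isHot(String family, long blockTimestampMs, long nowMs) {
    if (!featureEnabled) {
      return true; // no age checks at all when the global flag is off
    }
    Duration maxAge = hotAgeByFamily.get(family);
    if (maxAge == null) {
      return true; // families without a threshold are treated as high priority
    }
    // Blocks older than the threshold are "cold" and would bypass the cache.
    return nowMs - blockTimestampMs <= maxAge.toMillis();
  }

  public static void main(String[] args) {
    TimeBasedPrioritySketch sketch =
        new TimeBasedPrioritySketch(true, Map.of("events", Duration.ofDays(7)));
    long now = System.currentTimeMillis();
    long dayMs = Duration.ofDays(1).toMillis();
    System.out.println(sketch.isHot("events", now - dayMs, now));      // recent block
    System.out.println(sketch.isHot("events", now - 30 * dayMs, now)); // aged block
    System.out.println(sketch.isHot("meta", now - 30 * dayMs, now));   // unconfigured family
  }
}
```

In this sketch, only the aged block in the configured `events` family is reported as cold; the other two calls report the block as hot.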
