Thanks for your reply, Stephen. I have rebased the feature branch onto master and opened the PR below to merge it into master:
https://github.com/apache/hbase/pull/7192

Further reviews are welcome.

Regards,
Wellington

On 2025/08/04 17:45:18 "Tak Lon (Stephen) Wu" wrote:
> +1, the new feature looks good with table level setup, especially when the
> cache is exhausted.
>
> On Wed, Jul 30, 2025 at 1:41 PM Wellington Chevreuil
> <wellington.chevre...@gmail.com> wrote:
>
> > Attaching the YCSB runs mentioned previously:
> >
> > https://docs.google.com/document/d/1juZoND9ju4daOUkHnlcYzU3Nr7wUQGKEORRHll3KPRQ/edit?tab=t.0
> >
> > On Wed, Jul 30, 2025 at 17:38, Wellington Chevreuil
> > <wellington.chevre...@gmail.com> wrote:
> >
> > > Greetings everyone! As previously shared in this email
> > > <https://lists.apache.org/thread/jr1cljrdct01xtqsrgp4fpb301j9h72k>, we
> > > have been working on this functionality at Cloudera for some time, and
> > > as we prepare to make it GA for our broader customer base, we thought
> > > it could be a nice addition to the Apache HBase distribution too.
> > >
> > > The most relevant use case for this functionality is when deploying the
> > > HBase root dir on an object store cloud storage, such as S3, relying on
> > > file based bucket cache for optimal performance. For datasets where
> > > records have a concept of date and an access pattern based on such date
> > > values, i.e., the most accessed data are those with the most recent
> > > date value, time based priority can be configured so that only this
> > > recent data needs to be kept in the cache.
> > >
> > > The current Time Based Priority for BucketCache implementation allows
> > > for defining an "age" threshold for blocks to be kept in the
> > > BucketCache. Blocks "older" than this threshold bypass the BucketCache
> > > when read (even when cacheOnRead is enabled), and already cached blocks
> > > that age past the threshold are picked first by eviction runs.
> > >
> > > It has been developed in two stages:
> > > 1) Time Based Priority for BucketCache: the initial framework for
> > > extracting block age and the block priority logic in BucketCache. This
> > > relies on the built-in cell timestamps for determining the block age,
> > > and on the existing DateTieredCompaction for grouping blocks of similar
> > > age within the same file. The related design doc
> > > <https://docs.google.com/document/d/1Qd3kvZodBDxHTFCIRtoePgMbvyuUSxeydi2SEWQFQro/edit?tab=t.0#heading=h.gjdgxs>
> > > has been shared in the parent Jira and in the discussion email
> > > mentioned above.
> > > 2) Custom Time Based BucketCache Priority: an enhancement over the
> > > initial development, it extends DateTieredCompaction to allow custom
> > > values to be used for grouping cells into separate files. Custom value
> > > providers can be plugged into the framework, so that user schema
> > > specific values can now be used for defining cache priority. The
> > > original cell timestamp based priority has been wrapped into a built-in
> > > provider implementation, and a qualifier based provider has also been
> > > defined. This second phase design doc
> > > <https://docs.google.com/document/d/1uBGIO9IQ-FbSrE5dnUMRtQS23NbCbAmRVDkAOADcU_E/edit?tab=t.0#heading=h.jxvnkznuj997>
> > > has also been shared in the related Jira.
> > >
> > > The feature requires a global flag (disabled by default) to be turned
> > > on before any age checks are performed. It also requires extra
> > > configuration on individual column families, as only blocks from the
> > > configured column families have their age checked. Blocks from column
> > > families that do not define any time based priority settings are simply
> > > treated as high priority and have preference to be cached.
> > >
> > > Our suggestion is to have this merged into the master, branch-3 and
> > > branch-2 branches.
> > > We executed some YCSB runs to compare different setups for the feature
> > > (all using S3 as the root dir storage), as well as a binary version not
> > > containing this code as a baseline comparison on the same hardware.
> > > While we see relevant impacts in the scenarios where the dataset
> > > doesn't fit into the cache capacity, we see little deviation otherwise.
> > >
> > > Best Regards,
> > > Wellington
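
For readers skimming the thread, the gating described in the quoted proposal (global flag, per column family setting, age threshold, eviction preference) can be sketched roughly as below. This is only an illustration under stated assumptions: the class, method and parameter names are hypothetical and are not taken from the HBase code base, a single representative timestamp per block is assumed, and treating a block exactly at the threshold as still "hot" is an arbitrary choice made for the sketch.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; every name here is hypothetical, not an HBase API. */
final class TimeBasedPrioritySketch {

  /** Minimal stand-in for a cached block: a name plus a representative timestamp. */
  static final class Block {
    final String name;
    final long timestampMs;

    Block(String name, long timestampMs) {
      this.name = name;
      this.timestampMs = timestampMs;
    }
  }

  /**
   * Returns true when a block should be treated as high priority.
   * familyHotAge is the (assumed) per column family age threshold;
   * null means the family defines no time based priority settings.
   */
  static boolean isHot(boolean globalFlagEnabled, Duration familyHotAge,
      long blockTimestampMs, long nowMs) {
    // Global flag off: no age checks are performed at all.
    if (!globalFlagEnabled) {
      return true;
    }
    // Family not configured: blocks are simply treated as high priority.
    if (familyHotAge == null) {
      return true;
    }
    // Assumption: a block exactly at the threshold still counts as hot.
    return nowMs - blockTimestampMs <= familyHotAge.toMillis();
  }

  /** Blocks older than the threshold bypass the cache when read. */
  static boolean shouldCacheOnRead(boolean globalFlagEnabled, Duration familyHotAge,
      long blockTimestampMs, long nowMs) {
    return isHot(globalFlagEnabled, familyHotAge, blockTimestampMs, nowMs);
  }

  /**
   * Orders cached blocks so that those which have aged past the threshold
   * come first, i.e. they are the first candidates for an eviction run.
   * Assumes the global flag is enabled.
   */
  static List<String> evictionOrder(List<Block> cached, Duration familyHotAge, long nowMs) {
    List<Block> sorted = new ArrayList<>(cached);
    // Boolean.FALSE sorts before Boolean.TRUE, so non-hot blocks come first;
    // List.sort is stable, so ties keep their original relative order.
    sorted.sort(Comparator.comparing(
        (Block b) -> isHot(true, familyHotAge, b.timestampMs, nowMs)));
    List<String> names = new ArrayList<>();
    for (Block b : sorted) {
      names.add(b.name);
    }
    return names;
  }
}
```

With a one hour threshold, for example, a block whose timestamp is two hours old would skip the cache on read and, if it was cached earlier, would be listed ahead of more recent blocks by evictionOrder. The actual implementation and its configuration keys are in the PR linked at the top of this message.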