Attaching the ycsb runs mentioned previously: https://docs.google.com/document/d/1juZoND9ju4daOUkHnlcYzU3Nr7wUQGKEORRHll3KPRQ/edit?tab=t.0
On Wed, Jul 30, 2025 at 17:38, Wellington Chevreuil <wellington.chevre...@gmail.com> wrote:

> Greetings everyone!
>
> As previously shared in this email
> <https://lists.apache.org/thread/jr1cljrdct01xtqsrgp4fpb301j9h72k>, we
> have been working on this functionality at Cloudera for some time, and as
> we prepare to make it GA for our broader customer base, we thought it could
> be a nice addition to the Apache HBase distribution too.
>
> The most relevant use case for this functionality is deploying the HBase
> root dir on cloud object storage, such as S3, and relying on a file based
> bucket cache for optimal performance. For datasets where records have a
> concept of date and an access pattern based on such date values, i.e., the
> most accessed data are those with the most recent date values, time based
> priority can be configured so that only this recent data needs to be kept
> in the cache.
>
> The current Time Based Priority for BucketCache implementation allows for
> defining an "age" threshold for blocks to be kept in the BucketCache.
> Blocks "older" than this threshold bypass the BucketCache when read
> (even when cacheOnRead is enabled), and already cached blocks that age
> past the threshold are picked first by eviction runs.
>
> It has been developed in two stages:
> 1) Time Based Priority for BucketCache: the initial framework for
> extracting block age and the block priority logic in BucketCache. This
> relies on the builtin cell timestamps for determining the block age, and
> on the existing DateTieredCompaction for grouping blocks of similar age
> within the same file. The related design doc
> <https://docs.google.com/document/d/1Qd3kvZodBDxHTFCIRtoePgMbvyuUSxeydi2SEWQFQro/edit?tab=t.0#heading=h.gjdgxs>
> has been shared in the parent jira and in the discussion email mentioned
> above.
> 2) Custom Time Based BucketCache Priority: an enhancement over the initial
> development, it extends DateTieredCompaction to allow custom values to
> be used for grouping cells into separate files. Custom value providers
> can be plugged into the framework, so that user schema specific values
> can now be used for defining cache priority. The original cell timestamp
> based priority has been wrapped into a builtin provider implementation,
> and a qualifier based provider has also been defined. This second phase
> design doc
> <https://docs.google.com/document/d/1uBGIO9IQ-FbSrE5dnUMRtQS23NbCbAmRVDkAOADcU_E/edit?tab=t.0#heading=h.jxvnkznuj997>
> has also been shared in the related jira.
>
> The feature requires a global flag (disabled by default) to be turned on
> in order to even perform age checks. It also requires extra configuration
> on individual column families, as only blocks for the configured column
> families would have their age checked. Blocks from column families not
> defining any time based priority settings would simply be treated as high
> priority ones and have preference to be cached.
>
> Our suggestion is to have this merged into the master, branch-3 and
> branch-2 branches. We executed some ycsb runs to compare different setups
> for the feature (all using S3 as the root dir storage), along with a binary
> version not containing this code as a baseline on the same hardware. While
> we see relevant impacts in the scenarios where the dataset doesn't fit into
> the cache capacity, we see little deviation otherwise.
>
> Best Regards,
> Wellington
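
For anyone skimming the thread, the caching decision described in the quoted proposal (global flag, per column family threshold, old blocks bypassing the cache and being evicted first) can be sketched roughly as below. This is only an illustration under assumptions: the class, method, and parameter names are hypothetical and are not the HBase API, the threshold unit is assumed to be milliseconds, and the block's "time" stands in for whatever value the configured provider extracts.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only. Every name here is hypothetical and is NOT the
 * HBase API; it just mirrors the rules described in the proposal.
 */
final class TimeBasedPrioritySketch {

    // Global switch: when false, no age checks are performed at all.
    private final boolean featureEnabled;

    // Per column family "age" threshold, in milliseconds (assumed unit).
    private final Map<String, Long> maxAgeMillisByFamily = new HashMap<>();

    TimeBasedPrioritySketch(boolean featureEnabled) {
        this.featureEnabled = featureEnabled;
    }

    /** Configures an age threshold for one column family. */
    void setMaxAge(String columnFamily, long maxAgeMillis) {
        maxAgeMillisByFamily.put(columnFamily, maxAgeMillis);
    }

    /**
     * Returns true when the block should be treated as high priority.
     * blockTimeMillis is the block's date value (cell timestamp or a
     * custom value), nowMillis is the current time.
     */
    boolean isHot(String columnFamily, long blockTimeMillis, long nowMillis) {
        if (!featureEnabled) {
            return true; // feature off: age is never checked
        }
        Long maxAge = maxAgeMillisByFamily.get(columnFamily);
        if (maxAge == null) {
            return true; // family not configured: treated as high priority
        }
        // Only blocks strictly older than the threshold are considered cold.
        return nowMillis - blockTimeMillis <= maxAge;
    }

    /** Cold blocks bypass the cache on read, even with cache-on-read enabled. */
    boolean shouldCacheOnRead(String columnFamily, long blockTimeMillis, long nowMillis) {
        return isHot(columnFamily, blockTimeMillis, nowMillis);
    }

    /** Already cached blocks that have aged out are picked first by eviction. */
    boolean evictFirst(String columnFamily, long blockTimeMillis, long nowMillis) {
        return !isHot(columnFamily, blockTimeMillis, nowMillis);
    }
}
```

With a threshold of 1000 ms on a family named "cf" (an arbitrary example), a block dated 500 ms ago would be cached on read, one dated 2000 ms ago would bypass the cache, and blocks from any unconfigured family would always be cached.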