Attaching the YCSB runs mentioned previously:

https://docs.google.com/document/d/1juZoND9ju4daOUkHnlcYzU3Nr7wUQGKEORRHll3KPRQ/edit?tab=t.0

On Wed, Jul 30, 2025 at 17:38, Wellington Chevreuil <
wellington.chevre...@gmail.com> wrote:

> Greetings everyone! As previously shared in this email
> <https://lists.apache.org/thread/jr1cljrdct01xtqsrgp4fpb301j9h72k>, we
> have been working on this functionality at Cloudera for some time, and as
> we prepare to make it GA for our broader customer base, we thought it could
> be a nice addition to the Apache HBase distribution too.
>
> The most relevant use case for this functionality is when the HBase root
> dir is deployed on cloud object storage, such as S3, and a file-based
> bucket cache is relied on for optimal performance. For datasets where
> records have a concept of date and an access pattern based on such date
> values, i.e., the most accessed data are those with the most recent date
> value, time-based priority can be configured so that only this recent data
> needs to be kept in the cache.
>
> The current Time Based Priority for BucketCache implementation allows for
> defining an "age" threshold for blocks to be kept in the BucketCache.
> Blocks "older" than this threshold bypass the BucketCache when read (even
> when cacheOnRead is enabled), and already cached blocks that age past the
> threshold are picked first by eviction runs.
>
> It has been developed in two stages:
> 1) Time Based Priority for BucketCache: the initial framework for
> extracting block age and the block priority logic in BucketCache. This
> relies on the built-in cell timestamps for determining the block age, and
> the existing DateTieredCompaction for grouping blocks of similar age within
> the same file. The related design doc
> <https://docs.google.com/document/d/1Qd3kvZodBDxHTFCIRtoePgMbvyuUSxeydi2SEWQFQro/edit?tab=t.0#heading=h.gjdgxs>
> has been shared in the parent jira and in the discussion email mentioned
> above.
> 2) Custom Time Based BucketCache Priority: an enhancement over the initial
> development, it extends DateTieredCompaction to allow custom values to be
> used for grouping cells into separate files. Custom value providers can be
> plugged into the framework, so that user-schema-specific values can now be
> used for defining cache priority. The original cell-timestamp-based
> priority has been wrapped into a built-in provider implementation, and a
> qualifier-based provider has also been defined. This second phase design
> doc
> <https://docs.google.com/document/d/1uBGIO9IQ-FbSrE5dnUMRtQS23NbCbAmRVDkAOADcU_E/edit?tab=t.0#heading=h.jxvnkznuj997>
> has also been shared in the related jira.
>
> The feature requires a global flag (disabled by default) to be turned on
> before any age checks are performed. It also requires extra configuration
> on individual column families, as only blocks from the configured column
> families have their age checked. Blocks from column families that do not
> define any time-based priority settings are simply treated as high
> priority and have preference to be cached.
>
> Our suggestion is to have this merged into the master, branch-3 and
> branch-2 branches. We have executed some YCSB runs to compare different
> setups for the feature (all using S3 as the root dir storage), along with
> a binary version not containing this code as a baseline on the same
> hardware. While we see relevant impacts in the scenarios where the dataset
> doesn't fit into the cache capacity, we see little deviation otherwise.
>
> Best Regards,
> Wellington
>
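
For readers skimming the thread, the caching decision described in the quoted message (global flag, per-column-family threshold, block age check) can be sketched roughly as below. This is only an illustration under stated assumptions: the class name, the isHot method and its parameters are hypothetical and are not the HBase API or the actual BucketCache code.

```java
import java.util.Map;

// Hypothetical sketch of the time based priority decision; not HBase code.
class TimeBasedPrioritySketch {

    /**
     * Returns true when a block should be treated as high priority ("hot"),
     * i.e. eligible for caching on read and not preferred for eviction.
     *
     * @param featureEnabled       assumed global flag (disabled by default in the feature)
     * @param maxAgeMillisByFamily assumed per-column-family age thresholds
     * @param family               column family the block belongs to
     * @param blockTimestamp       timestamp representing the block's data
     * @param now                  current time in milliseconds
     */
    static boolean isHot(boolean featureEnabled,
                         Map<String, Long> maxAgeMillisByFamily,
                         String family,
                         long blockTimestamp,
                         long now) {
        // With the global flag off, no age checks are performed at all.
        if (!featureEnabled) {
            return true;
        }
        // Families without a time based setting are treated as high priority.
        Long maxAge = maxAgeMillisByFamily.get(family);
        if (maxAge == null) {
            return true;
        }
        // Blocks older than the threshold are "cold": they bypass the cache
        // on read and are picked first by eviction if already cached.
        return now - blockTimestamp <= maxAge;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long oneDay = 24L * 60 * 60 * 1000;
        // Assumed configuration: only "events" defines a threshold (7 days).
        Map<String, Long> thresholds = Map.of("events", 7 * oneDay);

        System.out.println(isHot(true, thresholds, "events", now - oneDay, now));
        System.out.println(isHot(true, thresholds, "events", now - 30 * oneDay, now));
        System.out.println(isHot(true, thresholds, "profile", now - 30 * oneDay, now));
    }
}
```

In this sketch the threshold is a plain age in milliseconds; how the real feature derives a block's timestamp (cell timestamps or a custom value provider) is covered in the design docs linked in the quoted message.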
