Andrew Kyle Purtell created HBASE-30061:
-------------------------------------------
Summary: EWMA-based BlockCompressedSizePredicator
Key: HBASE-30061
URL: https://issues.apache.org/jira/browse/HBASE-30061
Project: HBase
Issue Type: Improvement
Components: HFile
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
{{PreviousBlockCompressionRatePredicator}} has three algorithmic deficiencies
that cause compressed blocks to systematically undershoot the configured block
size target: integer division truncation, single-sample estimation, and no
smoothing of the estimated compression ratio.
This proposed change adds {{EWMABlockSizePredicator}}, a new implementation of
the existing {{BlockCompressedSizePredicator}} interface. It addresses all
three deficiencies by using double-precision arithmetic and by smoothing the
estimated compression ratio with an exponentially weighted moving average
(EWMA). This produces compressed HFile blocks that are closer to the
configured target block size than those produced by
{{PreviousBlockCompressionRatePredicator}}.
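To make the integer division point concrete, here is a minimal standalone sketch. It is not the HBase code: the class name, method names, and sample sizes are hypothetical, and it only assumes that the predicator scales the target block size by an uncompressed/compressed ratio.

```java
/** Hypothetical illustration only; not taken from the HBase source. */
final class RatioTruncationDemo {

  /** Uncompressed size limit when the ratio is computed with int division. */
  static long limitWithIntRatio(int targetSize, int uncompressed, int compressed) {
    int ratio = uncompressed / compressed; // fractional part is discarded
    return (long) targetSize * ratio;
  }

  /** Uncompressed size limit when the ratio keeps its fractional part. */
  static long limitWithDoubleRatio(int targetSize, int uncompressed, int compressed) {
    double ratio = (double) uncompressed / compressed;
    return Math.round(targetSize * ratio);
  }

  public static void main(String[] args) {
    int target = 64 * 1024;   // assumed target block size in bytes
    int uncompressed = 65536; // assumed size of the previous block before compression
    int compressed = 24000;   // assumed size of the previous block after compression

    // True ratio is about 2.73; int division yields 2.
    System.out.println(limitWithIntRatio(target, uncompressed, compressed));    // 131072
    System.out.println(limitWithDoubleRatio(target, uncompressed, compressed)); // 178957
  }
}
```

Under these assumed numbers, a block closed at 131072 uncompressed bytes would compress to about 48000 bytes if the ratio holds, well short of the 65536-byte target. The double-precision limit of 178957 bytes would land close to the target.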
The ratio is smoothed using a default alpha of 0.5, which adapts quickly to
changing data while dampening single-block variance. After 3 blocks at a new
ratio, the EWMA has closed 87.5% (1 - 0.5^3) of the gap between its previous
estimate and that ratio. Alpha = 0.5 is chosen because HFile blocks
within a single file tend to have similar compression ratios (same column
family, similar data distribution), and fast adaptation matters more than
long-term smoothing since predicator state is per-file.
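The smoothing described above could be sketched roughly as follows. This is not the proposed {{EWMABlockSizePredicator}} itself: the class name, method names, and the choice to seed the estimate with the first observed block are assumptions made for illustration.

```java
/** Hypothetical EWMA ratio estimator; names and seeding policy are assumptions. */
final class EwmaRatioSketch {
  private final double alpha;
  private double ratio;   // smoothed uncompressed/compressed ratio
  private boolean seeded; // false until the first block has been observed

  EwmaRatioSketch(double alpha) {
    if (!(alpha > 0.0 && alpha <= 1.0)) {
      throw new IllegalArgumentException("alpha must be in (0, 1]");
    }
    this.alpha = alpha;
  }

  /** Folds the sizes of a finished block into the smoothed ratio. */
  void update(int uncompressed, int compressed) {
    if (compressed <= 0) {
      return; // ignore degenerate samples
    }
    double sample = (double) uncompressed / compressed;
    if (!seeded) {
      ratio = sample; // assumption: the first sample seeds the estimate
      seeded = true;
    } else {
      ratio = alpha * sample + (1.0 - alpha) * ratio;
    }
  }

  double ratio() {
    return ratio;
  }

  /** Uncompressed size at which a block is expected to compress to targetSize. */
  long uncompressedLimit(int targetSize) {
    return seeded ? Math.round(targetSize * ratio) : targetSize;
  }

  public static void main(String[] args) {
    EwmaRatioSketch ewma = new EwmaRatioSketch(0.5);
    ewma.update(2000, 1000);     // seed at ratio 2.0
    for (int i = 0; i < 3; i++) {
      ewma.update(4000, 1000);   // data shifts to ratio 4.0
    }
    System.out.println(ewma.ratio()); // 3.75
  }
}
```

In the usage example the estimate moves 2.0, 3.0, 3.5, 3.75 across the three blocks at the new ratio, so it has covered 1.75 of the 2.0 gap, which is the 87.5% figure. Because the state lives in one object, creating a new instance per file matches the per-file scope described above.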
There is some prior art for this approach. Pebble (CockroachDB) uses an EWMA
estimator to predict compression ratios across blocks for adaptive compressor
selection. RocksDB implements a window-average cost predictor for CPU and IO
cost prediction of block-level compression.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)