Andrew Kyle Purtell created HBASE-30061:
-------------------------------------------

             Summary: EWMA-based BlockCompressedSizePredicator
                 Key: HBASE-30061
                 URL: https://issues.apache.org/jira/browse/HBASE-30061
             Project: HBase
          Issue Type: Improvement
          Components: HFile
            Reporter: Andrew Kyle Purtell
            Assignee: Andrew Kyle Purtell


{{PreviousBlockCompressionRatePredicator}} has three algorithmic deficiencies 
that cause compressed blocks to systematically undershoot the configured block 
size target: integer division truncation, single-sample estimation, and no 
smoothing of the estimated compression ratio.
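
To make the first deficiency concrete, here is a minimal sketch of how integer division can shrink the adjusted block size. It assumes the predictor scales the configured target by the ratio uncompressed / compressed; the class and method names are hypothetical and are not taken from the HBase source.

```java
// Hypothetical illustration only; not the HBase implementation.
final class TruncationSketch {

    // Integer ratio: 190000 / 65536 truncates from about 2.9 to 2.
    static int adjustedSizeInt(int targetBlockSize, int uncompressed, int compressed) {
        int ratio = uncompressed / compressed;
        return targetBlockSize * ratio;
    }

    // Double-precision ratio keeps the fractional part.
    static int adjustedSizeDouble(int targetBlockSize, int uncompressed, int compressed) {
        double ratio = (double) uncompressed / compressed;
        return (int) Math.round(targetBlockSize * ratio);
    }

    public static void main(String[] args) {
        int target = 65536; // assumed 64 KiB target block size
        System.out.println(adjustedSizeInt(target, 190000, 65536));    // 131072
        System.out.println(adjustedSizeDouble(target, 190000, 65536)); // 190000
    }
}
```

In this example the truncated ratio closes the block after 131072 uncompressed bytes instead of 190000, so the compressed block ends up well short of the 64 KiB target.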

This proposed change adds {{EWMABlockSizePredicator}}, a new implementation of 
the existing {{BlockCompressedSizePredicator}} interface. It addresses all 
three deficiencies by using double-precision arithmetic and an exponentially 
weighted moving average (EWMA) to smooth the estimated compression ratio. The 
result is compressed HFile blocks that land closer to the configured target 
block size than those produced by {{PreviousBlockCompressionRatePredicator}}.

The ratio is smoothed with a default alpha of 0.5, which adapts quickly to 
changing data while dampening single-block variance. With alpha = 0.5, each 
update halves the gap between the estimate and the latest observed ratio, so 
after 3 blocks at a new steady ratio the EWMA has closed 87.5% of that gap 
(1 - 0.5^3). Alpha = 0.5 is chosen because HFile blocks within a single file 
tend to have similar compression ratios (same column family, similar data 
distribution), and fast adaptation matters more than long-term smoothing 
since predicator state is per-file.
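
The smoothing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed {{EWMABlockSizePredicator}} itself: the class name, the method names, and the choice to seed the estimate with the first observed ratio are all hypothetical.

```java
// Hypothetical sketch of EWMA smoothing of a compression ratio.
final class EwmaRatioSketch {
    private final double alpha;
    private double ratio;
    private boolean seeded;

    EwmaRatioSketch(double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    // Fold one finished block into the estimate.
    void update(int uncompressed, int compressed) {
        if (compressed <= 0) {
            return; // ignore degenerate samples
        }
        double sample = (double) uncompressed / compressed;
        if (!seeded) {
            ratio = sample; // assumption: the first sample seeds the estimate
            seeded = true;
        } else {
            ratio = alpha * sample + (1.0 - alpha) * ratio;
        }
    }

    double ratio() {
        return ratio;
    }

    // Uncompressed size at which a block is expected to compress to the target.
    int adjustedBlockSize(int targetBlockSize) {
        return (int) Math.round(targetBlockSize * ratio);
    }

    public static void main(String[] args) {
        EwmaRatioSketch ewma = new EwmaRatioSketch(0.5);
        ewma.update(200, 100); // seed at ratio 2.0
        ewma.update(400, 100); // data changes to ratio 4.0
        System.out.println(ewma.ratio()); // 3.0
        ewma.update(400, 100);
        System.out.println(ewma.ratio()); // 3.5
        ewma.update(400, 100);
        System.out.println(ewma.ratio()); // 3.75
    }
}
```

Starting from 2.0 and observing 4.0 three times, the estimate reaches 3.75, which is 87.5% of the way from the old ratio to the new one.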

There is some prior art for this approach. Pebble (CockroachDB) uses an EWMA 
estimator to predict compression ratios across blocks for adaptive compressor 
selection. RocksDB implements a window-average predictor that estimates the 
CPU and IO cost of block-level compression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
