[ 
https://issues.apache.org/jira/browse/HBASE-30061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-30061:
----------------------------------------
    Description: 
{{PreviousBlockCompressionRatePredicator}} has three issues that cause 
compressed blocks to undershoot the configured block size target: integer 
division truncation, single-sample estimation, and no smoothing of the 
estimated compression ratio.
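
To illustrate the truncation issue, the following is a minimal sketch and not the actual {{PreviousBlockCompressionRatePredicator}} code. The class name, method names, and sample sizes are hypothetical. It compares an adjusted block size derived from an integer ratio with one derived from a double-precision ratio.

```java
// Hypothetical demo class; names and sample sizes are illustrative only.
class TruncationDemo {

  // Integer division drops the fractional part of the compression ratio.
  static int adjustedWithIntRatio(int targetBlockSize, int uncompressed, int compressed) {
    int ratio = uncompressed / compressed;
    return targetBlockSize * ratio;
  }

  // Double-precision division keeps the fractional part.
  static int adjustedWithDoubleRatio(int targetBlockSize, int uncompressed, int compressed) {
    double ratio = (double) uncompressed / compressed;
    return (int) (targetBlockSize * ratio);
  }

  public static void main(String[] args) {
    int target = 65536;          // configured block size (64 KB)
    int uncompressed = 100_000;  // previous block, uncompressed bytes
    int compressed = 35_000;     // previous block, compressed bytes

    // Ratio truncates from about 2.857 to 2.
    System.out.println(adjustedWithIntRatio(target, uncompressed, compressed));    // 131072
    System.out.println(adjustedWithDoubleRatio(target, uncompressed, compressed)); // 187245
  }
}
```

With these sample numbers, a block closed at 131072 uncompressed bytes would compress to roughly 45,875 bytes, about 70% of the 64 KB target, assuming the true ratio stays near 2.857.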

This proposed change adds a new {{BlockCompressedSizePredicator}} 
implementation, {{EWMABlockSizePredicator}}, that implements the existing 
{{BlockCompressedSizePredicator}} interface and addresses all three 
deficiencies with double-precision arithmetic and exponentially weighted 
moving average (EWMA) smoothing of the estimated compression ratio. This 
produces compressed HFile blocks that are closer to the configured target 
block size than {{PreviousBlockCompressionRatePredicator}}.

The ratio is smoothed using a default alpha of 0.5. This adapts quickly to 
changing data while dampening single-block variance. After a shift in the 
compression ratio, the EWMA closes 87.5% (1 - 0.5^3) of the gap to the new 
ratio within 3 blocks. Alpha = 0.5 is chosen because HFile blocks within a 
single file tend to have similar compression ratios (same column family, 
similar data distribution), and fast adaptation matters more than long-term 
smoothing since predicator state is per-file.
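
The smoothing described above can be sketched as follows. This is a minimal illustration, not the proposed {{EWMABlockSizePredicator}} itself. The class and method names are hypothetical, and seeding the estimate with the first observed ratio is an assumption.

```java
// Hypothetical sketch of EWMA smoothing of a compression ratio.
class EwmaRatioSketch {
  private final double alpha; // weight given to the newest sample
  private double ratio;       // smoothed uncompressed/compressed ratio
  private boolean seeded;     // false until the first sample arrives

  EwmaRatioSketch(double alpha) {
    this.alpha = alpha;
  }

  // Fold the sizes of the block just written into the running estimate.
  void update(int uncompressed, int compressed) {
    if (compressed <= 0) {
      return; // ignore degenerate samples
    }
    double sample = (double) uncompressed / compressed;
    if (!seeded) {
      ratio = sample; // assumption: seed with the first observation
      seeded = true;
    } else {
      ratio = alpha * sample + (1.0 - alpha) * ratio;
    }
  }

  double ratio() {
    return ratio;
  }

  // Uncompressed size at which to close a block so that its compressed
  // size lands near the configured target.
  int adjustedBlockSize(int targetBlockSize) {
    return (int) (targetBlockSize * ratio);
  }

  public static void main(String[] args) {
    EwmaRatioSketch ewma = new EwmaRatioSketch(0.5);
    ewma.update(2000, 1000); // seed at ratio 2.0
    // The data shifts to ratio 4.0; the gap to close is 2.0.
    for (int i = 0; i < 3; i++) {
      ewma.update(4000, 1000);
      System.out.println(ewma.ratio()); // 3.0, then 3.5, then 3.75
    }
    // 3.75 is 87.5% of the way from 2.0 to 4.0.
  }
}
```

A smaller alpha would dampen single-block variance more, at the cost of reacting more slowly to a genuine shift in the data.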

There is some prior art for this approach. Pebble (CockroachDB) uses an EWMA 
estimator to predict compression ratios across blocks for adaptive compressor 
selection. RocksDB implements a window-average cost predictor for CPU and IO 
cost prediction of block-level compression.

  was:
{{PreviousBlockCompressionRatePredicator}} has three algorithmic deficiencies 
that cause compressed blocks to systematically undershoot the configured block 
size target: integer division truncation, single-sample estimation, and no 
smoothing of the estimated compression ratio.

This proposed change adds a new {{BlockCompressedSizePredicator}} 
implementation, {{EWMABlockSizePredicator}}, that implements the existing 
{{BlockCompressedSizePredicator}} interface and addresses all three 
deficiencies with double-precision arithmetic and weighted moving average 
smoothed estimation of the compression ratio. This produces compressed HFile 
blocks that are closer to the configured target block size than 
{{PreviousBlockCompressionRatePredicator}}

The ratio is smoothed using a default alpha of 0.5. This adapts quickly to 
changing data while dampening single-block variance. After 3 blocks, the EWMA 
captures 87.5% of the true ratio. Alpha = 0.5 is chosen because HFile blocks 
within a single file tend to have similar compression ratios (same column 
family, similar data distribution), and fast adaptation matters more than 
long-term smoothing since predicator state is per-file.

There is some prior art for this approach. Pebble (CockroachDB) uses an EWMA 
estimator to predict compression ratios across blocks for adaptive compressor 
selection. RocksDB implements a window-average cost predictor for CPU and IO 
cost prediction of block-level compression.


> EWMA-based BlockCompressedSizePredicator
> ----------------------------------------
>
>                 Key: HBASE-30061
>                 URL: https://issues.apache.org/jira/browse/HBASE-30061
>             Project: HBase
>          Issue Type: Improvement
>          Components: HFile
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Minor
>
> {{PreviousBlockCompressionRatePredicator}} has three issues that cause 
> compressed blocks to undershoot the configured block size target: integer 
> division truncation, single-sample estimation, and no smoothing of the 
> estimated compression ratio.
> This proposed change adds a new {{BlockCompressedSizePredicator}} 
> implementation, {{EWMABlockSizePredicator}}, that implements the existing 
> {{BlockCompressedSizePredicator}} interface and addresses all three 
> deficiencies with double-precision arithmetic and exponentially weighted 
> moving average (EWMA) smoothing of the estimated compression ratio. This 
> produces compressed HFile blocks that are closer to the configured target 
> block size than {{PreviousBlockCompressionRatePredicator}}.
> The ratio is smoothed using a default alpha of 0.5. This adapts quickly to 
> changing data while dampening single-block variance. After a shift in the 
> compression ratio, the EWMA closes 87.5% (1 - 0.5^3) of the gap to the new 
> ratio within 3 blocks. Alpha = 0.5 is chosen because HFile blocks within a 
> single file tend to have similar compression ratios (same column family, 
> similar data distribution), and fast adaptation matters more than long-term 
> smoothing since predicator state is per-file.
> There is some prior art for this approach. Pebble (CockroachDB) uses an EWMA 
> estimator to predict compression ratios across blocks for adaptive compressor 
> selection. RocksDB implements a window-average cost predictor for CPU and IO 
> cost prediction of block-level compression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
