[
https://issues.apache.org/jira/browse/HBASE-30061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell updated HBASE-30061:
----------------------------------------
Description:
{{PreviousBlockCompressionRatePredicator}} has three issues that cause
compressed blocks to undershoot the configured block size target: integer
division truncation, single-sample estimation, and no smoothing of the
estimated compression ratio.
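To make the truncation issue concrete, here is a minimal sketch. It assumes the ratio is computed as an int quotient of uncompressed size over compressed size and then multiplied by the target block size. The class and method names are hypothetical and are not taken from the HBase source.

```java
// Hypothetical illustration only; names and structure are assumptions.
class RatioTruncationSketch {

    // Integer division drops the fractional part of the ratio (2.5 becomes 2).
    static int adjustedSizeInt(int targetBlockSize, int uncompressed, int compressed) {
        int ratio = uncompressed / compressed;
        return targetBlockSize * ratio;
    }

    // Double-precision division keeps the fractional part of the ratio.
    static int adjustedSizeDouble(int targetBlockSize, int uncompressed, int compressed) {
        double ratio = (double) uncompressed / compressed;
        return (int) (targetBlockSize * ratio);
    }

    public static void main(String[] args) {
        int target = 65536;          // configured block size target
        int uncompressed = 100000;   // previous block, before compression
        int compressed = 40000;      // previous block, after compression (ratio 2.5)

        System.out.println(adjustedSizeInt(target, uncompressed, compressed));    // 131072
        System.out.println(adjustedSizeDouble(target, uncompressed, compressed)); // 163840
    }
}
```

With the truncated ratio the block is closed at 131072 uncompressed bytes, which at a true ratio of 2.5 compresses to roughly 52 KB, short of the 64 KB target.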
This proposed change adds a new {{BlockCompressedSizePredicator}}
implementation, {{EWMABlockSizePredicator}}, that addresses all three
deficiencies with double-precision arithmetic and an exponentially weighted
moving average (EWMA) of the compression ratio. This produces compressed HFile
blocks that are closer to the configured target block size than those produced
by {{PreviousBlockCompressionRatePredicator}}.
The ratio is smoothed using a default alpha of 0.5. This adapts quickly to
changing data while dampening single-block variance. With alpha = 0.5, each
new block halves the remaining gap between the estimate and a steady observed
ratio, so after 3 blocks the EWMA has closed 1 - 0.5^3 = 87.5% of that gap.
Alpha = 0.5 is chosen because HFile blocks within a single file tend to have
similar compression ratios (same column family, similar data distribution),
and fast adaptation matters more than long-term smoothing since predicator
state is per-file.
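The smoothing step can be sketched as follows. This is a standalone illustration under stated assumptions, not the proposed {{EWMABlockSizePredicator}} itself. The class name, method names, and the choice to seed the average with the first observed ratio are all hypothetical.

```java
// Hypothetical sketch of EWMA smoothing of a compression ratio.
class EwmaRatioSketch {
    private final double alpha;  // weight given to the newest sample
    private double ratio;        // smoothed uncompressed/compressed ratio
    private boolean seeded;      // false until the first sample arrives

    EwmaRatioSketch(double alpha) {
        if (!(alpha > 0.0 && alpha <= 1.0)) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    // Fold the sizes of the block just written into the running estimate.
    void update(int uncompressedBytes, int compressedBytes) {
        if (compressedBytes <= 0) {
            return; // nothing usable to learn from
        }
        double sample = (double) uncompressedBytes / compressedBytes;
        if (!seeded) {
            ratio = sample; // assumption: seed with the first observation
            seeded = true;
        } else {
            ratio = alpha * sample + (1.0 - alpha) * ratio;
        }
    }

    // Falls back to 1.0 (no compression assumed) before any sample is seen.
    double ratio() {
        return seeded ? ratio : 1.0;
    }

    // Uncompressed size at which a block should compress to about the target.
    int adjustedBlockSize(int targetBlockSize) {
        return (int) (targetBlockSize * ratio());
    }

    public static void main(String[] args) {
        EwmaRatioSketch p = new EwmaRatioSketch(0.5);
        p.update(200, 100);                // first block, ratio 2.0
        for (int i = 0; i < 3; i++) {
            p.update(400, 100);            // data shifts to ratio 4.0
            System.out.println(p.ratio()); // 3.0, then 3.5, then 3.75
        }
    }
}
```

In this run the estimate moves from 2.0 to 3.75 over three blocks, covering 1.75 of the 2.0 gap to the new ratio, which is the 87.5% figure for alpha = 0.5.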
There is some prior art for this approach. Pebble (CockroachDB) uses an EWMA
estimator to predict compression ratios across blocks for adaptive compressor
selection. RocksDB implements a window-average cost predictor for CPU and IO
cost prediction of block-level compression.
was:
{{PreviousBlockCompressionRatePredicator}} has three algorithmic deficiencies
that cause compressed blocks to systematically undershoot the configured block
size target: integer division truncation, single-sample estimation, and no
smoothing of the estimated compression ratio.
This proposed change adds a new {{BlockCompressedSizePredicator}}
implementation, {{EWMABlockSizePredicator}}, that implements the existing
{{BlockCompressedSizePredicator}} interface and addresses all three
deficiencies with double-precision arithmetic and weighted moving average
smoothed estimation of the compression ratio. This produces compressed HFile
blocks that are closer to the configured target block size than
{{PreviousBlockCompressionRatePredicator}}
The ratio is smoothed using a default alpha of 0.5. This adapts quickly to
changing data while dampening single-block variance. After 3 blocks, the EWMA
captures 87.5% of the true ratio. Alpha = 0.5 is chosen because HFile blocks
within a single file tend to have similar compression ratios (same column
family, similar data distribution), and fast adaptation matters more than
long-term smoothing since predicator state is per-file.
There is some prior art for this approach. Pebble (CockroachDB) uses an EWMA
estimator to predict compression ratios across blocks for adaptive compressor
selection. RocksDB implements a window-average cost predictor for CPU and IO
cost prediction of block-level compression.
> EWMA-based BlockCompressedSizePredicator
> ----------------------------------------
>
> Key: HBASE-30061
> URL: https://issues.apache.org/jira/browse/HBASE-30061
> Project: HBase
> Issue Type: Improvement
> Components: HFile
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Minor
--
This message was sent by Atlassian Jira
(v8.20.10#820010)