Wellington Chevreuil created HBASE-27386:
--------------------------------------------
Summary: Use encoded size for calculating compression ratio in block size predicator
Key: HBASE-27386
URL: https://issues.apache.org/jira/browse/HBASE-27386
Project: HBase
Issue Type: Bug
Reporter: Wellington Chevreuil
Assignee: Wellington Chevreuil
In HBASE-27264 we introduced the notion of block size predicators to define
hfile block boundaries when writing a new hfile, and provided the
PreviousBlockCompressionRatePredicator implementation for calculating block
sizes based on a compression ratio. That implementation used the raw data size
written to the block so far to calculate the compression ratio. When encoding
is enabled, however, the raw size also includes the savings from encoding, so
the computed ratio can be very high and the resulting blocks much larger than
intended. We should instead use the encoded size to calculate the compression
ratio.
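To make the problem concrete, below is a minimal, self-contained sketch of a
ratio-based boundary check (names and signatures are illustrative only and do
not mirror the actual HBASE-27264 classes). The key point is which size feeds
the ratio: with encoding enabled, the "before compression" size should be the
encoded size, not the raw cell bytes.
{code:java}
// Minimal sketch of a compression-rate based block boundary check.
// Illustrative only: names and signatures do not mirror the real
// PreviousBlockCompressionRatePredicator API.
public class CompressionRatePredicatorSketch {

  private final int configuredBlockSize; // configured target block size
  private int adjustedBlockSize;         // target scaled by the observed ratio
  private int compressionRatio = 1;      // ratio observed on the previous block

  public CompressionRatePredicatorSketch(int configuredBlockSize) {
    this.configuredBlockSize = configuredBlockSize;
    this.adjustedBlockSize = configuredBlockSize;
  }

  /**
   * Called when a block is finished. With encoding enabled,
   * sizeBeforeCompression must be the ENCODED size: passing the raw cell
   * bytes inflates the ratio with the encoding savings as well, which is
   * exactly the problem described above.
   */
  public void updateLatestBlockSizes(int sizeBeforeCompression, int onDiskSize) {
    compressionRatio = Math.max(1, sizeBeforeCompression / onDiskSize);
    adjustedBlockSize = configuredBlockSize * compressionRatio;
  }

  /** Decide whether the block currently being written should be closed. */
  public boolean shouldFinishBlock(int sizeWrittenSoFar) {
    return sizeWrittenSoFar >= adjustedBlockSize;
  }
}
{code}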
Here's an example scenario:
1) Sample block size when not using the PreviousBlockCompressionRatePredicator
as implemented by HBASE-27264:
{noformat}
onDiskSizeWithoutHeader=6613, uncompressedSizeWithoutHeader=32928
{noformat}
2) Sample block size when using PreviousBlockCompressionRatePredicator as
implemented by HBASE-27264 (which uses the raw data size to calculate the
compression rate):
{noformat}
onDiskSizeWithoutHeader=126920, uncompressedSizeWithoutHeader=655393
{noformat}
3) Sample block size when using PreviousBlockCompressionRatePredicator with
the encoded size used to calculate the compression rate:
{noformat}
onDiskSizeWithoutHeader=54299, uncompressedSizeWithoutHeader=328051
{noformat}
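For reference, comparing the on-disk sizes above: with the raw-size ratio in
(2), blocks come out at ~127KB on disk, roughly 19x the ~6.6KB blocks in (1);
computing the ratio from the encoded size in (3) brings that down to ~54KB,
less than half the size of (2).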