[ 
https://issues.apache.org/jira/browse/HBASE-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570826#comment-17570826
 ] 

Wellington Chevreuil commented on HBASE-27232:
----------------------------------------------

{quote}how could this unified.encoded.blocksize actually reduce the number of 
blocks? Doesn't it result in smaller blocks and thus more blocks?
{quote}
Without it, we always consider only the raw size (before encoding) when 
deciding a block boundary. For example, in one case we saw a customer data set 
shrink to roughly a third of its raw size when using FAST_DIFF encoding. So 
with the current logic we were writing ~20KB blocks instead of 64KB ones: we 
had read 64KB of raw data and decided to close the block, but its encoded size 
was only 20KB. Checking the encoded size instead lets each block fill up to the 
configured size, so the same data ends up in fewer, larger blocks.
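
To make the effect on the block count concrete, here is a rough worked example in Java. The 1 GiB data set, the class name and the helper method are assumptions made up for illustration; only the 64KB and ~20KB block sizes come from the case described above. It also assumes the encoded size limit equals the configured 64KB block size.

```java
// Illustrative arithmetic only; nothing here is taken from the HBase code base.
final class BlockCountSketch {
    /** Number of blocks needed to hold totalBytes when each block holds blockBytes (rounded up). */
    static long blockCount(long totalBytes, long blockBytes) {
        return (totalBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long rawTotal = 1L << 30;                 // hypothetical 1 GiB of raw cell data
        long encodedTotal = rawTotal * 20 / 64;   // 64KB of raw data encodes to ~20KB

        // Boundary decided on raw size: every 64KB of raw data closes a ~20KB encoded block.
        long blocksByRawSize = blockCount(encodedTotal, 20L * 1024);
        // Boundary decided on encoded size: blocks are closed once they hold 64KB encoded.
        long blocksByEncodedSize = blockCount(encodedTotal, 64L * 1024);

        System.out.println("blocks when checking raw size:     " + blocksByRawSize);
        System.out.println("blocks when checking encoded size: " + blocksByEncodedSize);
    }
}
```

Under these assumptions the encoded-size check produces fewer blocks, each about as large on disk as the configured block size.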

> Fix checking for encoded block size when deciding if block should be closed
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-27232
>                 URL: https://issues.apache.org/jira/browse/HBASE-27232
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Major
>             Fix For: 3.0.0-alpha-4
>
>
> In HFileWriterImpl.checkBlockBoundary, we used to consider the unencoded and 
> uncompressed data size when deciding to close a block and start a new one. 
> That could lead to varying "on-disk" block sizes, depending on the encoding 
> efficiency for the cells in each block.
> HBASE-17757 introduced the hbase.writer.unified.encoded.blocksize.ratio 
> property, a ratio of the originally configured block size, to be compared 
> against the encoded size. This was an attempt to ensure homogeneous block 
> sizes. However, the check introduced by HBASE-17757 also considers the 
> unencoded size, so in cases where encoding efficiency is higher than 
> what's configured in hbase.writer.unified.encoded.blocksize.ratio, it 
> still leads to varying block sizes.
> This patch changes that logic to consider only the encoded size if the 
> hbase.writer.unified.encoded.blocksize.ratio property is set; otherwise, it 
> considers the unencoded size. This gives finer control over the on-disk 
> block sizes and the overall number of blocks when encoding is in use.
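
The difference between the two checks described above can be sketched as follows. This is a minimal illustration, not the actual HFileWriterImpl code: the class, method and parameter names are hypothetical, and a ratio of 0 or less is assumed to stand for "property not set".

```java
// Hypothetical sketch; names are illustrative and not taken from the HBase source.
final class BlockBoundarySketch {

    /**
     * Check as the description characterizes it after HBASE-17757: the block is
     * closed when either the encoded or the unencoded size reaches its limit.
     * Assumes ratio > 0.
     */
    static boolean shouldCloseBefore(long rawSize, long encodedSize, long blockSize, double ratio) {
        long encodedLimit = (long) (blockSize * ratio);
        return encodedSize >= encodedLimit || rawSize >= blockSize;
    }

    /**
     * Check as the description characterizes it after this patch: only the encoded
     * size matters when the ratio is set, otherwise only the unencoded size.
     */
    static boolean shouldCloseAfter(long rawSize, long encodedSize, long blockSize, double ratio) {
        if (ratio > 0) {
            return encodedSize >= (long) (blockSize * ratio);
        }
        return rawSize >= blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024;
        // 64KB of raw data that encodes down to 20KB, with the ratio set to 1.0.
        System.out.println("before: " + shouldCloseBefore(blockSize, 20L * 1024, blockSize, 1.0));
        System.out.println("after:  " + shouldCloseAfter(blockSize, 20L * 1024, blockSize, 1.0));
    }
}
```

With 64KB of raw data encoded to 20KB and a ratio of 1.0, the first check closes the block because the unencoded size has reached the block size, while the second keeps writing until the encoded size itself reaches 64KB.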



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
