JinHyuk Kim created HBASE-30174:
-----------------------------------
Summary: Add start offset option to ROWPREFIX_FIXED_LENGTH bloom filter
Key: HBASE-30174
URL: https://issues.apache.org/jira/browse/HBASE-30174
Project: HBase
Issue Type: Task
Components: master
Reporter: JinHyuk Kim
Assignee: JinHyuk Kim
h2. Problem
The {{ROWPREFIX_FIXED_LENGTH}} bloom filter always hashes the prefix starting
from the beginning of the row key. This works well in many cases, but there are
also schemas where the leading bytes contain low-value or repetitive data such
as a fixed salt or bucket id.
For example, row keys like:
{code:java}
{salt}:{id1}:{id2}
{code}
may benefit more from building the bloom filter on {{id1}} rather than the
leading salt bytes.
In those cases, hashing from offset 0 makes the bloom filter less effective,
because part of the fixed-length prefix is spent on bytes that do not help
distinguish which HFiles may contain a given key.
h2. Suggestion
Introduce a new optional configuration:
{code:java}
RowPrefixBloomFilter.prefix_start_offset
{code}
This allows the bloom filter to skip a configurable number of leading bytes
before extracting the fixed-length prefix used for hashing. The default is 0,
which keeps the current behavior of hashing from the start of the row key.
The goal is to support rowkey layouts where the meaningful lookup prefix does
not start at byte {{0}}.
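To make the intended effect concrete, here is a minimal, self-contained sketch of the prefix extraction with a start offset. It is not the actual patch: the class {{PrefixOffsetSketch}} and the methods {{extractPrefix}} and {{prefixAsString}} are hypothetical names, the sample key assumes a 4-byte salt with no separator, and returning {{null}} for keys shorter than {{offset + length}} is an assumption of this sketch (the real implementation may handle short rows differently).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not HBase code.
public class PrefixOffsetSketch {

  /**
   * Returns the bytes [offset, offset + length) of the row key.
   * Assumption of this sketch: a key too short to contain the full
   * window yields null, meaning "no bloom key for this row".
   */
  public static byte[] extractPrefix(byte[] row, int offset, int length) {
    if (offset < 0 || length <= 0) {
      throw new IllegalArgumentException("offset must be >= 0 and length > 0");
    }
    // Compare via subtraction to avoid int overflow in offset + length.
    if (row.length < offset || row.length - offset < length) {
      return null;
    }
    return Arrays.copyOfRange(row, offset, offset + length);
  }

  /** Convenience wrapper for readable examples with ASCII row keys. */
  public static String prefixAsString(String row, int offset, int length) {
    byte[] prefix = extractPrefix(row.getBytes(StandardCharsets.UTF_8), offset, length);
    return prefix == null ? null : new String(prefix, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Assumed layout: 4-byte salt "0007", then id1 "user0001", then id2.
    String row = "0007user0001order42";

    // Offset 0: half of the 8-byte prefix is spent on the salt.
    System.out.println(prefixAsString(row, 0, 8)); // 0007user

    // Offset 4: the whole 8-byte prefix covers id1.
    System.out.println(prefixAsString(row, 4, 8)); // user0001
  }
}
```

With {{prefix_length}} set to 8, moving the start offset from 0 to 4 changes the hashed bytes from {{0007user}} to {{user0001}}, so the bloom key reflects {{id1}} rather than the salt.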
h2. Usage
{code:java}
create 'test', {
  NAME => 'cf',
  BLOOMFILTER => 'ROWPREFIX_FIXED_LENGTH',
  CONFIGURATION => {
    'RowPrefixBloomFilter.prefix_length' => '8',
    'RowPrefixBloomFilter.prefix_start_offset' => '4'
  }
}
{code}