Akanksha-kedia commented on PR #18898:
URL: https://github.com/apache/pinot/pull/18898#issuecomment-4861981518
## Review: Rebuild bloom filter on fpp change
The overall approach is sound and consistent with how `RangeIndexHandler`
detects version changes. **3 MAJOR issues** need to be addressed before
merging, including one that causes an infinite rebuild loop.
---
### CRITICAL — Formula integer truncation causes infinite rebuild loop
`BloomFilterHandler.java`, `computeExpectedNumHashFunctions()`
The PR computes `k` via an intermediate `(long)` truncation of
`optimalNumOfBits`, but Guava's `optimalNumOfHashFunctions` computes `k`
directly from `fpp` only:
```java
// Guava BloomFilter.java
static int optimalNumOfHashFunctions(double p) {
return max(1, (int) Math.round(-Math.log(p) / LOG_TWO));
}
```
The integer truncation causes mismatches at small cardinalities:
`cardinality=1, fpp=0.01` → Guava writes `k=7`, PR computes expected `k=6` →
every reload triggers rebuild → **infinite rebuild loop**.
Fix — use Guava's direct formula:
```java
return Math.max(1, (int) Math.round(-Math.log(fpp) / Math.log(2)));
```
---
### MAJOR — `updateIndices` duplicates ~20 lines from `isFppChanged` instead
of calling the helper
`BloomFilterHandler.java`, lines 158–183. `RangeIndexHandler` passes
`segmentWriter` (a `Writer extends Reader`) directly to the helper that accepts
a `Reader` — same pattern should be used here to avoid duplicated logic that
can silently diverge.
---
### MAJOR — Version guard missing before reading at byte offset 9
Without first verifying `VERSION == OnHeapGuavaBloomFilterCreator.VERSION`,
a future format change silently reads garbage. Add:
```java
int version = dataBuffer.getInt(4);
if (version != OnHeapGuavaBloomFilterCreator.VERSION) {
LOGGER.warn("Unexpected bloom filter version {} for segment/column
{}/{}", version, segmentName, column);
return false;
}
```
---
### MAJOR — Test does not exercise the formula bug; needs a
small-cardinality case
`SegmentPreProcessorTest.java`, `testBloomFilterFppUpdate` uses `column3`
which has high cardinality. The truncation mismatch only manifests for
`cardinality ∈ {1, 3}` with `fpp=0.01`, so the test passes today despite the
bug. Please add a test with a cardinality-1 column verifying: (1) unchanged fpp
→ no rebuild, (2) changed fpp → rebuild triggered, (3) after rebuild → no
rebuild (idempotency).
---
### MINOR — `maxSizeInBytes > 0` path not covered by the new test
New test uses `BloomFilterConfig(0.1, 0, false)` (no size cap). A test that
changes `maxSizeInBytes` such that the effective `fpp` and `k` change would
cover the `GuavaBloomFilterReaderUtils.computeFPP` branch.
---
### Byte-offset correctness (verified — no issue)
`NUM_HASH_FUNCTIONS_OFFSET = 9` is correct: 4 bytes `TYPE_VALUE` + 4 bytes
`VERSION` + 1 byte Guava strategy ordinal + 1 byte `numHashFunctions`.
Consistent with `BloomFilterReaderFactory.HEADER_SIZE=8` and
`BaseGuavaBloomFilterReader.NUM_HASH_FUNCTIONS_OFFSET=1`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]