Akanksha-kedia commented on PR #18898:
URL: https://github.com/apache/pinot/pull/18898#issuecomment-4861981518

   ## Review: Rebuild bloom filter on fpp change
   
   The overall approach is sound and consistent with how `RangeIndexHandler` 
detects version changes. **3 MAJOR issues** need to be addressed before 
merging, including one that causes an infinite rebuild loop.
   
   ---
   
   ### CRITICAL — Formula integer truncation causes infinite rebuild loop
   
   `BloomFilterHandler.java`, `computeExpectedNumHashFunctions()`
   
   The PR computes `k` via an intermediate `(long)` truncation of 
`optimalNumOfBits`, but Guava's `optimalNumOfHashFunctions` computes `k` 
directly from `fpp` only:
   ```java
   // Guava BloomFilter.java
   static int optimalNumOfHashFunctions(double p) {
       return max(1, (int) Math.round(-Math.log(p) / LOG_TWO));
   }
   ```
   The integer truncation causes mismatches at small cardinalities: 
`cardinality=1, fpp=0.01` → Guava writes `k=7`, PR computes expected `k=6` → 
every reload triggers rebuild → **infinite rebuild loop**.
   
   Fix — use Guava's direct formula:
   ```java
   return Math.max(1, (int) Math.round(-Math.log(fpp) / Math.log(2)));
   ```
   
   ---
   
   ### MAJOR — `updateIndices` duplicates ~20 lines from `isFppChanged` instead 
of calling the helper
   
   `BloomFilterHandler.java`, lines 158–183. `RangeIndexHandler` passes 
`segmentWriter` (a `Writer extends Reader`) directly to the helper that accepts 
a `Reader` — same pattern should be used here to avoid duplicated logic that 
can silently diverge.
   
   ---
   
   ### MAJOR — Version guard missing before reading at byte offset 9
   
   Without first verifying `VERSION == OnHeapGuavaBloomFilterCreator.VERSION`, 
a future format change silently reads garbage. Add:
   ```java
   int version = dataBuffer.getInt(4);
   if (version != OnHeapGuavaBloomFilterCreator.VERSION) {
       LOGGER.warn("Unexpected bloom filter version {} for segment/column 
{}/{}", version, segmentName, column);
       return false;
   }
   ```
   
   ---
   
   ### MAJOR — Test does not exercise the formula bug; needs a 
small-cardinality case
   
   `SegmentPreProcessorTest.java`, `testBloomFilterFppUpdate` uses `column3` 
which has high cardinality. The truncation mismatch only manifests for 
`cardinality ∈ {1, 3}` with `fpp=0.01`, so the test passes today despite the 
bug. Please add a test with a cardinality-1 column verifying: (1) unchanged fpp 
→ no rebuild, (2) changed fpp → rebuild triggered, (3) after rebuild → no 
rebuild (idempotency).
   
   ---
   
   ### MINOR — `maxSizeInBytes > 0` path not covered by the new test
   
   New test uses `BloomFilterConfig(0.1, 0, false)` (no size cap). A test that 
changes `maxSizeInBytes` such that the effective `fpp` and `k` change would 
cover the `GuavaBloomFilterReaderUtils.computeFPP` branch.
   
   ---
   
   ### Byte-offset correctness (verified — no issue)
   
   `NUM_HASH_FUNCTIONS_OFFSET = 9` is correct: 4 bytes `TYPE_VALUE` + 4 bytes 
`VERSION` + 1 byte Guava strategy ordinal + 1 byte `numHashFunctions`. 
Consistent with `BloomFilterReaderFactory.HEADER_SIZE=8` and 
`BaseGuavaBloomFilterReader.NUM_HASH_FUNCTIONS_OFFSET=1`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to