ArnavBalyan opened a new pull request, #47:
URL: https://github.com/apache/paimon-mosaic/pull/47
- Add bloom filter support. Today, all scan and point-lookup pushdowns on the
reader force a row group scan.
- Add a bloom filter to each column chunk, which can be used to check whether
a value may be present in that column chunk.
- There are 2 main additions:
1) Each column chunk gets its own bloom filter
2) Each row group gets additional metadata on the location of the bloom
filter.
- Benchmark: NYC Taxi dataset
(https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data)
- File size: No bloom: 51.10 MiB, With bloom: 56.85 MiB
- A small benchmark was created to measure the bytes skipped on a sample query:
- Query I/O without bloom: 51.10 MiB
- Query I/O with bloom: 3.19 KB (roughly 16,000x reduction for an absent-value
lookup)
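
For reference, the figures above can be cross-checked with a few lines of arithmetic. This only restates the reported numbers; it assumes the "KB" in the query I/O figure means 1024 bytes, which the benchmark does not state.

```python
MIB = 1024 * 1024
KB = 1024  # assumption: "KB" is read as 1024 bytes

# Reported benchmark figures
size_no_bloom = 51.10 * MIB
size_with_bloom = 56.85 * MIB
io_no_bloom = 51.10 * MIB
io_with_bloom = 3.19 * KB

# Storage cost of the bloom filters
overhead_mib = (size_with_bloom - size_no_bloom) / MIB
overhead_pct = 100 * (size_with_bloom - size_no_bloom) / size_no_bloom

# I/O saved on the absent-value lookup
reduction = io_no_bloom / io_with_bloom

print(f"overhead: {overhead_mib:.2f} MiB ({overhead_pct:.1f}%)")
print(f"I/O reduction: {reduction:,.0f}x")
```

Under that assumption the filters add 5.75 MiB (about 11%) to the file, and the I/O ratio comes out a little above 16,000x.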
Bloom filter structure:
<img width="1404" height="1075" alt="image"
src="https://github.com/user-attachments/assets/ab5181d2-15c2-4ae1-8c34-d393eb95d78a"
/>
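
As a rough illustration of how a per-column-chunk filter answers a point lookup, here is a minimal bloom filter sketch. It is not the implementation in this PR: the class name `BloomFilter`, the SHA-256 based double hashing, and the sizes used are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Hypothetical, simplified bloom filter for illustration only."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: bytes):
        # Derive num_hashes bit positions from two 64-bit halves of one digest
        # (double hashing); forcing h2 odd keeps the step non-zero.
        digest = hashlib.sha256(value).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


# Build a filter over the values of one (made-up) column chunk.
bf = BloomFilter(num_bits=1 << 16, num_hashes=4)
for v in [b"alpha", b"beta", b"gamma"]:
    bf.add(v)

print(bf.might_contain(b"beta"))   # added values always report True
print(bf.might_contain(b"delta"))  # absent values usually report False
```

A `False` answer is definitive, so the reader can skip the column chunk. A `True` answer only means the chunk still has to be read and checked.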
<br>
<br>
<br>
<br>
Changes in the file format:
<img width="2040" height="1777" alt="image"
src="https://github.com/user-attachments/assets/de557724-381b-483e-88ee-8f92b0a4df2b"
/>
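
To show how the location metadata can be used, the sketch below writes each row group's column data followed by its filter bytes, records both locations, and then answers a lookup by reading the filter first. Everything here is hypothetical (the `ColumnChunkMeta` fields, the newline-separated value encoding, the fixed 1024-bit filter); the actual layout is the one in the diagram above.

```python
import hashlib
import io
from dataclasses import dataclass

NUM_BITS = 1024   # hypothetical fixed filter size
NUM_HASHES = 4


def bit_positions(value: bytes):
    # Compact stand-in for the filter's hash functions.
    digest = hashlib.sha256(value).digest()
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:16], "little") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def build_filter(values) -> bytes:
    bits = bytearray(NUM_BITS // 8)
    for v in values:
        for p in bit_positions(v):
            bits[p // 8] |= 1 << (p % 8)
    return bytes(bits)


def might_contain(bits: bytes, value: bytes) -> bool:
    return all(bits[p // 8] & (1 << (p % 8)) for p in bit_positions(value))


@dataclass
class ColumnChunkMeta:
    """Hypothetical metadata: where the data and its filter live."""
    data_offset: int
    data_length: int
    bloom_offset: int
    bloom_length: int


def write_row_group(out, values) -> ColumnChunkMeta:
    data = b"\n".join(values)
    data_offset = out.tell()
    out.write(data)
    bloom = build_filter(values)
    bloom_offset = out.tell()
    out.write(bloom)
    return ColumnChunkMeta(data_offset, len(data), bloom_offset, len(bloom))


def lookup(f, metas, value: bytes):
    """Return (indexes of row groups containing value, bytes read)."""
    hits, bytes_read = [], 0
    for idx, m in enumerate(metas):
        f.seek(m.bloom_offset)
        bloom = f.read(m.bloom_length)
        bytes_read += len(bloom)
        if not might_contain(bloom, value):
            continue  # definitely absent: skip this row group's data
        f.seek(m.data_offset)
        data = f.read(m.data_length)
        bytes_read += len(data)
        if value in data.split(b"\n"):  # confirm, filters can false-positive
            hits.append(idx)
    return hits, bytes_read


buf = io.BytesIO()
metas = [
    write_row_group(buf, [b"a", b"b"]),
    write_row_group(buf, [b"c", b"d"]),
]
hits, bytes_read = lookup(buf, metas, b"c")  # hits == [1]
```

For a value that is in no row group, the lookup reads at least the filter bytes of every row group and reads column data only where a filter reports a false positive.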