JunRuiLee opened a new issue, #8090: URL: https://github.com/apache/paimon/issues/8090
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.

### Motivation

A table with `file-index.bloom-filter.columns` lets readers skip data files that cannot contain a queried value, turning a point lookup from a full scan into a few file reads. Flink, Spark and Trino all do this; PyPaimon does not. It reads every data file even though the indexes already exist in the table: PyPaimon loads each file's index bytes from the manifest but never decodes them.

This is the gap that hurts PyPaimon's main use case: upstream ETL writes indexed tables, and downstream Python (Daft / Ray / PyTorch / DuckDB) reads them for analytics and point lookups, paying a full scan for filters the index could have answered.

The change is read-only and format-compatible: no write-path or public-API change.

### Solution

## Scope

Add a skipping stage in `FileScanner._filter_manifest_entry` (`paimon-python/pypaimon/read/scanner/file_scanner.py`), right after the existing min/max simple-stats skip, gated on `entry.file.embedded_index is not None`: parse the index, map each leaf predicate to its column blob, and drop files that cannot match.

## Subtasks

- [ ] **PR 1: bloom-filter pushdown (file-level).** `FileIndexFormat` reader (parse magic / version / head table, slice column blobs), `file-index.read.enabled` option, `FastHash` + `BloomFilter64` port (`=` / `IN`, no RoaringBitmap), predicate→index evaluator mirroring Java `FileIndexPredicate`/`FileIndexResult`, wired into `_filter_manifest_entry`. E2E test: a table written with bloom columns, read by PyPaimon, identical results with/without pushdown.
- [ ] **PR 2: bitmap reader (row-level).** RoaringBitmap32 + row selection; adds `!=` / `isNull`.
- [ ] **PR 3: range-bitmap reader (row-level).** Range ops + TopN.
- [ ] **PR 4: external `.index` files.** For indexes over `file-index.in-manifest-threshold` (default 500B); earlier PRs cover only the embedded path.
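
For illustration, the file-level decision rule behind PR 1 can be sketched roughly as below. This is a minimal sketch, not PyPaimon code: `ToyBloomFilter`, `ToyFile`, `bloom_of` and `keep_file` are hypothetical names, and the filter uses SHA-256 double hashing instead of Paimon's `FastHash` / `BloomFilter64` byte layout, so it cannot read real index blobs. It only shows the intended behaviour: a file is dropped only when an index exists for the predicate's column and rules out every literal of an `=` / `IN` predicate.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


class ToyBloomFilter:
    """Illustrative bloom filter; NOT Paimon's BloomFilter64 layout or FastHash."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value) -> Iterable[int]:
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


@dataclass
class ToyFile:
    """Hypothetical stand-in for a manifest entry's data file."""
    name: str
    column_index: Optional[Dict[str, ToyBloomFilter]]  # None: no embedded index


def keep_file(f: ToyFile, column: str, literals: List) -> bool:
    """Evaluate `column IN literals` (a single literal models `=`).

    Returns False only when the index proves no literal can be in the file.
    """
    if f.column_index is None:
        return True  # no index at all: cannot skip
    bloom = f.column_index.get(column)
    if bloom is None:
        return True  # column not indexed: cannot skip
    return any(bloom.might_contain(v) for v in literals)


def bloom_of(values) -> ToyBloomFilter:
    bf = ToyBloomFilter()
    for v in values:
        bf.add(v)
    return bf


files = [
    ToyFile("data-a", {"id": bloom_of([1, 2, 3])}),
    ToyFile("data-b", {"id": bloom_of([10, 11])}),
    ToyFile("data-c", None),
]
# Point lookup `id = 11`: data-b and data-c are always kept;
# data-a is dropped unless the filter reports a false positive.
kept = [f.name for f in files if keep_file(f, "id", [11])]
print(kept)
```

Because a bloom filter has no false negatives, skipping this way can never change query results; it can only reduce the number of files read, which is what the E2E test in PR 1 is meant to confirm.
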

## Non-goals

- Writer-side index generation; this is read-only.
- Global index (B-Tree / Lumina / Tantivy): separate format, already partially supported.

## References

`paimon-common/.../fileindex/`: `FileIndexFormat`, `FileIndexPredicate`, `bloomfilter/{BloomFilterFileIndex,FastHash}`; options in `paimon-api/.../CoreOptions.java`.

### Anything else?

_No response_

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
