JunRuiLee opened a new issue, #8090:
URL: https://github.com/apache/paimon/issues/8090

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Motivation
   
   A table with `file-index.bloom-filter.columns` lets readers skip data files 
that cannot contain a queried value, turning a point lookup from a full scan 
into a few file reads. Flink, Spark and Trino all do this; PyPaimon does not — 
it reads every data file even though the indexes already exist in the table.
   
   PyPaimon loads each file's index bytes from the manifest but never decodes 
them. This is the gap that hurts PyPaimon's main use case: upstream ETL writes 
indexed tables, and downstream Python (Daft / Ray / PyTorch / DuckDB) reads 
them for analytics and point lookups, paying a full scan for filters the index 
could have answered.
   
   The change is read-only and format-compatible: it modifies neither the write path nor the public API.
   
   
   ### Solution
   
   
   ## Scope
   
   Add a skipping stage in `FileScanner._filter_manifest_entry` 
(`paimon-python/pypaimon/read/scanner/file_scanner.py`), right after the 
existing min/max simple-stats skip, gated on `entry.file.embedded_index is not 
None`: parse the index, map each leaf predicate to its column blob, and drop 
files that cannot match.
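
The skipping stage described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual PyPaimon code: `LeafPredicate`, `DataFile`, `file_may_match` and `toy_parse_index` are hypothetical names, and the "index" here is an exact value set standing in for a decoded bloom filter. The rule it shows is that a file is dropped only when an index proves that some AND-ed leaf predicate cannot match. Files without an embedded index, columns without an index, and unsupported operators all keep the file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class LeafPredicate:
    """Hypothetical leaf predicate: column, operator and literal values."""
    column: str
    op: str          # only "=" and "IN" are handled in this sketch
    literals: list


@dataclass
class DataFile:
    """Hypothetical stand-in for the file metadata of a manifest entry."""
    file_name: str
    embedded_index: Optional[bytes]


# Hypothetical decoded index: column name -> "might this file contain value?"
ColumnIndexes = Dict[str, Callable[[object], bool]]


def file_may_match(
    file: DataFile,
    predicates: List[LeafPredicate],
    parse_index: Callable[[bytes], ColumnIndexes],
) -> bool:
    """Return False only when the index proves the file cannot match."""
    if file.embedded_index is None:
        return True  # gate: nothing to evaluate, keep the file
    indexes = parse_index(file.embedded_index)
    for pred in predicates:  # leaf predicates are treated as AND-ed
        might_contain = indexes.get(pred.column)
        if might_contain is None or pred.op not in ("=", "IN"):
            continue  # no index or unsupported operator: cannot skip
        if not any(might_contain(v) for v in pred.literals):
            return False  # no literal can be in this file: skip it
    return True


def toy_parse_index(blob: bytes) -> ColumnIndexes:
    """Toy 'format' (an assumption): comma-separated values of column id."""
    values = set(blob.decode().split(","))
    return {"id": lambda v: str(v) in values}


files = [
    DataFile("a.parquet", b"1,2,3"),
    DataFile("b.parquet", b"7,8"),
    DataFile("c.parquet", None),  # no embedded index, always kept
]
point_lookup = [LeafPredicate("id", "=", [7])]
kept = [f.file_name for f in files if file_may_match(f, point_lookup, toy_parse_index)]
print(kept)  # ['b.parquet', 'c.parquet']
```

Keeping the file whenever the index has no opinion is what makes the stage safe to add after the min/max skip: it can only remove files, never change query results.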
   
   ## Subtasks
   
   - [ ] **PR 1 — bloom-filter pushdown (file-level).** `FileIndexFormat` 
reader (parse magic / version / head table, slice column blobs), 
`file-index.read.enabled` option, `FastHash` + `BloomFilter64` port (`=` / 
`IN`, no RoaringBitmap), predicate→index evaluator mirroring Java 
`FileIndexPredicate`/`FileIndexResult`, wired into `_filter_manifest_entry`. 
E2E test: a table written with bloom columns, read by PyPaimon, identical 
results with/without pushdown.
   - [ ] **PR 2 — bitmap reader (row-level).** RoaringBitmap32 + row selection; 
adds `!=` / `isNull`.
   - [ ] **PR 3 — range-bitmap reader (row-level).** Range ops + TopN.
   - [ ] **PR 4 — external `.index` files.** For indexes over 
`file-index.in-manifest-threshold` (default 500 bytes); PRs 1 to 3 cover only the 
embedded path.
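
To illustrate how PR 1's `=` / `IN` evaluation against a bloom filter could behave, here is a minimal sketch. It is deliberately not a port of Paimon's `FastHash` or `BloomFilter64`: `ToyBloomFilter` and `evaluate` are hypothetical names, the hashing uses SHA-256 from the standard library, and the bit layout is not compatible with the on-disk format. It only shows the semantics: a bloom filter can prove absence but never presence, so `=` skips a file when its single literal is absent, `IN` skips only when every literal is absent, and any other operator never skips.

```python
import hashlib


class ToyBloomFilter:
    """Toy bloom filter; NOT byte-compatible with Paimon's BloomFilter64."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive num_hashes bit positions from two 64-bit
        # halves of one digest. The real port would use FastHash instead.
        digest = hashlib.sha256(repr(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def evaluate(bloom: ToyBloomFilter, op: str, literals: list) -> bool:
    """Return True to keep the file, False when it is safe to skip."""
    if op == "=":
        return bloom.might_contain(literals[0])
    if op == "IN":
        return any(bloom.might_contain(v) for v in literals)
    return True  # operators a bloom filter cannot answer never skip


bloom = ToyBloomFilter(num_bits=4096, num_hashes=4)
for user_id in range(100):
    bloom.add(user_id)

print(evaluate(bloom, "=", [42]))                   # True: 42 was added
print(evaluate(bloom, "IN", [-1, 42]))              # True: one literal may match
print(evaluate(ToyBloomFilter(64, 3), "=", [42]))   # False: empty filter
```

Because false positives are possible, a `True` result only means the file must still be read. This is why the E2E test compares results with and without pushdown rather than counting skipped files alone.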
   
   ## Non-goals
   
   - Writer-side index generation; this is read-only.
   - Global index (B-Tree / Lumina / Tantivy) — separate format, already 
partially supported.
   
   ## References
   
   `paimon-common/.../fileindex/`: `FileIndexFormat`, `FileIndexPredicate`, 
`bloomfilter/{BloomFilterFileIndex,FastHash}`; options in 
`paimon-api/.../CoreOptions.java`.
   
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!

