mbutrovich commented on issue #3978: URL: https://github.com/apache/datafusion-comet/issues/3978#issuecomment-4635695243
Benchmarking the DataFusion 54 branch against main, q88 regresses roughly 12s per iteration (about 21s to 34s). The regression is entirely scan-side, not the join. On the `store_sales` `CometNativeScan` the metrics move like this: | metric (store_sales scan, summed) | main (DF 53.1) | DF 54 branch | | --- | --- | --- | | Time reading and parsing footer metadata | 2.0 m | 18.0 m | | Wall clock for file opening | 6.0 m | 18.0 m | | Number of bytes scanned | 586.6 MiB | 3.4 GiB | | Bloom filter / page index / row group stats eval | 0 | 0 | | Total time for joining | ~17.4 s | ~19.8 s | Shuffle output is identical (283025 bytes), so results are unchanged. The join moved less than 2s. The extra roughly 2.8 GB and the 9x footer parse are the regression. ### Mechanism The only predicates pushed into the `store_sales` scan are three `IsNotNull` filters on the join keys (`ss_hdemo_sk`, `ss_sold_time_sk`, `ss_store_sk`), which are never null. In DataFusion's `IsNotNull` pruning is rewritten to `null_count != row_count` (`datafusion/pruning/src/pruning_predicate.rs`), so a page is pruned only when it is entirely null. On non-null FK columns that never happens, so the Parquet page index (ColumnIndex plus OffsetIndex) is loaded and prunes nothing. The roughly 2.8 GB is consistent with that page index: about 280 KB per file across 10,237 files. Comet makes this worse than the SQL path because Comet's `CachingParquetReaderFactory` caches only the `Skip`-policy footer. The opener then re-runs `load_page_index` per open (its `missing_column_index` guard is always true), so the page index is re-read and never cached. The DataFusion `ListingTable` path does not show this because its `file_metadata_cache` loads the full metadata once with `PageIndexPolicy::Optional` and reuses it. ### What we ruled out - Not the join. Join time barely moved. - Not a `pushdown_filters` default change. Comet hard-codes `pushdown_filters = true` and `reorder_filters = true` in `get_options` on both branches. - Not bloom filters. Bloom, page index, and row group stats eval are all 0 ms in both runs. - Not an arrow or parquet version bump. Both branches pin arrow and parquet 58.3.0, so the new metadata parser (arrow 57) and the late materialization predicate cache (CachedArrayReader, arrow-rs PR 7850) are the same code in both builds. - Not a DataFusion 53.1 to 54 change in page-index logic. See the decisive test below. ### Decisive test: DF 53.1 and DF 54 behave identically I built a local repro that calls Comet's real scan path (`init_datasource_exec`) against a single wide Parquet file (128 Int32 columns, page index enabled, project one column, push `IsNotNull` on it) and measures `bytes_scanned`. I ran the identical test against DataFusion 53.1.0 and the 54 branch using local checkouts. | | bytes (no filter) | bytes (IsNotNull) | page index delta | | --- | --- | --- | --- | | DF 53.1.0 (Comet main) | 3,591,690 | 5,534,156 | 1,942,466 | | DF 54 branch | 3,591,690 | 5,534,156 | 1,942,466 | The page-index delta is byte-identical. The opener gate (`enable_page_index && page_pruning_predicate.is_some() && filter_number() > 0`), `build_page_pruning_predicate`, and `try_pushdown_filters` are also byte-identical between 53.1 and 54. So the page-index load for `IsNotNull` is pre-existing and version-invariant. The 54 branch did not start loading the page index. ### The open question Production q88 clearly differs between the two runs (586 MiB versus 3.4 GiB), but the isolated single-file test shows no difference between DF 53.1 and 54. So whatever flips it in production is not captured by varying the DataFusion version alone, which is the one variable I changed. Leading suspects, untested: 1. The two benchmark builds differ by more than the DataFusion version. They are different Comet commits and possibly different configs. Holding Comet code constant and varying only DataFusion showed no change, so a Comet-side change or config difference between the benchmarked builds is the most likely cause. 2. Scale and concurrency. 10,237 files and 1,824 partitions on S3. Comet's footer-only reader cache means the page index is re-read per partition, so the cost may only show at scale. ### Reproducer Pure-mechanism repro (single file, no cluster), added as a `#[tokio::test]` next to `init_datasource_exec`: generate a wide Parquet file with a page index, project one column, run the scan with and without an `IsNotNull` pushdown, and compare `bytes_scanned`. The filtered scan reads about 1.9 MB more for zero pruning. Disabling `enable_page_index` removes it. ### Possible directions - Skip loading the page index when the page-pruning predicate is null-presence only, or when row group stats already report `null_count == 0` for the referenced columns, since then `IsNotNull` provably cannot prune any page. - Have Comet's reader cache the full metadata once (load with `PageIndexPolicy::Optional` and store in the OnceCell) so the per-partition re-reads collapse to one, rather than re-reading the page index every open. - For q88 specifically neither helps the pruning (the predicate is non-selective), so the cheapest fix is to avoid loading the page index for null-only predicates. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
