mbutrovich commented on issue #3978:
URL: 
https://github.com/apache/datafusion-comet/issues/3978#issuecomment-4635695243

   Benchmarking the DataFusion 54 branch against main, q88 regresses roughly 
12s per iteration (about 21s to 34s). The regression is entirely scan-side, not 
the join. On the `store_sales` `CometNativeScan` the metrics move like this:
   
   | metric (store_sales scan, summed) | main (DF 53.1) | DF 54 branch |
   | --- | --- | --- |
   | Time reading and parsing footer metadata | 2.0 m | 18.0 m |
   | Wall clock for file opening | 6.0 m | 18.0 m |
   | Number of bytes scanned | 586.6 MiB | 3.4 GiB |
   | Bloom filter / page index / row group stats eval | 0 | 0 |
   | Total time for joining | ~17.4 s | ~19.8 s |
   
   Shuffle output is identical (283025 bytes), so results are unchanged. The 
join moved less than 2s. The extra roughly 2.8 GB and the 9x footer parse are 
the regression.
   
   ### Mechanism
   
   The only predicates pushed into the `store_sales` scan are three `IsNotNull` 
filters on the join keys (`ss_hdemo_sk`, `ss_sold_time_sk`, `ss_store_sk`), 
which are never null. In DataFusion's `IsNotNull` pruning is rewritten to 
`null_count != row_count` (`datafusion/pruning/src/pruning_predicate.rs`), so a 
page is pruned only when it is entirely null. On non-null FK columns that never 
happens, so the Parquet page index (ColumnIndex plus OffsetIndex) is loaded and 
prunes nothing. The roughly 2.8 GB is consistent with that page index: about 
280 KB per file across 10,237 files.
   
   Comet makes this worse than the SQL path because Comet's 
`CachingParquetReaderFactory` caches only the `Skip`-policy footer. The opener 
then re-runs `load_page_index` per open (its `missing_column_index` guard is 
always true), so the page index is re-read and never cached. The DataFusion 
`ListingTable` path does not show this because its `file_metadata_cache` loads 
the full metadata once with `PageIndexPolicy::Optional` and reuses it.
   
   ### What we ruled out
   
   - Not the join. Join time barely moved.
   - Not a `pushdown_filters` default change. Comet hard-codes 
`pushdown_filters = true` and `reorder_filters = true` in `get_options` on both 
branches.
   - Not bloom filters. Bloom, page index, and row group stats eval are all 0 
ms in both runs.
   - Not an arrow or parquet version bump. Both branches pin arrow and parquet 
58.3.0, so the new metadata parser (arrow 57) and the late materialization 
predicate cache (CachedArrayReader, arrow-rs PR 7850) are the same code in both 
builds.
   - Not a DataFusion 53.1 to 54 change in page-index logic. See the decisive 
test below.
   
   ### Decisive test: DF 53.1 and DF 54 behave identically
   
   I built a local repro that calls Comet's real scan path 
(`init_datasource_exec`) against a single wide Parquet file (128 Int32 columns, 
page index enabled, project one column, push `IsNotNull` on it) and measures 
`bytes_scanned`. I ran the identical test against DataFusion 53.1.0 and the 54 
branch using local checkouts.
   
   | | bytes (no filter) | bytes (IsNotNull) | page index delta |
   | --- | --- | --- | --- |
   | DF 53.1.0 (Comet main) | 3,591,690 | 5,534,156 | 1,942,466 |
   | DF 54 branch | 3,591,690 | 5,534,156 | 1,942,466 |
   
   The page-index delta is byte-identical. The opener gate (`enable_page_index 
&& page_pruning_predicate.is_some() && filter_number() > 0`), 
`build_page_pruning_predicate`, and `try_pushdown_filters` are also 
byte-identical between 53.1 and 54. So the page-index load for `IsNotNull` is 
pre-existing and version-invariant. The 54 branch did not start loading the 
page index.
   
   ### The open question
   
   Production q88 clearly differs between the two runs (586 MiB versus 3.4 
GiB), but the isolated single-file test shows no difference between DF 53.1 and 
54. So whatever flips it in production is not captured by varying the 
DataFusion version alone, which is the one variable I changed.
   
   Leading suspects, untested:
   
   1. The two benchmark builds differ by more than the DataFusion version. They 
are different Comet commits and possibly different configs. Holding Comet code 
constant and varying only DataFusion showed no change, so a Comet-side change 
or config difference between the benchmarked builds is the most likely cause.
   2. Scale and concurrency. 10,237 files and 1,824 partitions on S3. Comet's 
footer-only reader cache means the page index is re-read per partition, so the 
cost may only show at scale.
   
   ### Reproducer
   
   Pure-mechanism repro (single file, no cluster), added as a `#[tokio::test]` 
next to `init_datasource_exec`: generate a wide Parquet file with a page index, 
project one column, run the scan with and without an `IsNotNull` pushdown, and 
compare `bytes_scanned`. The filtered scan reads about 1.9 MB more for zero 
pruning. Disabling `enable_page_index` removes it.
   
   ### Possible directions
   
   - Skip loading the page index when the page-pruning predicate is 
null-presence only, or when row group stats already report `null_count == 0` 
for the referenced columns, since then `IsNotNull` provably cannot prune any 
page.
   - Have Comet's reader cache the full metadata once (load with 
`PageIndexPolicy::Optional` and store in the OnceCell) so the per-partition 
re-reads collapse to one, rather than re-reading the page index every open.
   - For q88 specifically neither helps the pruning (the predicate is 
non-selective), so the cheapest fix is to avoid loading the page index for 
null-only predicates.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to