xudong963 commented on PR #21637:
URL: https://github.com/apache/datafusion/pull/21637#issuecomment-4448087201

   > 🤔 the benchmarks look slower -- maybe we can profile some of those queries 
and find space to get the performance back
   
   @alamb Good finding to avoid the PR introducing regression!
   
   I profiled the repeated ClickBench partitioned slow queries (`q6`, `q29`, 
and `q41`) on the PR build.
   
   `q29` was dominated by normal parquet decode / aggregation work 
(`snap::decompress`, RLE decoding, `SumAccumulator`), so I did not see a 
PR-specific hot spot there.
   
   `q6` was more useful: it was dominated by parquet 
open/planning/statistics/metrics setup rather than decode work. In particular, 
`ParquetFileMetrics::new`, `MetricBuilder::build`, and 
`LazyParquetSummaryCount` construction/destruction showed up in the sample 
profile. Since `q6` has no filters, this suggested the regression was from 
fixed per-file setup overhead rather than the fully-matched pruning path itself.
   
   The issue was that `ParquetFileMetrics::new` created a 
`LazyParquetSummaryCount` for `page_index_pages_skipped_by_fully_matched` for 
every opened file. Even though the counter was only registered on first use, 
constructing the lazy wrapper still cloned the filename, cloned the metrics 
set, and allocated an `Arc<OnceLock<_>>` for every file, including queries that 
never used this metric.
   
   I fixed this by removing the per-file `LazyParquetSummaryCount` field 
entirely. Page pruning now returns the `pages_skipped_by_fully_matched` count, 
and the opener registers `page_index_pages_skipped_by_fully_matched` only when 
that count is non-zero, using the already available `PreparedParquetOpen` 
filename / partition / metrics context. This keeps `ParquetFileMetrics::new` 
off the extra allocation/clone path for the common case.
   
   Now the benchmark is good: 
https://github.com/apache/datafusion/pull/21637#issuecomment-4447945184


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to