xudong963 commented on PR #21637: URL: https://github.com/apache/datafusion/pull/21637#issuecomment-4448087201
> 🤔 the benchmarks look slower -- maybe we can profile some of those queries and find space to get the performance back @alamb Good finding to avoid the PR introducing regression! I profiled the repeated ClickBench partitioned slow queries (`q6`, `q29`, and `q41`) on the PR build. `q29` was dominated by normal parquet decode / aggregation work (`snap::decompress`, RLE decoding, `SumAccumulator`), so I did not see a PR-specific hot spot there. `q6` was more useful: it was dominated by parquet open/planning/statistics/metrics setup rather than decode work. In particular, `ParquetFileMetrics::new`, `MetricBuilder::build`, and `LazyParquetSummaryCount` construction/destruction showed up in the sample profile. Since `q6` has no filters, this suggested the regression was from fixed per-file setup overhead rather than the fully-matched pruning path itself. The issue was that `ParquetFileMetrics::new` created a `LazyParquetSummaryCount` for `page_index_pages_skipped_by_fully_matched` for every opened file. Even though the counter was only registered on first use, constructing the lazy wrapper still cloned the filename, cloned the metrics set, and allocated an `Arc<OnceLock<_>>` for every file, including queries that never used this metric. I fixed this by removing the per-file `LazyParquetSummaryCount` field entirely. Page pruning now returns the `pages_skipped_by_fully_matched` count, and the opener registers `page_index_pages_skipped_by_fully_matched` only when that count is non-zero, using the already available `PreparedParquetOpen` filename / partition / metrics context. This keeps `ParquetFileMetrics::new` off the extra allocation/clone path for the common case. Now the benchmark is good: https://github.com/apache/datafusion/pull/21637#issuecomment-4447945184 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
