This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new b8a2c1ad9e [parquet] Avoid a clone while resolving the read strategy
(#9056)
b8a2c1ad9e is described below
commit b8a2c1ad9ea7a1b59350735ef3c52e6397406768
Author: Andrew Lamb <[email protected]>
AuthorDate: Mon Jan 5 13:35:31 2026 -0500
[parquet] Avoid a clone while resolving the read strategy (#9056)
# Which issue does this PR close?
- related to https://github.com/apache/datafusion/pull/19477
# Rationale for this change
While working on https://github.com/apache/datafusion/pull/19477 and
profiling ClickBench q7, I noticed that the RowSelectors were being
cloned to resolve the strategy. For a large number of selections this
is expensive and shows up in the traces:
<img width="1724" height="1074" alt="Screenshot 2025-12-28 at 4 49
49 PM"
src="https://github.com/user-attachments/assets/72c6fd22-9377-48ef-ba80-6bc03b177cf7"
/>
```shell
samply record -- ./datafusion-cli-alamb_enable_pushdown -f q.sql > /dev/null 2>&1
```
We should change the code to avoid cloning the RowSelectors when
resolving the strategy.
# Changes
Don't clone or allocate while resolving the strategy.
I don't expect this to have a massive impact, but it did show up in the
profile.
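
For illustration, here is a minimal standalone sketch of the single-pass approach. The `RowSelector` struct and `RowSelectionStrategy` enum below are simplified stand-ins defined only for this example, not the parquet crate's real types. `resolve_strategy` is a hypothetical free function that takes `threshold` as a parameter, and it leaves out the case where no selection is present.

```rust
// Simplified stand-ins so the sketch compiles on its own; these are
// assumptions for illustration, not the parquet crate's definitions.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

#[derive(Debug, PartialEq)]
enum RowSelectionStrategy {
    Mask,
    Selectors,
}

/// Hypothetical helper: choose a strategy in one pass over borrowed
/// selectors, without cloning them or building a temporary Vec.
fn resolve_strategy(selectors: &[RowSelector], threshold: usize) -> RowSelectionStrategy {
    // total_rows: rows covered by non-empty selectors (selected or skipped)
    // effective_count: number of non-empty selectors
    let (total_rows, effective_count) =
        selectors.iter().fold((0usize, 0usize), |(rows, count), s| {
            if s.row_count > 0 {
                (rows + s.row_count, count + 1)
            } else {
                (rows, count)
            }
        });

    if effective_count == 0 {
        return RowSelectionStrategy::Mask;
    }

    // Equivalent to comparing the average run length against `threshold`,
    // written as a multiplication to avoid a division.
    if total_rows < effective_count.saturating_mul(threshold) {
        RowSelectionStrategy::Mask
    } else {
        RowSelectionStrategy::Selectors
    }
}
```

With a threshold of 32, two non-empty selectors covering 5 rows in total resolve to `Mask` (5 < 64), while two selectors covering 1000 rows resolve to `Selectors`. Zero-length selectors are ignored in both the row total and the count.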
FYI @hhhizzz -- perhaps you could review this PR
# Are these changes tested?
Yes, by CI.
# Are there any user-facing changes?
A small performance improvement.
---
parquet/src/arrow/arrow_reader/read_plan.rs | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/parquet/src/arrow/arrow_reader/read_plan.rs b/parquet/src/arrow/arrow_reader/read_plan.rs
index 3c17a358f0..7c9eb36bef 100644
--- a/parquet/src/arrow/arrow_reader/read_plan.rs
+++ b/parquet/src/arrow/arrow_reader/read_plan.rs
@@ -110,19 +110,22 @@ impl ReadPlanBuilder {
None => return RowSelectionStrategy::Selectors,
};
- let trimmed = selection.clone().trim();
- let selectors: Vec<RowSelector> = trimmed.into();
- if selectors.is_empty() {
- return RowSelectionStrategy::Mask;
- }
-
- let total_rows: usize = selectors.iter().map(|s| s.row_count).sum();
- let selector_count = selectors.len();
- if selector_count == 0 {
+ // total_rows: total number of rows selected / skipped
+ // effective_count: number of non-empty selectors
+ let (total_rows, effective_count) =
+ selection.iter().fold((0usize, 0usize), |(rows, count), s| {
+ if s.row_count > 0 {
+ (rows + s.row_count, count + 1)
+ } else {
+ (rows, count)
+ }
+ });
+
+ if effective_count == 0 {
return RowSelectionStrategy::Mask;
}
- if total_rows < selector_count.saturating_mul(threshold) {
+ if total_rows < effective_count.saturating_mul(threshold) {
RowSelectionStrategy::Mask
} else {
RowSelectionStrategy::Selectors