simonvandel commented on code in PR #9020:
URL: https://github.com/apache/arrow-datafusion/pull/9020#discussion_r1468801356
##########
datafusion/common/src/dfschema.rs:
##########
@@ -218,17 +218,28 @@ impl DFSchema {
if other_schema.fields.is_empty() {
return;
}
+
+ let self_fields: HashSet<&DFField> = self.fields.iter().collect();
Review Comment:
I tried this, but it didn't seem like it had any impact on performance, so
I'll keep the collect, which I find nicer.
<details>
<summary>Patch</summary>
```diff
diff --git a/datafusion/common/src/dfschema.rs
b/datafusion/common/src/dfschema.rs
index 2642032c9..6d9e50e09 100644
--- a/datafusion/common/src/dfschema.rs
+++ b/datafusion/common/src/dfschema.rs
@@ -219,9 +219,15 @@ impl DFSchema {
return;
}
- let self_fields: HashSet<&DFField> = self.fields.iter().collect();
- let self_unqualified_names: HashSet<&str> =
- self.fields.iter().map(|x| x.name().as_str()).collect();
+ let mut self_fields: HashSet<&DFField> =
+ HashSet::with_capacity(self.fields.len());
+ let mut self_unqualified_names: HashSet<&str> =
+ HashSet::with_capacity(self.fields.len());
+
+ for f in &self.fields {
+ self_fields.insert(f);
+ self_unqualified_names.insert(f.name().as_str());
+ }
let mut fields_to_add = vec![];
```
</details>
<details>
<summary>Single vs double iteration benchmark results</summary>
```
Benchmarking logical_select_one_from_700: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 6.7s, enable flat sampling, or reduce sample count to 60.
logical_select_one_from_700
time: [1.3132 ms 1.3188 ms 1.3251 ms]
change: [-0.4502% +0.0956% +0.6717%] (p = 0.73 >
0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
1 (1.00%) low severe
2 (2.00%) low mild
5 (5.00%) high mild
2 (2.00%) high severe
physical_select_one_from_700
time: [4.5525 ms 4.5705 ms 4.5909 ms]
change: [-1.5314% -1.0539% -0.6109%] (p = 0.00 <
0.05)
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
3 (3.00%) high mild
8 (8.00%) high severe
Benchmarking logical_trivial_join_low_numbered_columns: Warming up for
3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 6.8s, enable flat sampling, or reduce sample count to 60.
logical_trivial_join_low_numbered_columns
time: [1.3299 ms 1.3379 ms 1.3478 ms]
change: [-1.0025% +0.0082% +1.0720%] (p = 0.99 >
0.05)
No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
4 (4.00%) high mild
4 (4.00%) high severe
Benchmarking logical_trivial_join_high_numbered_columns: Warming up for
3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 7.0s, enable flat sampling, or reduce sample count to 50.
logical_trivial_join_high_numbered_columns
time: [1.3746 ms 1.3806 ms 1.3873 ms]
change: [-0.8147% -0.1908% +0.5022%] (p = 0.57 >
0.05)
No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
2 (2.00%) low mild
11 (11.00%) high mild
4 (4.00%) high severe
Benchmarking logical_aggregate_with_join: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 9.2s, enable flat sampling, or reduce sample count to 50.
logical_aggregate_with_join
time: [1.8058 ms 1.8132 ms 1.8212 ms]
change: [-0.1207% +0.7180% +1.5705%] (p = 0.11 >
0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
physical_plan_tpch time: [5.5699 ms 5.5904 ms 5.6136 ms]
change: [-0.5328% +0.0473% +0.6442%] (p = 0.88 >
0.05)
No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
7 (7.00%) high severe
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]