Re: [PR] Speedup `DFSchema::merge` using HashSet indices [arrow-datafusion]

via GitHub Sun, 28 Jan 2024 01:28:18 -0800


simonvandel commented on code in PR #9020:
URL: https://github.com/apache/arrow-datafusion/pull/9020#discussion_r1468801356



##########
datafusion/common/src/dfschema.rs:
##########
@@ -218,17 +218,28 @@ impl DFSchema {
         if other_schema.fields.is_empty() {
             return;
         }
+
+        let self_fields: HashSet<&DFField> = self.fields.iter().collect();

Review Comment:
   I tried building both sets in a single loop (patch below), but it had no 
measurable impact on performance, so I'll keep the `collect` version, which I 
find nicer.
   
   <details>
   
   <summary>Patch</summary>
   
   
   ```diff
   diff --git a/datafusion/common/src/dfschema.rs b/datafusion/common/src/dfschema.rs
   index 2642032c9..6d9e50e09 100644
   --- a/datafusion/common/src/dfschema.rs
   +++ b/datafusion/common/src/dfschema.rs
   @@ -219,9 +219,15 @@ impl DFSchema {
                return;
            }
    
   -        let self_fields: HashSet<&DFField> = self.fields.iter().collect();
   -        let self_unqualified_names: HashSet<&str> =
   -            self.fields.iter().map(|x| x.name().as_str()).collect();
   +        let mut self_fields: HashSet<&DFField> =
   +            HashSet::with_capacity(self.fields.len());
   +        let mut self_unqualified_names: HashSet<&str> =
   +            HashSet::with_capacity(self.fields.len());
   +
   +        for f in &self.fields {
   +            self_fields.insert(f);
   +            self_unqualified_names.insert(f.name().as_str());
   +        }
    
            let mut fields_to_add = vec![];
   ```
   
   </details>
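
For reference, the two variants being compared can be sketched without any DataFusion types. This is only an illustration: `Field` is a hypothetical stand-in for `DFField`, and `build_with_collect` / `build_with_loop` are made-up names, not functions in the PR.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for `DFField`; only what the sketch needs.
#[derive(Debug, PartialEq, Eq, Hash)]
struct Field {
    qualifier: Option<String>,
    name: String,
}

// Variant kept in the PR: iterate twice, one `collect` per set.
fn build_with_collect(fields: &[Field]) -> (HashSet<&Field>, HashSet<&str>) {
    let by_field: HashSet<&Field> = fields.iter().collect();
    let by_name: HashSet<&str> = fields.iter().map(|f| f.name.as_str()).collect();
    (by_field, by_name)
}

// Variant from the patch: iterate once, with both sets pre-sized.
fn build_with_loop(fields: &[Field]) -> (HashSet<&Field>, HashSet<&str>) {
    let mut by_field = HashSet::with_capacity(fields.len());
    let mut by_name = HashSet::with_capacity(fields.len());
    for f in fields {
        by_field.insert(f);
        by_name.insert(f.name.as_str());
    }
    (by_field, by_name)
}
```

Both functions produce the same two sets. The single-loop form saves one pass over the fields, which the benchmark results below suggest does not matter at these sizes.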
   
   <details>
   
   <summary>Single vs double iteration benchmark results</summary>
   
   ```
   Benchmarking logical_select_one_from_700: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.7s, enable flat sampling, or reduce sample count to 60.
   logical_select_one_from_700
                           time:   [1.3132 ms 1.3188 ms 1.3251 ms]
                           change: [-0.4502% +0.0956% +0.6717%] (p = 0.73 > 0.05)
                           No change in performance detected.
   Found 10 outliers among 100 measurements (10.00%)
     1 (1.00%) low severe
     2 (2.00%) low mild
     5 (5.00%) high mild
     2 (2.00%) high severe
   
   physical_select_one_from_700
                           time:   [4.5525 ms 4.5705 ms 4.5909 ms]
                           change: [-1.5314% -1.0539% -0.6109%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 11 outliers among 100 measurements (11.00%)
     3 (3.00%) high mild
     8 (8.00%) high severe
   
   Benchmarking logical_trivial_join_low_numbered_columns: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.8s, enable flat sampling, or reduce sample count to 60.
   logical_trivial_join_low_numbered_columns
                           time:   [1.3299 ms 1.3379 ms 1.3478 ms]
                           change: [-1.0025% +0.0082% +1.0720%] (p = 0.99 > 0.05)
                           No change in performance detected.
   Found 8 outliers among 100 measurements (8.00%)
     4 (4.00%) high mild
     4 (4.00%) high severe
   
   Benchmarking logical_trivial_join_high_numbered_columns: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.0s, enable flat sampling, or reduce sample count to 50.
   logical_trivial_join_high_numbered_columns
                           time:   [1.3746 ms 1.3806 ms 1.3873 ms]
                           change: [-0.8147% -0.1908% +0.5022%] (p = 0.57 > 0.05)
                           No change in performance detected.
   Found 17 outliers among 100 measurements (17.00%)
     2 (2.00%) low mild
     11 (11.00%) high mild
     4 (4.00%) high severe
   
   Benchmarking logical_aggregate_with_join: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, enable flat sampling, or reduce sample count to 50.
   logical_aggregate_with_join
                           time:   [1.8058 ms 1.8132 ms 1.8212 ms]
                           change: [-0.1207% +0.7180% +1.5705%] (p = 0.11 > 0.05)
                           No change in performance detected.
   Found 10 outliers among 100 measurements (10.00%)
     4 (4.00%) high mild
     6 (6.00%) high severe
   
   physical_plan_tpch      time:   [5.5699 ms 5.5904 ms 5.6136 ms]
                           change: [-0.5328% +0.0473% +0.6442%] (p = 0.88 > 0.05)
                           No change in performance detected.
   Found 7 outliers among 100 measurements (7.00%)
     7 (7.00%) high severe
   ```
   
   </details>
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

