Re: [I] Easier Dataframe API for `map` [datafusion]

via GitHub Sat, 20 Jul 2024 06:54:59 -0700


goldmedal commented on issue #11546:
URL: https://github.com/apache/datafusion/issues/11546#issuecomment-2241158991


   Here are the benchmark results after removing `make_array` (I also pushed a 
new commit to the draft PR):
   ```
   map_1000                time:   [10.105 ms 10.168 ms 10.233 ms]
                           change: [+0.0989% +1.5780% +2.8979%] (p = 0.02 < 0.05)
                           Change within noise threshold.
   Found 3 outliers among 100 measurements (3.00%)
     3 (3.00%) high mild
   
   map_one_1000            time:   [44.081 ms 45.278 ms 46.808 ms]
                           change: [+1.8229% +4.9320% +8.3942%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 9 outliers among 100 measurements (9.00%)
     3 (3.00%) high mild
     6 (6.00%) high severe
   ```
   
   I think the result is really bad, but I tried to understand why `make_array` 
is efficient. I noticed that it uses `make_scalar_function` to handle the 
`ColumnarValue` arguments, which I guess could be more efficient than calling 
`ScalarValue::into_array`.
   
https://github.com/apache/datafusion/blob/5da7ab300215c44ca5dc16771091890de22af99b/datafusion/functions-array/src/make_array.rs#L102-L104
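
A minimal sketch of the wrapper pattern described above may help. It uses simplified stand-in types (a `ColumnarValue` over `i64` and a hypothetical `add_kernel`), not the real DataFusion or Arrow types, and its `make_scalar_function` only imitates the general shape of the helper of the same name. The idea is that scalars are expanded once at the function boundary, so the inner kernel only ever sees arrays. It says nothing about actual performance.

```rust
// Simplified stand-ins for illustration only; these are NOT the DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Array(Vec<i64>),
    Scalar(i64),
}

/// Wraps an array-only kernel so that it accepts a mix of scalars and arrays.
/// Assumption: every array argument has the same length.
fn make_scalar_function<F>(inner: F) -> impl Fn(&[ColumnarValue]) -> ColumnarValue
where
    F: Fn(&[Vec<i64>]) -> Vec<i64>,
{
    move |args: &[ColumnarValue]| {
        // Find the batch length from the array arguments, if there are any.
        let len = args.iter().fold(Option::<usize>::None, |acc, arg| match arg {
            ColumnarValue::Array(a) => Some(a.len()),
            ColumnarValue::Scalar(_) => acc,
        });
        let all_scalar = len.is_none();
        let n = len.unwrap_or(1);

        // Expand each scalar to the batch length once, at the boundary.
        let arrays: Vec<Vec<i64>> = args
            .iter()
            .map(|arg| match arg {
                ColumnarValue::Array(a) => a.clone(),
                ColumnarValue::Scalar(s) => vec![*s; n],
            })
            .collect();

        let result = inner(&arrays);

        if all_scalar {
            // All inputs were scalars, so collapse the one-row result back to
            // a scalar (assumes the kernel returned at least one row).
            ColumnarValue::Scalar(result[0])
        } else {
            ColumnarValue::Array(result)
        }
    }
}

/// Hypothetical kernel: element-wise sum across all argument arrays.
fn add_kernel(args: &[Vec<i64>]) -> Vec<i64> {
    let len = args.first().map_or(0, |a| a.len());
    (0..len).map(|i| args.iter().map(|a| a[i]).sum()).collect()
}

fn main() {
    let add = make_scalar_function(add_kernel);

    // Mixed input: the scalar is broadcast to the array length.
    let mixed = add(&[ColumnarValue::Array(vec![1, 2, 3]), ColumnarValue::Scalar(10)]);
    println!("{:?}", mixed);

    // Scalar-only input: the result is collapsed back to a scalar.
    let scalars = add(&[ColumnarValue::Scalar(1), ColumnarValue::Scalar(2)]);
    println!("{:?}", scalars);
}
```

With this shape, the kernel itself never has to branch on scalar versus array input; the conversion happens in one place.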
   
   I will try applying this approach to both versions and post another 
benchmark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


