[GitHub] [arrow-datafusion] matthewmturner commented on issue #147: Add DataFusion to h2oai/db-benchmark

GitBox Mon, 21 Feb 2022 20:45:14 -0800


matthewmturner commented on issue #147:
URL: 
https://github.com/apache/arrow-datafusion/issues/147#issuecomment-1047420945



   Cross-post from Slack:
   
   I’m working on updating DataFusion’s db-benchmark results based on DataFusion 
v7. I just got a first cut of the results to compare against what I produced a 
couple of months ago. I was planning on finalizing the analysis before sharing, 
but I wanted to provide a preview as I may not have time to finish for a day or 
two. These results were produced using datafusion-python on an M1 MacBook.
   
   On December 27th we were at the following for group by:
   ```
   0.11225258399999993 # q1
   0.695109333 # q2
   2.932470125 # q3
   0.07341450000000016 # q4
   3.3075385419999996 # q5
   2.9051008750000005 # q7
   4.573697916 # q8
   68.875322208 # q10
   ```
   
   Based on DataFusion v7:
   ```
   q1: 0.03743266599999995
   q2: 0.4997687500000001
   q3: 2.119365208
   q4: 0.034825500000000176
   q5: 2.144292417
   q7: 2.0165450419999997
   q8: 2.9783209999999993
   q10: 47.229685542
   ```
   
   We’ve seen pretty good performance increases across the board on group by 
with the latest release. Compared to the currently published db-benchmark 
results, that would put DataFusion as the fastest, or tied for fastest, on 
groupby queries Q1 and Q4. In general, we had similar results to Spark.
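
As a rough check on the group-by numbers above, the sketch below divides each December timing by the matching v7 timing to get a per-query speedup. The values are copied from the two listings; the dictionary and function names are illustrative only and not part of any benchmark tooling.

```python
# Group-by timings copied from the two listings above (units as reported there).
december = {
    "q1": 0.11225258399999993,
    "q2": 0.695109333,
    "q3": 2.932470125,
    "q4": 0.07341450000000016,
    "q5": 3.3075385419999996,
    "q7": 2.9051008750000005,
    "q8": 4.573697916,
    "q10": 68.875322208,
}
v7 = {
    "q1": 0.03743266599999995,
    "q2": 0.4997687500000001,
    "q3": 2.119365208,
    "q4": 0.034825500000000176,
    "q5": 2.144292417,
    "q7": 2.0165450419999997,
    "q8": 2.9783209999999993,
    "q10": 47.229685542,
}


def speedups(before, after):
    """Return before/after for every query; values above 1 mean `after` is faster."""
    return {query: before[query] / after[query] for query in before}


groupby_speedups = speedups(december, v7)
for query, ratio in groupby_speedups.items():
    print(f"{query}: {ratio:.2f}x")
```

With these numbers every ratio is greater than 1, ranging from about 1.4x (q2, q3) to about 3x (q1).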
   
   For join, in December we had:
   ```
   q1 took 261 ms
   q2 took 367 ms
   q3 took 334 ms
   q4 took 507 ms
   q5 took 1936 ms
   ```
   
   And now we are at:
   ```
   q1: 0.5796001249999999
   q2: 0.4178434580000001
   q3: 0.4701954159999999
   q4: 0.4357888750000001
   q5: 1.8161980410000003
   ```
   We have lost some performance on the join side and I’m not sure why, but 
compared to other engines we are still doing very well, with basically the best 
performance across the board.
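
The two join listings use different formats: the December one is in milliseconds, while the v7 one gives no unit. The sketch below assumes the v7 values are seconds (an assumption, not something stated above), converts the December values to seconds, and reports v7 time divided by December time per query. All names are illustrative.

```python
# December join timings, in milliseconds, copied from the first join listing.
december_ms = {"q1": 261, "q2": 367, "q3": 334, "q4": 507, "q5": 1936}

# v7 join timings copied from the second join listing.
# ASSUMPTION: these are seconds; the listing itself does not state a unit.
v7_seconds = {
    "q1": 0.5796001249999999,
    "q2": 0.4178434580000001,
    "q3": 0.4701954159999999,
    "q4": 0.4357888750000001,
    "q5": 1.8161980410000003,
}

# Put both runs in the same unit before comparing.
december_seconds = {query: ms / 1000.0 for query, ms in december_ms.items()}

# Ratio above 1 means the v7 run took longer than the December run.
join_change = {
    query: v7_seconds[query] / december_seconds[query] for query in december_ms
}
slower_queries = sorted(query for query, ratio in join_change.items() if ratio > 1)

for query, ratio in join_change.items():
    print(f"{query}: {ratio:.2f}x the December time")
```

Under that unit assumption, q1 through q3 take longer than in December (q1 by a bit more than 2x), while q4 and q5 come out slightly faster.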
   
   Please take these results as preliminary; I’m still working through things. 
   
   I’m going to work on adding the missing group by queries now with the latest 
v7 functionality. I was also thinking of contributing a script that would run 
the whole db-benchmark process so that anyone could run db-benchmark as 
needed.
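
As a very rough idea of the timing core such a script might have, here is a minimal sketch. It is not the db-benchmark harness or any existing DataFusion script: `time_query`, `run_benchmark`, the `repeats` parameter, and the placeholder workloads are all hypothetical, and a real script would wrap actual datafusion-python queries in the callables.

```python
import time


def time_query(run_query, repeats=2):
    """Call `run_query` `repeats` times; return each wall-clock duration in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return timings


def run_benchmark(queries, repeats=2):
    """Time every named query and return {name: [durations in seconds]}."""
    return {name: time_query(fn, repeats) for name, fn in queries.items()}


# Placeholder workloads standing in for real benchmark queries (hypothetical).
queries = {
    "q1": lambda: sum(range(1000)),
    "q2": lambda: sorted(range(1000), reverse=True),
}

results = run_benchmark(queries)
for name, timings in results.items():
    print(f"{name}: " + ", ".join(f"{t:.6f}" for t in timings))
```

Keeping each query behind a plain callable means the same loop can time group-by and join queries without knowing how they are executed.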

