ICDE) [arrow-datafusion]

via GitHub Wed, 29 Nov 2023 14:05:33 -0800


Dandandan commented on issue #6782:
URL: https://github.com/apache/arrow-datafusion/issues/6782#issuecomment-1832775559


   I found one issue in the [benchmarks](https://github.com/JayjeetAtGithub/datafusion-duckdb-benchmark).
   We're using `fetchall` rather than something more optimized like `fetch_arrow_table`. This is not a problem when the output is small, but for large outputs it penalizes DuckDB, because it has to convert each individual row to Python objects (rather than, e.g., converting per batch or keeping most of the data outside of Python).
   
   Running locally with `fetchall` (query 10 of the h2o benchmarks has a large output):
   ```
   qnum: 10
   SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM h2o GROUP BY id1, id2, id3, id4, id5, id6;
   
   8.523120959027437
   9.530327208980452
   9.380471624986967
   9.175977834005607
   9.123251708020689
   ```
   
   When using `fetch_arrow_table`, it is quite a bit faster (roughly 2.5x on these runs):
   ```
   qnum: 10
   SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM h2o GROUP BY id1, id2, id3, id4, id5, id6;
   
   3.766681333014276
   3.5825100420042872
   3.5103830420121085
   3.5395747090224177
   3.5533452079980634
   ```
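
For anyone who wants to reproduce the difference, here is a minimal sketch of the comparison. It is not the benchmark repo's actual harness: the table `t`, the helper names `time_fetch` and `compare`, and the synthetic data are all made up for illustration, and it assumes the `duckdb` and `pyarrow` packages are installed.

```python
import time

try:
    import duckdb  # third-party; fetch_arrow_table() also needs pyarrow
except ImportError:
    duckdb = None  # lets the timing helper be imported without duckdb


def time_fetch(con, sql, fetch):
    """Run `sql` on `con`, materialize the result with `fetch`, return (seconds, result)."""
    start = time.perf_counter()
    result = fetch(con.execute(sql))
    return time.perf_counter() - start, result


def compare(n_rows=200_000):
    """Time fetchall vs fetch_arrow_table on a hypothetical large-output GROUP BY."""
    con = duckdb.connect()  # in-memory database
    # Synthetic stand-in for the h2o table: one group per row, so the output is large.
    con.execute(
        f"CREATE TABLE t AS SELECT range AS id, range % 7 AS v FROM range({n_rows})"
    )
    sql = "SELECT id, sum(v) AS v FROM t GROUP BY id"
    # fetchall builds one Python tuple per output row.
    t_rows, rows = time_fetch(con, sql, lambda r: r.fetchall())
    # fetch_arrow_table keeps the result in columnar Arrow buffers.
    t_arrow, table = time_fetch(con, sql, lambda r: r.fetch_arrow_table())
    return t_rows, t_arrow, len(rows), table.num_rows


if __name__ == "__main__" and duckdb is not None:
    t_rows, t_arrow, n1, n2 = compare()
    print(f"fetchall:          {t_rows:.3f}s ({n1} rows)")
    print(f"fetch_arrow_table: {t_arrow:.3f}s ({n2} rows)")
```

Both timings include query execution, so the gap between them is only the cost of materializing the result. Absolute numbers will depend on the machine and on the data.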



Re: [I] Write DataFusion paper for (SIGMOD / VLDB / ICDE) [arrow-datafusion]
