Dandandan commented on issue #6782: URL: https://github.com/apache/arrow-datafusion/issues/6782#issuecomment-1832775559
I found one issue in the [benchmarks](https://github.com/JayjeetAtGithub/datafusion-duckdb-benchmark). We're using `fetchall` rather than something more optimized like `fetch_arrow_table`.

This is not a problem when the output is small, but for large outputs it penalizes DuckDB, because it has to convert each individual row to Python objects (rather than e.g. doing it per batch or keeping most of the data outside of Python).

Running locally with `fetchall` (query 10 of the h2o benchmarks has a large output):

```
qnum: 10
SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM h2o GROUP BY id1, id2, id3, id4, id5, id6;
8.523120959027437
9.530327208980452
9.380471624986967
9.175977834005607
9.123251708020689
```

When using `fetch_arrow_table` it is quite a bit faster (roughly 2.5x):

```
qnum: 10
SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM h2o GROUP BY id1, id2, id3, id4, id5, id6;
3.766681333014276
3.5825100420042872
3.5103830420121085
3.5395747090224177
3.5533452079980634
```
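For anyone who wants to reproduce the difference without the full h2o dataset, here is a minimal sketch of the comparison. It is not the benchmark repository's actual script: the helper names `time_fetch` and `compare_duckdb_fetch`, the synthetic table `t`, and the row count are all assumptions made for illustration. It assumes the `duckdb` and `pyarrow` Python packages are installed, and it times query execution and result fetching together.

```python
import time


def time_fetch(run_query, fetch, repeats=5):
    """Time `fetch(run_query())` `repeats` times; return elapsed seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fetch(run_query())
        timings.append(time.perf_counter() - start)
    return timings


def compare_duckdb_fetch(n_rows=1_000_000, repeats=5):
    """Compare fetchall() with fetch_arrow_table() on a synthetic large result."""
    # Imported lazily so time_fetch stays usable without DuckDB installed;
    # fetch_arrow_table() additionally requires pyarrow.
    import duckdb

    con = duckdb.connect()  # in-memory database
    # Hypothetical stand-in for the h2o table: every id1 is distinct, so the
    # GROUP BY below returns n_rows rows (a "large output" case).
    con.execute(
        f"CREATE TABLE t AS SELECT i AS id1, i % 100 AS v3 "
        f"FROM range({int(n_rows)}) r(i)"
    )
    query = "SELECT id1, sum(v3) AS v3, count(*) AS count FROM t GROUP BY id1"

    return {
        # Row-by-row conversion into Python tuples.
        "fetchall": time_fetch(
            lambda: con.execute(query), lambda r: r.fetchall(), repeats
        ),
        # Columnar hand-off as a pyarrow.Table.
        "fetch_arrow_table": time_fetch(
            lambda: con.execute(query), lambda r: r.fetch_arrow_table(), repeats
        ),
    }
```

Calling `compare_duckdb_fetch()` returns a dict with one list of per-run timings for each fetch method. The absolute numbers will depend on the machine and on the result size, so they should not be expected to match the figures above.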
