Re: [I] Optimize the join operators [datafusion]

via GitHub Wed, 16 Jul 2025 05:07:06 -0700


zhuqi-lucas commented on issue #16710:
URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3078275180


   > [@zhuqi-lucas](https://github.com/zhuqi-lucas) - these benchmarks use 
Parquet files, see the querybench repo for the code: 
https://github.com/MrPowers/querybench. I think Parquet is a lot better for 
these benchmarks.
   > 
   > The data generation scripts are in falsa if you'd like to generate the 
files locally: https://github.com/mrpowers-io/falsa/ (thanks 
[@SemyonSinchenko](https://github.com/SemyonSinchenko)!)
   > 
   > DuckDB isn't included because it can't handle the joins on my machine with 
the 1e8 datasets. I guess it runs out of memory. It can handle the 1e7 datasets 
fine.
   > 
   > There are 5 h2o join queries and q5 is omitted (the join between two large 
tables) because no engine can handle joining the 1e8 table with another 1e8 
table on my machine with 16GB of RAM.
   
   Thank you @mrpowers-wb for the good explanation. I will first submit a PR 
to add Parquet support to the DataFusion h2o benchmark, so that we can use that 
tool as the basis for optimizing against this comparison.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


