korowa commented on PR #5490: URL: https://github.com/apache/arrow-datafusion/pull/5490#issuecomment-1459826565
Thanks @Dandandan!

> I think it would be great if we could run some benchmarks to show that we're not regressing too much (e.g. running tpch benchmark queries with joins).

I've run a couple of iterations of

```
cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ./parquet --format parquet
```

with the following results (the numbers fluctuate a bit, but I can't see any noticeable regression -- though of course I may be wrong):

```
Query                   branch        main
------------------------------------------
Query 1  avg time:  1047.93 ms  1135.36 ms
Query 2  avg time:   280.91 ms   286.69 ms
Query 3  avg time:   323.87 ms   351.31 ms
Query 4  avg time:   146.87 ms   146.58 ms
Query 5  avg time:   482.85 ms   463.07 ms
Query 6  avg time:   274.73 ms   342.29 ms
Query 7  avg time:   750.73 ms   762.43 ms
Query 8  avg time:   443.34 ms   426.89 ms
Query 9  avg time:   821.48 ms   775.03 ms
Query 10 avg time:   585.21 ms   584.16 ms
Query 11 avg time:   247.56 ms   232.90 ms
Query 12 avg time:   258.51 ms   231.19 ms
Query 13 avg time:   899.16 ms   885.56 ms
Query 14 avg time:   300.63 ms   282.56 ms
Query 15 avg time:   346.36 ms   318.97 ms
Query 16 avg time:   198.33 ms   184.26 ms
Query 17 avg time:  4197.54 ms  4101.92 ms
Query 18 avg time:  2726.41 ms  2548.96 ms
Query 19 avg time:   566.67 ms   535.74 ms
Query 20 avg time:  1193.82 ms  1319.49 ms
Query 21 avg time:  1027.00 ms  1050.08 ms
Query 22 avg time:   120.03 ms   111.32 ms
```

Also,

```
cargo run --release --bin tpch -- benchmark datafusion --query 17 --iterations 5 --path ./parquet --format parquet --debug | grep -e "build_time" -e "iteration"
```

does not show any difference in join build / probe times. The full log is attached: [mem_mgmt_benches.txt](https://github.com/apache/arrow-datafusion/files/10918984/mem_mgmt_benches.txt)

For now, I tend to think that these changes do not affect performance significantly, since the reservations and memory estimations are executed only once per partition/query, but I'll be happy to perform any additional checks.

> Some reasons I defaulted to initializing the hashmap using the size of the left side is as following

All of them look pretty reasonable, and I agree with this approach; I was just trying to explain why I decided to duplicate calculations from some functions subsequently called by `RawTable::with_capacity`.
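For reference, here is roughly the kind of calculation I mean -- a minimal sketch, not the actual PR code: `estimate_hashtable_size` is a hypothetical helper, and the constants mirror hashbrown's 7/8 load factor and power-of-two bucket rounding as I understand them (the real `capacity_to_buckets` is private, hence the duplication):

```rust
use std::mem::size_of;

/// Hypothetical helper approximating the bucket-count math hashbrown
/// performs internally for `RawTable::with_capacity`.
fn estimate_hashtable_size<T>(num_entries: usize) -> usize {
    // hashbrown keeps the load factor below 7/8, then rounds the bucket
    // count up to a power of two; very small tables are special-cased.
    let buckets = if num_entries < 8 {
        8
    } else {
        (num_entries * 8 / 7).next_power_of_two()
    };
    // One `T` slot per bucket plus one control byte per bucket (alignment
    // padding and the trailing control group are ignored for simplicity).
    buckets * size_of::<T>() + buckets
}

fn main() {
    // e.g. a build-side table of 1M (hash, row offset) pairs
    println!("~{} bytes", estimate_hashtable_size::<(u64, u64)>(1_000_000));
}
```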
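And on the "once per partition/query" point above, a minimal sketch of the reservation pattern, assuming the `MemoryConsumer` / `MemoryReservation` API from `datafusion_execution::memory_pool` (the consumer name, pool limit, and size estimate here are made up):

```rust
use std::sync::Arc;

use datafusion_execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

fn main() -> datafusion_common::Result<()> {
    // Pool shared by the whole query, capped at 1 MiB for the example.
    let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(1024 * 1024));

    // One consumer per partition; the estimated build-side size is reserved
    // once, up front, rather than on every probed batch.
    let mut reservation = MemoryConsumer::new("HashJoinInput[0]").register(&pool);
    let estimated_build_side = 64 * 1024; // hypothetical size estimate
    reservation.try_grow(estimated_build_side)?;

    // ... build and probe the hash table; the reservation frees on drop.
    Ok(())
}
```

Since the reservation is grown a single time with the build-side estimate, there is no per-batch accounting on the probe side, which is consistent with the flat build/probe timings above.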
