korowa commented on PR #5490: URL: https://github.com/apache/arrow-datafusion/pull/5490#issuecomment-1459826565
Thanks @Dandandan!

> I think it would be great if we could run some benchmarks to show that we're not regressing too much (e.g. running tpch benchmark queries with joins).

I've run a couple of iterations of

```
cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ./parquet --format parquet
```

with the following results (the numbers fluctuate a bit, but I can't see any noticeable regression -- though of course I may be wrong):

```
Query                   branch        main
------------------------------------------
Query 1  avg time:  1047.93 ms  1135.36 ms
Query 2  avg time:   280.91 ms   286.69 ms
Query 3  avg time:   323.87 ms   351.31 ms
Query 4  avg time:   146.87 ms   146.58 ms
Query 5  avg time:   482.85 ms   463.07 ms
Query 6  avg time:   274.73 ms   342.29 ms
Query 7  avg time:   750.73 ms   762.43 ms
Query 8  avg time:   443.34 ms   426.89 ms
Query 9  avg time:   821.48 ms   775.03 ms
Query 10 avg time:   585.21 ms   584.16 ms
Query 11 avg time:   247.56 ms   232.90 ms
Query 12 avg time:   258.51 ms   231.19 ms
Query 13 avg time:   899.16 ms   885.56 ms
Query 14 avg time:   300.63 ms   282.56 ms
Query 15 avg time:   346.36 ms   318.97 ms
Query 16 avg time:   198.33 ms   184.26 ms
Query 17 avg time:  4197.54 ms  4101.92 ms
Query 18 avg time:  2726.41 ms  2548.96 ms
Query 19 avg time:   566.67 ms   535.74 ms
Query 20 avg time:  1193.82 ms  1319.49 ms
Query 21 avg time:  1027.00 ms  1050.08 ms
Query 22 avg time:   120.03 ms   111.32 ms
```

Also,

```
cargo run --release --bin tpch -- benchmark datafusion --query 17 --iterations 5 --path ./parquet --format parquet --debug | grep -e "build_time" -e "iteration"
```

does not show any difference in join build / probe times. The full log is attached: [mem_mgmt_benches.txt](https://github.com/apache/arrow-datafusion/files/10918984/mem_mgmt_benches.txt)

For now, I tend to think that these changes do not affect performance significantly, since the reservations and memory estimations are executed only once per partition/query, but I'll be happy to perform any additional checks.

> Some reasons I defaulted to initializing the hashmap using the size of the left side is as following

All of them look pretty reasonable, and I agree with this approach; I was just trying to explain why I decided to duplicate calculations from some functions subsequently called by `RawTable::with_capacity`.
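For reference, here is roughly the kind of calculation I mean -- a minimal sketch, not the actual PR code: `estimate_hashtable_size` is a hypothetical helper, and the constants mirror hashbrown's 7/8 load factor and power-of-two bucket rounding as I understand them (the real `capacity_to_buckets` is private, hence the duplication):

```rust
use std::mem::size_of;

/// Hypothetical helper approximating the bucket-count math hashbrown
/// performs internally for `RawTable::with_capacity`.
fn estimate_hashtable_size<T>(num_entries: usize) -> usize {
    // hashbrown keeps the load factor below 7/8, then rounds the bucket
    // count up to a power of two; very small tables are special-cased.
    let buckets = if num_entries < 8 {
        8
    } else {
        (num_entries * 8 / 7).next_power_of_two()
    };
    // One `T` slot per bucket plus one control byte per bucket (alignment
    // padding and the trailing control group are ignored for simplicity).
    buckets * size_of::<T>() + buckets
}

fn main() {
    // e.g. a build-side table of 1M (hash, row offset) pairs
    println!("~{} bytes", estimate_hashtable_size::<(u64, u64)>(1_000_000));
}
```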
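And on the "once per partition/query" point above, a minimal sketch of the reservation pattern, assuming the `MemoryConsumer` / `MemoryReservation` API from `datafusion_execution::memory_pool` (the consumer name, pool limit, and size estimate here are made up):

```rust
use std::sync::Arc;

use datafusion_execution::memory_pool::{GreedyMemoryPool, MemoryConsumer, MemoryPool};

fn main() -> datafusion_common::Result<()> {
    // Pool shared by the whole query, capped at 1 MiB for the example.
    let pool: Arc<dyn MemoryPool> = Arc::new(GreedyMemoryPool::new(1024 * 1024));

    // One consumer per partition; the estimated build-side size is reserved
    // once, up front, rather than on every probed batch.
    let mut reservation = MemoryConsumer::new("HashJoinInput[0]").register(&pool);
    let estimated_build_side = 64 * 1024; // hypothetical size estimate
    reservation.try_grow(estimated_build_side)?;

    // ... build and probe the hash table; the reservation frees on drop.
    Ok(())
}
```

Since the reservation is grown a single time with the build-side estimate, there is no per-batch accounting on the probe side, which is consistent with the flat build/probe timings above.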
