Kontinuation commented on PR #14644: URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2659161217
I had another interesting observation: spilling sort can be faster than memory-unbounded sort in DataFusion. I tried running sort-tpch Q3 using this PR with https://github.com/apache/datafusion/pull/14642 cherry-picked onto it, and configured `parquet.schema_force_view_types = false` to mitigate https://github.com/apache/datafusion/issues/12136#issuecomment-2656400964.

Here are the test results obtained on a cloud instance with an `Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz` CPU:

```
$./target/release/dfbench sort-tpch --iterations 1 --path benchmarks/data/tpch_sf10 --memory-limit 1000M -q 3 -n1
Q3 iteration 0 took 93339.0 ms and returned 59986052 rows
Q3 avg time: 93339.00 ms

$./target/release/dfbench sort-tpch --iterations 1 --path benchmarks/data/tpch_sf10 --memory-limit 500M -q 3 -n1
Q3 iteration 0 took 81831.2 ms and returned 59986052 rows
Q3 avg time: 81831.18 ms

$./target/release/dfbench sort-tpch --iterations 1 --path benchmarks/data/tpch_sf10 --memory-limit 200M -q 3 -n1
Q3 iteration 0 took 77046.4 ms and returned 59986052 rows
Q3 avg time: 77046.36 ms

$./target/release/dfbench sort-tpch --iterations 1 --path benchmarks/data/tpch_sf10 -q 3 -n1
Q3 iteration 0 took 170416.1 ms and returned 59986052 rows
Q3 avg time: 170416.10 ms
```

When running without a memory limit, we merge a very large number of small sorted streams in a single pass, which seems to be bad for performance. A memory limit forces us to merge before all the batches have been ingested, so we do several smaller merges first and then a final merge to produce the result set. Coalescing batches into larger streams before merging seems to be a good idea.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
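The two merge strategies compared in the comment above can be illustrated with a small sketch. This is not DataFusion's sort implementation. It is a toy model in which each sorted stream is a `Vec<i64>`, and the names `kway_merge`, `two_level_merge`, and `fan_in` are hypothetical. It shows only the shape of the idea: either merge every run in one wide pass, or first merge groups of runs into larger intermediate runs and then merge those.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge all sorted runs in a single pass using a min-heap.
/// Each heap entry is (value, run index, offset within that run).
fn kway_merge(runs: &[Vec<i64>]) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        out.push(value);
        // Refill the heap from the run that just produced the smallest value.
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}

/// Hypothetical two-level variant: merge groups of `fan_in` runs into
/// larger intermediate runs, then merge the intermediate runs.
fn two_level_merge(runs: &[Vec<i64>], fan_in: usize) -> Vec<i64> {
    assert!(fan_in > 0, "fan_in must be positive");
    let intermediate: Vec<Vec<i64>> = runs.chunks(fan_in).map(kway_merge).collect();
    kway_merge(&intermediate)
}
```

Both functions return the same sorted output; they differ only in how many streams take part in each merge. Whether the two-level form is actually faster depends on the engine and the data, so the timings above remain the evidence and this sketch only shows the structure.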
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org