UBarney commented on PR #16889: URL: https://github.com/apache/datafusion/pull/16889#issuecomment-3124214924
> > I didn't expect this PR to have such a big performance improvement. Like #16443, I still don't understand why there is a performance improvement.
>
> I’m aware of two key differences:
>
> * Fewer redundant steps for `indices <--> batches`
> * Always keeping the right batch in cache (the original implementation performs a left-chunk × right-row iteration)
>
> However I was just hoping to cleanup the codebase a bit, I also didn’t expect it to be an easy 2X.

I can't use perf to analyze cache misses in my Hyper-V VM.

<details>

```
sudo perf stat -e cycles,instructions,cache-references,cache-misses ./target/release/1_left_row_join_right_batch -c ' SELECT * FROM range(10000) AS t1 JOIN range(200000) AS t2 ON (t1.value + t2.value) % 1000 = 0;'

 Performance counter stats for './target/release/1_left_row_join_right_batch -c  SELECT * FROM range(10000) AS t1 JOIN range(200000) AS t2 ON (t1.value + t2.value) % 1000 = 0;':

   <not supported>      cycles
   <not supported>      instructions
   <not supported>      cache-references
   <not supported>      cache-misses

       0.662652693 seconds time elapsed

      11.216489000 seconds user
       0.082434000 seconds sys
```

</details>

However, using the `time` command, I discovered that the previous version had a significantly higher number of `Minor (reclaiming a frame) page faults` (3,207,160 vs 5,133) and much greater system time (19.36s vs 0.06s).
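To put the fault counts in perspective, here is a rough back-of-the-envelope sketch. It assumes every minor fault maps exactly one page of the 4096-byte size reported by `time`, which is a simplification (transparent huge pages, for example, would break it), and the helper name `bytes_faulted` is made up for illustration. Under that assumption the previous version faulted in roughly 12 GiB of fresh pages, versus roughly 20 MiB for this PR.

```rust
/// Bytes mapped by `faults` minor page faults, ASSUMING each fault maps
/// exactly one page of `page_size` bytes (huge pages would invalidate this).
fn bytes_faulted(faults: u64, page_size: u64) -> u64 {
    faults * page_size
}

fn main() {
    // "Page size (bytes)" as reported by /usr/bin/time -v.
    const PAGE_SIZE: u64 = 4096;

    // Minor page fault counts taken from the two `time` runs.
    let before = bytes_faulted(3_207_160, PAGE_SIZE);
    let after = bytes_faulted(5_133, PAGE_SIZE);

    println!("previous version: {:.2} GiB", before as f64 / (1u64 << 30) as f64);
    println!("this PR:          {:.2} MiB", after as f64 / (1u64 << 20) as f64);
    println!("fault ratio:      {:.0}x", 3_207_160f64 / 5_133f64);
}
```

Every one of those freshly faulted pages has to be zeroed by the kernel before the process can touch it, which is where system time would go if this estimate is in the right ballpark.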

<details>

```
/usr/bin/time -v ./target/release/join_limit_join_batch_size -c ' SELECT * FROM range(10000) AS t1 JOIN range(200000) AS t2 ON (t1.value + t2.value) % 1000 = 0;'
        Command being timed: "./target/release/join_limit_join_batch_size -c  SELECT * FROM range(10000) AS t1 JOIN range(200000) AS t2 ON (t1.value + t2.value) % 1000 = 0;"
        User time (seconds): 27.68
        System time (seconds): 19.36
        Percent of CPU this job got: 2058%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.28
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 23530920
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 3207160
        Voluntary context switches: 629
        Involuntary context switches: 1771
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

/usr/bin/time -v ./target/release/1_left_row_join_right_batch -c ' SELECT * FROM range(10000) AS t1 JOIN range(200000) AS t2 ON (t1.value + t2.value) % 1000 = 0;'
        Command being timed: "./target/release/1_left_row_join_right_batch -c  SELECT * FROM range(10000) AS t1 JOIN range(200000) AS t2 ON (t1.value + t2.value) % 1000 = 0;"
        User time (seconds): 11.50
        System time (seconds): 0.06
        Percent of CPU this job got: 1896%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.61
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 135744
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 5133
        Voluntary context switches: 461
        Involuntary context switches: 574
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
```

</details>

My speculation is that the previous version _**suffered from memory management overhead**_ because it needed to allocate large amounts of memory: `perf` also showed that a single kernel function, `clear_page_erms` (which zeroes out a page of memory using a fast CPU instruction), was the top CPU consumer.

```
sudo perf report --no-children
Samples: 215K of event 'cpu-clock:ppp', Event count (approx.): 53759750000
  Overhead  Command          Shared Object               Symbol
+   25.15%  tokio-runtime-w  [kernel.kallsyms]           [k] clear_page_erms
+    6.90%  tokio-runtime-w  join_limit_join_batch_size  [.] 0x0000000000df5b44
+    6.31%  swapper          [kernel.kallsyms]           [k] pv_native_safe_halt
+    4.37%  tokio-runtime-w  join_limit_join_batch_size  [.] 0x0000000002bd63e5
+    4.32%  tokio-runtime-w  join_limit_join_batch_size  [.] 0x0000000002bd63d5
+    3.81%  tokio-runtime-w  [kernel.kallsyms]           [k] _raw_spin_unlock_irqrestore
+    3.10%  tokio-runtime-w  join_limit_join_batch_size  [.] 0x0000000002bd61e7
+    2.99%  tokio-runtime-w  join_limit_join_batch_size  [.] 0x0000000000e088a4
+    2.84%  tokio-runtime-w  join_limit_join_batch_size  [.] 0x0000000002bd61bc
```
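
As an illustration of the kind of access pattern that would avoid this allocation pressure, here is a minimal sketch of a left-row × right-batch loop that reuses one small output buffer. It is not the DataFusion implementation: it works on plain `Vec<i64>` rather than Arrow batches, the names `join_left_row` and `count_matches` are hypothetical, and the input sizes in `main` are scaled down from the benchmark query so it runs quickly.

```rust
/// Join one left row against a right batch using the benchmark predicate
/// `(left + right) % 1000 = 0`. Matching right indices are written into a
/// caller-provided buffer that is cleared and reused on every call.
fn join_left_row(left: i64, right_batch: &[i64], out: &mut Vec<usize>) {
    out.clear(); // keeps the existing capacity, so most calls allocate nothing
    for (i, &r) in right_batch.iter().enumerate() {
        if (left + r) % 1000 == 0 {
            out.push(i);
        }
    }
}

/// Count all matches. The outer loop walks right batches so that one batch
/// stays hot while every left row is probed against it.
/// `batch_size` must be non-zero (`chunks` panics on 0).
fn count_matches(left: &[i64], right: &[i64], batch_size: usize) -> usize {
    let mut out = Vec::new(); // single scratch buffer for the whole join
    let mut total = 0;
    for right_batch in right.chunks(batch_size) {
        for &l in left {
            join_left_row(l, right_batch, &mut out);
            total += out.len();
        }
    }
    total
}

fn main() {
    // Scaled-down stand-ins for range(10000) and range(200000).
    let left: Vec<i64> = (0..1_000).collect();
    let right: Vec<i64> = (0..20_000).collect();
    println!("matches: {}", count_matches(&left, &right, 8192));
}
```

The working set here is bounded by one right batch plus one scratch buffer, so the process never needs the kernel to hand out (and zero) large numbers of new pages. Whether that is what actually explains the measured difference remains speculation.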