HeartSaVioR commented on pull request #31570: URL: https://github.com/apache/spark/pull/31570#issuecomment-803218591
I re-ran the performance tests I did before, and I observed an outstanding difference between my revised PR (my PR + the state format used here) vs this PR.

> revised my PR (versioned as 3.2.0-SPARK-10816-heartsavior)

https://github.com/HeartSaVioR/spark/tree/SPARK-10816-heartsavior-rebase-apply-PR-31570-versioned

> this PR (versioned as 3.2.0-PR-31570)

https://github.com/HeartSaVioR/spark/tree/PR-31570-versioned

> benchmark code

https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816

I built the benchmark code against locally installed Spark artifacts for both (that is, I built the benchmark code separately for each). It's simple: change `build.sbt` and run `sbt clean assembly`.

> machine to run benchmark

* AMD Ryzen 5600X (no overclock, 3.7 GHz to 4.6 GHz, 6 physical cores, 12 logical cores)
* DDR4 3200 MHz 16 GB * 2
* Ubuntu 20.04

Using `local[*]` showed unstable performance, so I fixed the value to 8. There aren't many physical cores, so I also reduced the number of shuffle partitions down to 5.

> plenty of rows in session

```
./bin/spark-submit --master "local[8]" --conf spark.sql.shuffle.partitions=5 --driver-memory 16g --class com.hortonworks.spark.benchmark.streaming.sessionwindow.plenty_of_rows_in_session.BenchmarkSessionWindowListenerWordCountSessionFunctionAppendMode ./iot-trucking-app-spark-structured-streaming-<version>.jar --query-status-file /tmp/a.json --rate-row-per-second 200000 --rate-ramp-up-time-second 10
```

[plenty-of-rows-in-session-append-mode-mine-rate-200000-v1.txt](https://github.com/apache/spark/files/6174672/plenty-of-rows-in-session-append-mode-mine-rate-200000-v1.txt)
[plenty-of-rows-in-session-append-mode-PR-31570-rate-200000-v1.txt](https://github.com/apache/spark/files/6174674/plenty-of-rows-in-session-append-mode-PR-31570-rate-200000-v1.txt)

* mine showed 160,000+ on processedRowsPerSecond.
* PR-31570 didn't reach 60,000 on processedRowsPerSecond.
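As a side note for anyone reproducing this: the `--query-status-file` option comes from the benchmark app, and the comparison above is based on the `processedRowsPerSecond` values it records. A minimal sketch for summarizing such a file, assuming (this is an assumption about the app's output, not a documented format) it appends one `StreamingQueryProgress`-style JSON object per line:

```python
import json

def summarize_throughput(path):
    """Average processedRowsPerSecond across all progress records.

    Assumes the query-status file contains one JSON object per line,
    each with a numeric "processedRowsPerSecond" field (as emitted by
    a StreamingQueryListener dumping StreamingQueryProgress).
    """
    rates = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            progress = json.loads(line)
            rate = progress.get("processedRowsPerSecond")
            if rate is not None:
                rates.append(rate)
    return sum(rates) / len(rates) if rates else 0.0
```

Running this over `/tmp/a.json` from each build gives a single comparable number per run instead of eyeballing individual batches.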
> plenty of keys

```
./bin/spark-submit --master "local[8]" --conf spark.sql.shuffle.partitions=5 --driver-memory 16g --class com.hortonworks.spark.benchmark.streaming.sessionwindow.plenty_of_keys.BenchmarkSessionWindowListenerWordCountSessionFunctionAppendMode ./iot-trucking-app-spark-structured-streaming-<version>.jar --query-status-file /tmp/b.json --rate-row-per-second 12000000 --rate-ramp-up-time-second 10
```

[plenty-of-keys-append-mode-mine-rate-12000000-v1.txt](https://github.com/apache/spark/files/6174671/plenty-of-keys-append-mode-mine-rate-12000000-v1.txt)
[plenty-of-keys-append-mode-PR-31570-rate-12000000-v1.txt](https://github.com/apache/spark/files/6174675/plenty-of-keys-append-mode-PR-31570-rate-12000000-v1.txt)

* mine showed "over" 12,000,000 on processedRowsPerSecond. (It could probably reach more if we increased the rate.)
* PR-31570 didn't reach 10,000,000 on processedRowsPerSecond.

It would be appreciated if anyone reviewing this could run the performance test on their side and update the results. I'd love to see results contradicting my perf test (either my tests with a different env/config, or new tests), but if no one can disprove my result, I guess we all know we need to put our effort in the right direction.
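To put the gap in perspective, here is the rough speedup implied by the throughput figures quoted above (the exact per-batch values are in the attached logs; these are the approximate numbers from the bullet points, so the ratios are illustrative only):

```python
# Approximate processedRowsPerSecond figures quoted in the comment above
# (not exact log output).
plenty_of_rows = {"SPARK-10816-heartsavior": 160_000, "PR-31570": 60_000}
plenty_of_keys = {"SPARK-10816-heartsavior": 12_000_000, "PR-31570": 10_000_000}

def speedup(results):
    """Throughput ratio of the revised PR over PR-31570."""
    return results["SPARK-10816-heartsavior"] / results["PR-31570"]

print(f"plenty-of-rows speedup: {speedup(plenty_of_rows):.1f}x")  # ~2.7x
print(f"plenty-of-keys speedup: {speedup(plenty_of_keys):.1f}x")  # ~1.2x
```

So even in the less favorable workload the revised PR is ahead, and in the row-heavy workload the difference is well over 2x.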
