HeartSaVioR commented on pull request #31570:
URL: https://github.com/apache/spark/pull/31570#issuecomment-803218591


   I re-ran some performance tests I had done before, and I observed an outstanding difference between my revised PR (my PR + the state format used here) and this PR.
   
   > revised my PR (versioned as 3.2.0-SPARK-10816-heartsavior)
   
   
https://github.com/HeartSaVioR/spark/tree/SPARK-10816-heartsavior-rebase-apply-PR-31570-versioned
   
   > this PR (versioned as 3.2.0-PR-31570)
   
   https://github.com/HeartSaVioR/spark/tree/PR-31570-versioned
   
   > benchmark code
   
   
https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816
   
   I built the benchmark code against locally installed Spark artifacts for both (that is, I built the benchmark code separately for each).
   
   It's simple: change build.sbt and run `sbt clean assembly`.
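   
   For anyone reproducing this, the version switch amounts to pointing the Spark dependency at one of the two locally published builds. A hypothetical build.sbt fragment (the real project may name things differently) would look roughly like:
   
   ```
   // Hypothetical build.sbt fragment -- illustrative only; the actual benchmark
   // project may differ. The point is just to switch sparkVersion between the
   // two locally published builds before running `sbt clean assembly`.
   name := "iot-trucking-app-spark-structured-streaming"
   scalaVersion := "2.12.13"
   
   resolvers += Resolver.mavenLocal
   
   // e.g. "3.2.0-SPARK-10816-heartsavior" or "3.2.0-PR-31570"
   val sparkVersion = "3.2.0-SPARK-10816-heartsavior"
   
   libraryDependencies ++= Seq(
     "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
   )
   ```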
   
   > machine to run benchmark
   
   * AMD Ryzen 5600X (no overclock, 3.7 GHz to 4.6 GHz, 6 physical cores, 12 logical cores)
   * DDR4 3200 MHz 16 GB * 2
   * Ubuntu 20.04
   
   Using `local[*]` showed unstable performance, so I fixed the value to 8. There aren't many physical cores, so I also reduced the number of shuffle partitions to 5.
   
   > plenty of rows in session
   
   ```
   ./bin/spark-submit --master "local[8]" \
     --conf spark.sql.shuffle.partitions=5 \
     --driver-memory 16g \
     --class com.hortonworks.spark.benchmark.streaming.sessionwindow.plenty_of_rows_in_session.BenchmarkSessionWindowListenerWordCountSessionFunctionAppendMode \
     ./iot-trucking-app-spark-structured-streaming-<version>.jar \
     --query-status-file /tmp/a.json \
     --rate-row-per-second 200000 \
     --rate-ramp-up-time-second 10
   ```
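   
   For readers not familiar with the workload: this benchmark class runs a session-window word count over a rate source in append mode. A rough sketch of that kind of query is below; it is not the benchmark's actual code (the "SessionFunction" variant may implement sessions differently), and it assumes the session_window function proposed by SPARK-10816:
   
   ```
   // Rough sketch of a session-window word count in append mode. Column names,
   // gap duration and sink are illustrative; assumes the session_window()
   // function added by SPARK-10816, not the benchmark's exact implementation.
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions._
   
   val spark = SparkSession.builder().appName("session-window-sketch").getOrCreate()
   import spark.implicits._
   
   val words = spark.readStream
     .format("rate")
     .option("rowsPerSecond", 200000)
     .load()                                             // columns: timestamp, value
     .withColumn("word", concat(lit("word"), ($"value" % 1000).cast("string")))
   
   val sessions = words
     .withWatermark("timestamp", "10 seconds")
     .groupBy(session_window($"timestamp", "5 seconds"), $"word")
     .count()
   
   val query = sessions.writeStream
     .outputMode("append")
     .format("console")
     .start()
   
   query.awaitTermination()
   ```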
   
   
[plenty-of-rows-in-session-append-mode-mine-rate-200000-v1.txt](https://github.com/apache/spark/files/6174672/plenty-of-rows-in-session-append-mode-mine-rate-200000-v1.txt)
   
   
[plenty-of-rows-in-session-append-mode-PR-31570-rate-200000-v1.txt](https://github.com/apache/spark/files/6174674/plenty-of-rows-in-session-append-mode-PR-31570-rate-200000-v1.txt)
   
   * mine showed 160,000+ on processedRowsPerSecond.
   * PR-31570 didn't reach 60,000 on processedRowsPerSecond.
   
   > plenty of keys
   
   ```
   ./bin/spark-submit --master "local[8]" \
     --conf spark.sql.shuffle.partitions=5 \
     --driver-memory 16g \
     --class com.hortonworks.spark.benchmark.streaming.sessionwindow.plenty_of_keys.BenchmarkSessionWindowListenerWordCountSessionFunctionAppendMode \
     ./iot-trucking-app-spark-structured-streaming-<version>.jar \
     --query-status-file /tmp/b.json \
     --rate-row-per-second 12000000 \
     --rate-ramp-up-time-second 10
   ```
   
   
[plenty-of-keys-append-mode-mine-rate-12000000-v1.txt](https://github.com/apache/spark/files/6174671/plenty-of-keys-append-mode-mine-rate-12000000-v1.txt)
   
   
[plenty-of-keys-append-mode-PR-31570-rate-12000000-v1.txt](https://github.com/apache/spark/files/6174675/plenty-of-keys-append-mode-PR-31570-rate-12000000-v1.txt)
   
   * mine showed "over" 12,000,000 on processedRowsPerSecond. (It could probably reach more if we increased the rate.)
   * PR-31570 didn't reach 10,000,000 on processedRowsPerSecond.
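   
   For context on where the processedRowsPerSecond figures in the attached result files come from: they are taken from the streaming query progress. A minimal sketch of a listener that appends each progress update to a file is below; this is only my assumption about what the --query-status-file option roughly does, not the benchmark's real implementation:
   
   ```
   // Minimal sketch: capture per-batch progress (including processedRowsPerSecond)
   // via StreamingQueryListener and append it to a file. This is an assumption
   // about what --query-status-file roughly does, not the benchmark's real code.
   import java.io.{FileWriter, PrintWriter}
   import org.apache.spark.sql.streaming.StreamingQueryListener
   import org.apache.spark.sql.streaming.StreamingQueryListener._
   
   class ProgressFileListener(path: String) extends StreamingQueryListener {
     private val out = new PrintWriter(new FileWriter(path, /* append = */ true))
   
     override def onQueryStarted(event: QueryStartedEvent): Unit = ()
   
     override def onQueryProgress(event: QueryProgressEvent): Unit = {
       // progress.json includes processedRowsPerSecond among other metrics
       out.println(event.progress.json)
       out.flush()
     }
   
     override def onQueryTerminated(event: QueryTerminatedEvent): Unit = out.close()
   }
   
   // Register before starting the query, e.g.:
   // spark.streams.addListener(new ProgressFileListener("/tmp/a.json"))
   ```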
   
   It would be appreciated if anyone reviewing could run the performance test on their side and share the result. I'd love to see results contradicting my perf test (either my tests with a different env/config, or new tests), but if no one can show results contradicting mine, I guess we all know we need to put effort in the right direction.

