HeartSaVioR edited a comment on issue #25853: [SPARK-21869][SS] Apply Apache Commons Pool to Kafka producer
URL: https://github.com/apache/spark/pull/25853#issuecomment-547973185

I guess the patch should be fine as long as the change doesn't bring a considerable performance hit, since it resolves an existing issue and also provides metrics around the pool.

To look at the performance side, I've crafted benchmark code in which the query generates input rows from the rate source and writes them to the Kafka sink (producer).

https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/3d68bbd83cea4467e27c9d44e0288362d107a910

(`sbt assembly` will build the project properly, but it refers to 3.0.0-SNAPSHOT, so you should run `mvn clean install -DskipTests` against the Spark repo before building this project.)

I ran a Spark master and worker in standalone mode, and ran the commands below to run the benchmark:

> master branch (baseline: 67e1360bad)

```
./bin/spark-submit --master spark://localhost:7077 \
  --class com.hortonworks.spark.benchmark.streaming.kafka.KafkaProducerBenchmarkRunner \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-master-67e1360bad \
  iot-trucking-app-spark-structured-streaming-without-spark-sql-kafka.jar \
  --query-status-file="master-query-status-rate-950000-numpart-10-v1.log" \
  --rate-row-per-second=950000 --rate-ramp-up-time-second=60 \
  --output-mode=Append --num-partitions=10 \
  --bootstrap-servers="localhost:9092" --output-topic=spark21869
```

> this patch

```
./bin/spark-submit --master spark://localhost:7077 \
  --class com.hortonworks.spark.benchmark.streaming.kafka.KafkaProducerBenchmarkRunner \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-SPARK-21869 \
  iot-trucking-app-spark-structured-streaming-without-spark-sql-kafka.jar \
  --query-status-file="SPARK-21869-query-status-rate-950000-numpart-10-v1.log" \
  --rate-row-per-second=950000 --rate-ramp-up-time-second=60 \
  --output-mode=Append --num-partitions=10 \
  --bootstrap-servers="localhost:9092" --output-topic=spark21869
```

* `query-status-file`: file to which the streaming query status is written via a streaming query listener
* `rate-row-per-second`: target rate (rows per second)
* `rate-ramp-up-time-second`: ramp-up time, in seconds
* `num-partitions`: the number of partitions for the rate source; we don't repartition, so the sink gets the same number of tasks as this value

Assuming the streaming listener file is `SPARK-21869-new-query-status-rate-950000-numpart-10-v1.log`, the command below filters out empty batches, picks batches 70 through 130, and stores them in another file:

```
cat SPARK-21869-new-query-status-rate-950000-numpart-10-v1.log | grep -v "\"numInputRows\":0" | sed -n '70,130p' > SPARK-21869-new-query-status-rate-950000-numpart-10-v1-exclude-input-rows-0.log
```

and the command below computes summary statistics for the "addBatch" duration (`durationMs.addBatch`, in milliseconds):

```
cat SPARK-21869-new-query-status-rate-950000-numpart-10-v1-exclude-input-rows-0.log | grep "addBatch" | jq '. | {addBatch: .durationMs.addBatch}' | grep "addBatch" | awk -F " " '{print $2}' | datamash max 1 min 1 mean 1 median 1 perc:90 1 perc:95 1 perc:99 1
```

commit | trial# | max | min | mean | median | perc 90 | perc 95 | perc 99
------- | ----- | ----- | --- | ------- | ------- | --------- | ------- | --------
master | 1 | 571 | 395 | 441.98333333333 | 439 | 473.1 | 483.4 | 530.88
master | 2 | 520 | 402 | 434.49180327869 | 433 | 457 | 471 | 509.2
SPARK-21869 | 1 | 2100 | 652 | 846.65573770492 | 728 | 1033 | 1566 | 2100
SPARK-21869 | 2 | 665 | 384 | 515.98360655738 | 506 | 605 | 628 | 645.8

I don't have a dedicated machine to run the tests, so the results fluctuated a lot. You may want to run the test on a stable machine, such as an EC2 instance with the dedicated option; I just wanted to share how to test and extract the numbers.

I also experimented with other rates, but because of the fluctuation I couldn't find a rate at which one version keeps up and the other does not.
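For anyone without `jq` and `datamash` at hand, the filter-and-summarize steps above can be sketched in Python. This is only an illustration under stated assumptions, not part of the benchmark project: it assumes the status file holds one JSON progress object per line with a top-level `numInputRows` and a `durationMs.addBatch` field (which is what the `grep`/`jq` pipeline implies), and it computes percentiles by linear interpolation between closest ranks, which is assumed to match what `datamash perc` reports. The names `percentile` and `summarize_add_batch` are hypothetical.

```python
import json
import math
import statistics


def percentile(sorted_values, pct):
    """Percentile via linear interpolation between closest ranks.

    Assumption: this is the same rule `datamash perc:N` applies.
    """
    if not sorted_values:
        raise ValueError("no values to summarize")
    pos = (len(sorted_values) - 1) * pct / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def summarize_add_batch(lines, first=70, last=130):
    """Summarize durationMs.addBatch over non-empty batches first..last (1-based).

    `lines` is any iterable of strings, one JSON progress object per line
    (assumed file layout).
    """
    progress = [json.loads(line) for line in lines if line.strip()]

    # Equivalent of: grep -v '"numInputRows":0'
    non_empty = [p for p in progress if p.get("numInputRows", 0) != 0]

    # Equivalent of: sed -n '70,130p'
    selected = non_empty[first - 1:last]

    # Equivalent of: jq '.durationMs.addBatch'
    durations = sorted(
        p["durationMs"]["addBatch"]
        for p in selected
        if "addBatch" in p.get("durationMs", {})
    )
    if not durations:
        raise ValueError("no addBatch durations in the selected range")

    return {
        "max": durations[-1],
        "min": durations[0],
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "perc90": percentile(durations, 90),
        "perc95": percentile(durations, 95),
        "perc99": percentile(durations, 99),
    }
```

Passing an open file handle, for example `summarize_add_batch(open("SPARK-21869-new-query-status-rate-950000-numpart-10-v1.log"))`, would apply both steps in one go. Parsing the JSON rather than matching the `"numInputRows":0` substring avoids depending on the exact serialization of each line.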
