[GitHub] [spark] HeartSaVioR edited a comment on issue #25853: [SPARK-21869][SS] Apply Apache Commons Pool to Kafka producer

GitBox Wed, 30 Oct 2019 08:46:55 -0700

HeartSaVioR edited a comment on issue #25853: [SPARK-21869][SS] Apply Apache 
Commons Pool to Kafka producer
URL: https://github.com/apache/spark/pull/25853#issuecomment-547973185
 
 
   I guess the patch would be acceptable as long as the change doesn't bring a 
considerable performance hit, since it resolves an existing issue and also 
provides metrics around the pool.
   
   To observe the performance impact, I've crafted benchmark code in which the 
query generates input rows from the rate source and writes them to the Kafka 
sink (producer).
   
   
https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/commit/3d68bbd83cea4467e27c9d44e0288362d107a910
   
   (`sbt assembly` will build the project properly, but it refers to 
3.0.0-SNAPSHOT, so you should run `mvn clean install -DskipTests` against the 
Spark repo prior to building this project.)
   
   I ran a Spark master and worker in standalone mode, and ran the commands 
below to run the benchmark:
   
   > master branch (baseline: 67e1360bad)
   
   ```
   ./bin/spark-submit --master spark://localhost:7077 \
     --class com.hortonworks.spark.benchmark.streaming.kafka.KafkaProducerBenchmarkRunner \
     --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-master-67e1360bad \
     iot-trucking-app-spark-structured-streaming-without-spark-sql-kafka.jar \
     --query-status-file="master-query-status-rate-950000-numpart-10-v1.log" \
     --rate-row-per-second=950000 --rate-ramp-up-time-second=60 --output-mode=Append \
     --num-partitions=10 --bootstrap-servers="localhost:9092" \
     --output-topic=spark21869
   ```
   
   > this patch
   
   ```
   ./bin/spark-submit --master spark://localhost:7077 \
     --class com.hortonworks.spark.benchmark.streaming.kafka.KafkaProducerBenchmarkRunner \
     --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-SPARK-21869 \
     iot-trucking-app-spark-structured-streaming-without-spark-sql-kafka.jar \
     --query-status-file="SPARK-21869-query-status-rate-950000-numpart-10-v1.log" \
     --rate-row-per-second=950000 --rate-ramp-up-time-second=60 --output-mode=Append \
     --num-partitions=10 --bootstrap-servers="localhost:9092" \
     --output-topic=spark21869
   ```
   
   * `query-status-file`: file to write streaming query status via streaming 
query listener
   * `rate-row-per-second`: target rate (rows per second)
   * `rate-ramp-up-time-second`: ramp-up time, unit: second
   * `num-partitions`: the number of partitions for the rate source - we don't 
repartition, so the number of tasks created for the sink should equal this value
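
The two invocations above differ only in the `--packages` version and the prefix of the status file name. As a minimal sketch (the helper name `spark_submit_command` and its parameters are hypothetical, not part of the benchmark project), the argument list could be generated like this so that both runs stay in sync:

```python
import shlex

# Values copied from the commands above.
RUNNER = "com.hortonworks.spark.benchmark.streaming.kafka.KafkaProducerBenchmarkRunner"
JAR = "iot-trucking-app-spark-structured-streaming-without-spark-sql-kafka.jar"


def spark_submit_command(package_version, status_prefix, rate=950000, num_partitions=10):
    """Build the spark-submit argument list for one benchmark run (hypothetical helper)."""
    status_file = f"{status_prefix}-query-status-rate-{rate}-numpart-{num_partitions}-v1.log"
    return [
        "./bin/spark-submit",
        "--master", "spark://localhost:7077",
        "--class", RUNNER,
        "--packages", f"org.apache.spark:spark-sql-kafka-0-10_2.12:{package_version}",
        JAR,
        f"--query-status-file={status_file}",
        f"--rate-row-per-second={rate}",
        "--rate-ramp-up-time-second=60",
        "--output-mode=Append",
        f"--num-partitions={num_partitions}",
        "--bootstrap-servers=localhost:9092",
        "--output-topic=spark21869",
    ]


# Print both variants as shell-quoted command lines.
for version, prefix in [("3.0.0-master-67e1360bad", "master"),
                        ("3.0.0-SPARK-21869", "SPARK-21869")]:
    print(shlex.join(spark_submit_command(version, prefix)))
```

The list form can be passed directly to `subprocess.run` when run from the Spark distribution directory.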
   
   Assuming the streaming listener file is 
`SPARK-21869-new-query-status-rate-950000-numpart-10-v1.log`, the command below 
filters out empty batches, picks batches 70 to 130 of the remaining ones, and 
stores them in another file.
   
   ```
   cat SPARK-21869-new-query-status-rate-950000-numpart-10-v1.log \
     | grep -v "\"numInputRows\":0" \
     | sed -n '70,130p' \
     > SPARK-21869-new-query-status-rate-950000-numpart-10-v1-exclude-input-rows-0.log
   ```
   
   The command below then computes summary statistics of the "addBatch" 
duration (`durationMs.addBatch`, in milliseconds):
   
   ```
   cat SPARK-21869-new-query-status-rate-950000-numpart-10-v1-exclude-input-rows-0.log \
     | grep "addBatch" \
     | jq '. | {addBatch: .durationMs.addBatch}' \
     | grep "addBatch" \
     | awk -F " " '{print $2}' \
     | datamash max 1 min 1 mean 1 median 1 perc:90 1 perc:95 1 perc:99 1
   ```
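
For machines without `jq` or `datamash`, the two shell pipelines above can be approximated in one step. This is a rough sketch, not the author's tooling: it assumes the status file holds one JSON object per line with a top-level `numInputRows` field and a `durationMs.addBatch` field, and it uses linear interpolation for percentiles, which is assumed (not verified here) to match what `datamash perc` does. The function names are hypothetical.

```python
import json
import statistics


def percentile(sorted_values, p):
    """Percentile p (0-100) of a pre-sorted list, using linear interpolation."""
    pos = (len(sorted_values) - 1) * p / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def add_batch_stats(lines, first=70, last=130):
    """Summarize durationMs.addBatch over non-empty batches first..last (1-based)."""
    progresses = [json.loads(line) for line in lines if line.strip()]
    # Same intent as: grep -v '"numInputRows":0' (checks the top-level field only)
    non_empty = [p for p in progresses if p.get("numInputRows", 0) != 0]
    # Same intent as: sed -n '70,130p'
    window = non_empty[first - 1:last]
    durations = sorted(
        p["durationMs"]["addBatch"]
        for p in window
        if "addBatch" in p.get("durationMs", {})
    )
    if not durations:
        raise ValueError("no addBatch durations in the selected range")
    return {
        "max": durations[-1],
        "min": durations[0],
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "perc90": percentile(durations, 90),
        "perc95": percentile(durations, 95),
        "perc99": percentile(durations, 99),
    }
```

Passing an open file handle for the status log, for example `add_batch_stats(open(path))`, returns the same seven statistics that the `datamash` call prints, in the same order.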
   
   commit | trial# | max | min | mean | median | perc 90 | perc 95 | perc 99
   ------- | ----- | ----- | --- | ---- | ------- | --------- | ------- | --------
   master | 1 | 571 | 395 | 441.98333333333 | 439 | 473.1 | 483.4 | 530.88
   master | 2 | 520 | 402 | 434.49180327869 | 433 | 457 | 471 | 509.2
   SPARK-21869 | 1 | 2100 | 652 | 846.65573770492 | 728 | 1033 | 1566 | 2100
   SPARK-21869 | 2 | 665 | 384 | 515.98360655738 | 506 | 605 | 628 | 645.8
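
One way to read these rows is as a relative change of the patch against master per trial. The sketch below copies the median and 99th-percentile values from the rows above, read in the order `datamash` prints them (max, min, mean, median, perc 90, perc 95, perc 99). Pairing trial 1 with trial 1 is an arbitrary choice, and `relative_change` is a hypothetical helper.

```python
# addBatch durations in ms, copied from the table rows above.
RESULTS = {
    ("master", 1): {"median": 439, "perc99": 530.88},
    ("master", 2): {"median": 433, "perc99": 509.2},
    ("SPARK-21869", 1): {"median": 728, "perc99": 2100},
    ("SPARK-21869", 2): {"median": 506, "perc99": 645.8},
}


def relative_change(baseline, candidate):
    """Change of candidate relative to baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


for trial in (1, 2):
    base = RESULTS[("master", trial)]
    patch = RESULTS[("SPARK-21869", trial)]
    for metric in ("median", "perc99"):
        change = relative_change(base[metric], patch[metric])
        print(f"trial {trial} {metric}: {change:+.1f}%")
```

The gap between the two trials of the same commit is large, so these percentages say more about run-to-run noise than about the patch itself.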
   
   I don't have a dedicated machine to run the tests, so the results 
fluctuated a lot. You may want to run the test on a stable machine, such as an 
EC2 instance with the dedicated option; I just want to share how to run the 
test and extract the numbers.
   
   I also experimented with other rates, but because of the fluctuation I 
couldn't find a rate at which one side keeps up and the other does not.


