HeartSaVioR commented on issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#issuecomment-468694114

@koeninger I've just run some experiments on my local dev environment.

The source topic has around 1,100,000 records in 10 partitions. Records are generated by the utility below:
https://github.com/HeartSaVioR/sam-trucking-data-utils/blob/fix-for-spark-structured-streaming/README.md
(Each query pulls exactly the same records, so this should not be a big deal.)

The test code is here:
https://gist.github.com/HeartSaVioR/74c7e78e5901b1974ccc400502fb6af2

Each query fetched 5000 records per batch and ran 221 batches.

The query status file is parsed with the command below (requires jq and datamash):

```
cat experiment-SPARK-25151-master-query-v1.log | grep "addBatch" | jq '. | {addBatch: .durationMs.addBatch}' | grep "addBatch" | awk -F " " '{print $2}' | datamash max 1 min 1 mean 1 median 1 perc:90 1 perc:95 1 perc:99 1
```

All values below are `addBatch` durations in milliseconds.

* master branch

attempt \# | max | min | mean | median | percentile 90 | percentile 95 | percentile 99
---------- | ---- | ---- | ----- | ------- | ------------- | ------------- | --------------
1 | 449 | 4 | 7.6981981981982 | 5 | 7 | 11 | 21.37
2 | 490 | 4 | 7.8198198198198 | 5 | 8 | 10.95 | 19.37
3 | 442 | 4 | 7.4324324324324 | 5 | 8 | 10 | 16.16

* this patch

attempt \# | max | min | mean | median | percentile 90 | percentile 95 | percentile 99
---------- | ---- | ---- | ----- | ------- | ------------- | ------------- | --------------
1 | 501 | 4 | 7.8513513513514 | 5 | 7.9 | 9.95 | 18.79
2 | 411 | 4 | 7.4054054054054 | 5 | 8 | 9 | 16.37
3 | 431 | 3 | 7.5630630630631 | 5 | 8 | 11 | 16

Based on this output I would not say this patch is faster than current master (it does show better numbers for percentiles 95 and 99, though), but I would say it brings no performance regression while bringing a bug fix and improvements.
As you can see, these numbers don't contribute much to the overall latency per batch: once we confirm that the cache logic works well, it is not on the critical path and the overhead of retrieving a consumer is negligible. Could this result persuade you to review the patch? Or do you want to tune the parameters in the test environment?
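
For anyone without jq and datamash installed, the summary step of the shell pipeline in this comment could be approximated in Python. This is only a sketch under assumptions: it assumes the query status log holds one JSON progress object per line with a `durationMs.addBatch` field, and the helper names (`add_batch_durations`, `percentile`, `summarize`) are made up for illustration. The percentile uses plain linear interpolation, which may not match datamash's `perc` output digit for digit.

```python
import json
import math
import statistics


def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in [0, 100]) over a pre-sorted list."""
    if not sorted_values:
        raise ValueError("no values to summarize")
    rank = (len(sorted_values) - 1) * p / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (rank - lo)


def add_batch_durations(lines):
    """Collect durationMs.addBatch from lines that each hold one JSON object.

    Assumption: one progress object per line; anything else is skipped.
    """
    durations = []
    for line in lines:
        line = line.strip()
        if "addBatch" not in line:
            continue
        try:
            progress = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(progress, dict):
            continue
        duration_ms = progress.get("durationMs")
        if isinstance(duration_ms, dict) and "addBatch" in duration_ms:
            durations.append(duration_ms["addBatch"])
    return durations


def summarize(durations):
    """Return the same columns as the tables: max, min, mean, median, p90/p95/p99."""
    values = sorted(durations)
    return {
        "max": values[-1],
        "min": values[0],
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": percentile(values, 90),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }
```

A possible usage, with the log file name taken from the command in this comment: `summarize(add_batch_durations(open("experiment-SPARK-25151-master-query-v1.log")))`.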
