HeartSaVioR commented on issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#issuecomment-468694114
 
 
@koeninger
I've just run some experiments on my local dev machine.

The source topic has around 1,100,000 records across 10 partitions.
   
Records were generated with the utility below:

https://github.com/HeartSaVioR/sam-trucking-data-utils/blob/fix-for-spark-structured-streaming/README.md

(Each query pulls exactly the same records, so the generator itself shouldn't matter much.)
   
The test code is here:
https://gist.github.com/HeartSaVioR/74c7e78e5901b1974ccc400502fb6af2
   
Each query fetched 5,000 records per batch, which came to 221 batches per run. A minimal sketch of the kind of query involved is shown below.
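For reference, here is a rough sketch of the shape of such a query. This is my illustration, not the actual gist code: the broker address, topic name, sink format, and checkpoint path are all assumptions.

```scala
import org.apache.spark.sql.SparkSession

object Spark25151Benchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-25151-benchmark")
      .getOrCreate()

    // maxOffsetsPerTrigger caps the total records per micro-batch, so
    // ~1,100,000 records / 5,000 per batch gives the 221 batches above.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "trucking-events")              // assumed topic name
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 5000L)
      .load()

    // A cheap sink keeps the write side small, so addBatch time mostly
    // reflects the Kafka read (consumer retrieval) path; the gist may use
    // a different sink than the no-op one assumed here.
    val query = df.writeStream
      .format("noop")
      .option("checkpointLocation", "/tmp/spark-25151-ckpt") // assumed path
      .start()

    query.awaitTermination()
  }
}
```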
   
The query status file is parsed with the command below (requires `jq` and `datamash`):
   
```
grep "addBatch" experiment-SPARK-25151-master-query-v1.log \
  | jq '. | {addBatch: .durationMs.addBatch}' \
  | grep "addBatch" \
  | awk -F " " '{print $2}' \
  | datamash max 1 min 1 mean 1 median 1 perc:90 1 perc:95 1 perc:99 1
```
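(For anyone reproducing this without `jq`/`datamash` log scraping: the same per-batch `addBatch` numbers can be captured in-process with a `StreamingQueryListener`. A rough sketch, assuming you dump one number per line and aggregate afterwards:)

```scala
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Prints the per-batch addBatch duration (ms), one per line, so the output
// can be piped straight into datamash without the jq step.
class AddBatchDurationListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val d = event.progress.durationMs.get("addBatch")
    if (d != null) println(d)
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Registration, given an active SparkSession named `spark`:
// spark.streams.addListener(new AddBatchDurationListener)
```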
   
* master branch (addBatch duration in ms)

attempt # | max | min | mean | median | percentile 90 | percentile 95 | percentile 99
--- | --- | --- | --- | --- | --- | --- | ---
1 | 449 | 4 | 7.698 | 5 | 7 | 11 | 21.37
2 | 490 | 4 | 7.820 | 5 | 8 | 10.95 | 19.37
3 | 442 | 4 | 7.432 | 5 | 8 | 10 | 16.16
   
* this patch (addBatch duration in ms)

attempt # | max | min | mean | median | percentile 90 | percentile 95 | percentile 99
--- | --- | --- | --- | --- | --- | --- | ---
1 | 501 | 4 | 7.851 | 5 | 7.9 | 9.95 | 18.79
2 | 411 | 4 | 7.405 | 5 | 8 | 9 | 16.37
3 | 431 | 3 | 7.563 | 5 | 8 | 11 | 16
   
Based on this output I wouldn't claim the patch is faster than the current code (though it does show better numbers at the 95th and 99th percentiles), but I would say it introduces no performance regression, while bringing a bugfix and improvements.

You can also see that consumer retrieval contributes little to the overall per-batch latency: as long as the cache logic works well, it's off the critical path, and the overhead of retrieving a consumer is negligible.
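For context on why that retrieval overhead stays small: the idea is to key consumers and borrow/return them through Commons Pool rather than recreating them. A simplified sketch of that pattern with commons-pool2 follows; the types and the `KafkaConsumerLike` placeholder are mine for illustration, not the actual patch code.

```scala
import org.apache.commons.pool2.{BaseKeyedPooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericKeyedObjectPool}

// Placeholder standing in for the real KafkaDataConsumer.
class KafkaConsumerLike(val topicPartition: String) {
  def close(): Unit = ()
}

// Factory Commons Pool calls to create/destroy pooled consumers per key.
class ConsumerFactory
    extends BaseKeyedPooledObjectFactory[String, KafkaConsumerLike] {
  override def create(key: String): KafkaConsumerLike =
    new KafkaConsumerLike(key)

  override def wrap(value: KafkaConsumerLike): PooledObject[KafkaConsumerLike] =
    new DefaultPooledObject(value)

  override def destroyObject(
      key: String, p: PooledObject[KafkaConsumerLike]): Unit =
    p.getObject.close()
}

object PoolSketch {
  val pool = new GenericKeyedObjectPool(new ConsumerFactory)

  def withConsumer[T](key: String)(f: KafkaConsumerLike => T): T = {
    // borrowObject reuses an idle consumer for this key or creates one;
    // this cheap lookup is what stays off the critical path above.
    val consumer = pool.borrowObject(key)
    try f(consumer) finally pool.returnObject(key, consumer)
  }
}
```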
   
Could these results persuade you to review the patch, or would you like me to tune the parameters in a test environment first?
