HeartSaVioR commented on issue #22138: [SPARK-25151][SS] Apply Apache Commons 
Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#issuecomment-468821919
 
 
   My bad. I just did experiment on pretty normal situation, and misinterpreted 
`concurrent`. 
   
   Here's revised one:
   https://gist.github.com/HeartSaVioR/bf14fa6040b0f85e6ff1ee613aa7976a
   
   - 20 partitions (data skew happened so 6 partitions have no data)
   - 4 Kafka Stream Sources which has slightly different offsets for same 
topicpartition (to experiment concurrent access)
   
   * master branch
   
   attempt \# | max | min | mean | median | percentile 90 | percentile 95 | 
percentile 99
   ---------- | ---- | ---- | ----- | ------- | ------------- | ------------- | 
--------------
   1 | 519 | 7 | 12.784 | 9.5 | 15 | 19 | 27.53
   2 | 491 | 7 | 12.228 | 9 | 13.1 | 18.55 | 32.02
   3 | 590 | 7 | 12.904 | 9 | 14 | 19 | 27
   
   * this patch (default value - no tune)
   
   attempt \# | max | min | mean | median | percentile 90 | percentile 95 | 
percentile 99
   ---------- | ---- | ---- |  ----- | ------- | ------------- | ------------- 
| --------------
   1 | 539 | 7 | 12.552 | 9 | 14.1 | 18.55 | 29.04
   2 | 610 | 7 | 12.56 | 9 | 14 | 17.55 | 27.02
   3 | 513 | 7 | 12.74 | 9.5 | 14.1 | 18.55 | 27.55
   
   Similarly, no noticeable difference. I'd curious a bit about the result of 
master branch, but there are many open possibilities on this (maybe master also 
handles correctly, or some bad cases - extra seek and fetch doesn't bring 
noticeable latency, master doesn't seek even offset is wrong, etc.). I could 
add some log message as well as validation of output to investigate whether 
they're handling the case correctly, but given that my patch has test regarding 
these cases - even if master handles the case correctly, it's same as master, 
at least.
   
   Btw, I may need to correct one:
   
   > That's the original case that started this whole issue, right?
   
   No. I even didn't indicate it when I raised this PR. That was your 
additional requirement and I just addressed it. Original rationalization of 
this patch was, this patch brings: 
   
   - various statistics regarding pool on JMX side which Apache Commons Pool 
provides (if my memory is right, there's existing JIRA issue regarding this)
   
   - gets rid of custom code for caching consumers
   (@tdas previously left a comment in JIRA or PR that it's ideal to migrate to 
existing solution, and this is the one - though I still have to craft some 
custom code to deal with fetched data)
   
   - able to close idle consumers in background (especially topicpartition 
becomes obsolete)
   
   - easier to add tests for consumer pool individually, with support to 
statistics
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to