HeartSaVioR commented on issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer URL: https://github.com/apache/spark/pull/22138#issuecomment-468821919 My bad. I just did experiment on pretty normal situation, and misinterpreted `concurrent`. Here's revised one: https://gist.github.com/HeartSaVioR/bf14fa6040b0f85e6ff1ee613aa7976a - 20 partitions (data skew happened so 6 partitions have no data) - 4 Kafka Stream Sources which has slightly different offsets for same topicpartition (to experiment concurrent access) * master branch attempt \# | max | min | mean | median | percentile 90 | percentile 95 | percentile 99 ---------- | ---- | ---- | ----- | ------- | ------------- | ------------- | -------------- 1 | 519 | 7 | 12.784 | 9.5 | 15 | 19 | 27.53 2 | 491 | 7 | 12.228 | 9 | 13.1 | 18.55 | 32.02 3 | 590 | 7 | 12.904 | 9 | 14 | 19 | 27 * this patch (default value - no tune) attempt \# | max | min | mean | median | percentile 90 | percentile 95 | percentile 99 ---------- | ---- | ---- | ----- | ------- | ------------- | ------------- | -------------- 1 | 539 | 7 | 12.552 | 9 | 14.1 | 18.55 | 29.04 2 | 610 | 7 | 12.56 | 9 | 14 | 17.55 | 27.02 3 | 513 | 7 | 12.74 | 9.5 | 14.1 | 18.55 | 27.55 Similarly, no noticeable difference. I'd curious a bit about the result of master branch, but there are many open possibilities on this (maybe master also handles correctly, or some bad cases - extra seek and fetch doesn't bring noticeable latency, master doesn't seek even offset is wrong, etc.). I could add some log message as well as validation of output to investigate whether they're handling the case correctly, but given that my patch has test regarding these cases - even if master handles the case correctly, it's same as master, at least. Btw, I may need to correct one: > That's the original case that started this whole issue, right? No. I even didn't indicate it when I raised this PR. That was your additional requirement and I just addressed it. Original rationalization of this patch was, this patch brings: - various statistics regarding pool on JMX side which Apache Commons Pool provides (if my memory is right, there's existing JIRA issue regarding this) - gets rid of custom code for caching consumers (@tdas previously left a comment in JIRA or PR that it's ideal to migrate to existing solution, and this is the one - though I still have to craft some custom code to deal with fetched data) - able to close idle consumers in background (especially topicpartition becomes obsolete) - easier to add tests for consumer pool individually, with support to statistics
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
