koeninger commented on issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#issuecomment-468826169
 
 
   Not trying to beat a dead horse here, but it's not my "additional
   requirement" - the whole reason caching was changed from the original naive
   approach was concurrent consumers for the same topicpartition + group; see
   https://github.com/apache/spark/pull/20767
   
   I was under the impression that self-joins were the most likely reason that
   would happen.  I wouldn't expect the union of different streams in your
   latest test to exercise that case; are you sure it does?
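The distinction matters: with a plain one-consumer-per-key cache, two tasks reading the same topicpartition + group concurrently (e.g. a self-join) would fight over one Kafka consumer. A keyed pool can instead hand out a separate instance per concurrent borrower. The sketch below illustrates that behavior in miniature; it is a hedged illustration of the pooling idea only - the class and method names are hypothetical, not Spark's or Apache Commons Pool's actual API.

```python
import threading
from collections import defaultdict

class KeyedConsumerPool:
    """Minimal sketch of a keyed pool: idle consumers are reused per key,
    but concurrent borrowers of the same key each get their own instance
    (unlike a one-entry-per-key cache)."""

    def __init__(self, consumer_factory):
        self._factory = consumer_factory
        self._idle = defaultdict(list)   # key -> list of idle consumers
        self._lock = threading.Lock()

    def borrow(self, key):
        # Reuse an idle consumer for this key if one exists,
        # otherwise create a fresh one.
        with self._lock:
            if self._idle[key]:
                return self._idle[key].pop()
        return self._factory(key)

    def give_back(self, key, consumer):
        with self._lock:
            self._idle[key].append(consumer)

# Two concurrent borrows of the same key yield distinct consumers.
pool = KeyedConsumerPool(lambda key: object())
key = ("group-1", "topic-a", 0)          # (groupId, topic, partition)
c1 = pool.borrow(key)
c2 = pool.borrow(key)                    # no idle consumer left -> new one
assert c1 is not c2
pool.give_back(key, c1)
assert pool.borrow(key) is c1            # reused once returned
```

Apache Commons Pool's `GenericKeyedObjectPool` provides this same borrow/return-per-key shape, plus the eviction and JMX statistics discussed below.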
   
   
   
   
   On Fri, Mar 1, 2019 at 3:43 PM Jungtaek Lim <[email protected]>
   wrote:
   
   > My bad. I just ran the experiment in a pretty normal situation and
   > misinterpreted it as concurrent access.
   >
   > Here's the revised one:
   > https://gist.github.com/HeartSaVioR/bf14fa6040b0f85e6ff1ee613aa7976a
   >
   >    - 20 partitions (data skew happened, so 6 partitions have no data)
   >    - 4 Kafka stream sources with slightly different offsets for the same
   >      topicpartition (to exercise concurrent access)
   >
   >
   >    - master branch
   >
   > | attempt # | max | min | mean   | median | percentile 90 | percentile 95 | percentile 99 |
   > | --------- | --- | --- | ------ | ------ | ------------- | ------------- | ------------- |
   > | 1         | 519 | 7   | 12.784 | 9.5    | 15            | 19            | 27.53         |
   > | 2         | 491 | 7   | 12.228 | 9      | 13.1          | 18.55         | 32.02         |
   > | 3         | 590 | 7   | 12.904 | 9      | 14            | 19            | 27            |
   >
   >    - this patch (default value - no tune)
   >
   > | attempt # | max | min | mean   | median | percentile 90 | percentile 95 | percentile 99 |
   > | --------- | --- | --- | ------ | ------ | ------------- | ------------- | ------------- |
   > | 1         | 539 | 7   | 12.552 | 9      | 14.1          | 18.55         | 29.04         |
   > | 2         | 610 | 7   | 12.56  | 9      | 14            | 17.55         | 27.02         |
   > | 3         | 513 | 7   | 12.74  | 9.5    | 14.1          | 18.55         | 27.55         |
   >
   > Similarly, no noticeable difference. I'm a bit curious about the result on
   > the master branch, but there are many open possibilities here (maybe master
   > also handles this correctly, or in some bad cases the extra seek and fetch
   > doesn't bring noticeable latency, or master doesn't seek even when the
   > offset is wrong, etc.). I could add some log messages as well as validation
   > of the output to investigate whether both branches handle the case
   > correctly, but given that my patch has tests covering these cases - even if
   > master handles the case correctly, this patch is at least on par with
   > master.
   >
   > Btw, I may need to correct one thing:
   >
   > That's the original case that started this whole issue, right?
   >
   > No. I didn't even mention it when I raised this PR. That was your
   > additional requirement, and I just addressed it. The original rationale for
   > this patch was that it brings:
   >
   >    - various statistics regarding the pool on the JMX side, which Apache
   >      Commons Pool provides (if my memory is right, there's an existing JIRA
   >      issue regarding this)
   >    - gets rid of the custom code for caching consumers (@tdas
   >      <https://github.com/tdas> previously left a comment in JIRA or on a PR
   >      that it would be ideal to migrate to an existing solution, and this is
   >      the one - though I still have to craft some custom code to deal with
   >      fetched data)
   >    - the ability to close idle consumers in the background (especially when
   >      a topicpartition becomes obsolete)
   >    - easier to add tests for the consumer pool individually, with support
   >      for statistics
   >
   > —
   > Reply to this email directly, or view it on GitHub:
   > <https://github.com/apache/spark/pull/22138#issuecomment-468821919>
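
The summary statistics in the benchmark tables above (max, min, mean, median, percentile 90/95/99) can be derived from raw per-batch latency samples with a small helper. The sketch below is generic and illustrative - it is not the actual script behind the linked gist, and the nearest-rank percentile method is an assumption.

```python
import statistics

def summarize(latencies_ms):
    """Compute the same summary columns as the benchmark tables from
    a list of raw latency samples (milliseconds)."""
    xs = sorted(latencies_ms)

    def pct(p):
        # Nearest-rank percentile: value at ceil(p% of n), 1-indexed.
        idx = max(0, int(round(p / 100 * len(xs))) - 1)
        return xs[idx]

    return {
        "max": xs[-1],
        "min": xs[0],
        "mean": statistics.mean(xs),
        "median": statistics.median(xs),
        "p90": pct(90),
        "p95": pct(95),
        "p99": pct(99),
    }

# Illustrative sample with one large outlier, like the tables above.
stats = summarize([7, 8, 9, 9, 10, 12, 15, 19, 27, 519])
assert stats["min"] == 7 and stats["max"] == 519
assert stats["median"] == 11.0
```

Note how a single slow fetch dominates the max column while barely moving the median - which is why the tables report percentiles rather than max alone.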
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
