Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/22138
@koeninger
I'm not sure I got your point correctly. This patch is based on some
assumptions, so please correct me if I'm missing something. The assumptions are:
1. No two consumers actually work against the same cache key at the same
time. The cache key contains the topic partition as well as the group id. Even
if a query does a self-join and therefore reads the same topic from two
different sources, I think the group ids should be different.
2. In the normal case the offsets are continuous, which is why the cache
should help. In the retry case this patch invalidates the cache, matching the
current behavior, so reading starts from scratch.
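
To make assumption 1 concrete, here is a minimal sketch of the cache-key idea: the key combines the consumer group id with the topic partition, so two sources reading the same topic under different group ids map to different pooled consumers and never collide. The class and field names below are illustrative, not the actual Spark internals.

```java
import java.util.Objects;

// Hypothetical cache key: group id + topic partition. Two sources doing a
// self-join on the same topic get different group ids, hence different keys.
final class CacheKey {
    final String groupId;
    final String topic;
    final int partition;

    CacheKey(String groupId, String topic, int partition) {
        this.groupId = groupId;
        this.topic = topic;
        this.partition = partition;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) return false;
        CacheKey k = (CacheKey) o;
        return partition == k.partition
            && groupId.equals(k.groupId)
            && topic.equals(k.topic);
    }

    @Override
    public int hashCode() {
        return Objects.hash(groupId, topic, partition);
    }
}
```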
(Btw, I'm curious which is more expensive: `leveraging the pooled object
but resetting the Kafka consumer` vs. `invalidating the pooled object and
starting from scratch`. The latter feels safer, but if a reset only needs an
extra seek instead of reconnecting to Kafka, the reset path could be improved
and the former would be cheaper. That feels out of scope for this PR, though.)
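
A toy model (not the real Kafka client or Spark code) of the two recovery paths being weighed above: resetting the pooled consumer with a seek versus invalidating it and reconnecting from scratch. The cost values are purely illustrative assumptions; which path is actually cheaper is exactly the open question.

```java
// Toy pooled consumer that only tracks which recovery path was taken.
class PooledConsumer {
    long position = 0;
    int seeks = 0;       // cheap path: reuse the connection, move the offset
    int reconnects = 0;  // expensive path: tear down and rebuild the consumer

    // Option A: keep the pooled object and just reposition it.
    void resetBySeek(long offset) {
        seeks++;
        position = offset;
    }

    // Option B: invalidate the pooled object and start from scratch,
    // paying a full reconnect before repositioning.
    void invalidateAndRecreate(long offset) {
        reconnects++;
        position = offset;
    }
}
```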
This patch keeps most of the current behavior, except in two spots, I
think. I already commented on one spot explaining why I changed the behavior,
and I'll leave a similar comment on the other.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]