Github user HeartSaVioR commented on the issue:

    https://github.com/apache/spark/pull/22138
  
    @koeninger 
    I'm not sure I got your point correctly. This patch is based on some 
assumptions, so please correct me if I'm missing something here. The 
assumptions are:
    
    1. There are actually no multiple consumers working on a given key at the 
same time. The cache key contains the topic partition as well as the group id. 
Even if a query does a self-join, reading the same topic from two different 
sources, the group ids should be different.
    
    2. In the normal case the offsets will be contiguous, which is why the 
cache should help. In the retry case this patch invalidates the cache, same as 
the current behavior, so it should start from scratch.
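
To make assumptions 1 and 2 concrete, here is a minimal, self-contained sketch of the pooling logic as I understand it (this is not the actual code in this PR; `ConsumerKey` and `ConsumerPool` are hypothetical names, and the real pool holds Kafka consumers rather than bare offsets):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical composite cache key: group id + topic + partition.
// Two sources reading the same topic under different group ids map to
// different keys, so they never share a pooled consumer (assumption 1).
record ConsumerKey(String groupId, String topic, int partition) {}

class ConsumerPool {
    // One cached "consumer" per key, modeled here only by the next
    // offset it expects to read.
    private final Map<ConsumerKey, Long> nextOffset = new HashMap<>();

    // Reuse the cached entry only when the requested offset continues
    // where the previous read left off (assumption 2); otherwise
    // invalidate and start from scratch, e.g. on a task retry.
    boolean acquire(ConsumerKey key, long requestedOffset, long recordsToRead) {
        Long expected = nextOffset.get(key);
        boolean reused = expected != null && expected == requestedOffset;
        if (!reused) {
            nextOffset.remove(key); // retry or non-contiguous read
        }
        nextOffset.put(key, requestedOffset + recordsToRead);
        return reused;
    }
}
```

With this keying, a self-join over the same topic lands on two different `ConsumerKey`s as long as the group ids differ, and a non-contiguous read always invalidates the cached entry.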
    
    (Btw, I'm curious which is more expensive: `leveraging a pooled object 
but resetting the Kafka consumer` vs. `invalidating pooled objects and 
starting from scratch`. The latter feels safer, but if resetting only needs 
an extra seek instead of reconnecting to Kafka, the reset path could be 
improved and the former would be cheaper. I feel that's out of scope for this 
PR though.)
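
For the record, the comparison I have in mind is roughly this (a toy cost model with made-up constants, not a measurement; the real numbers would need benchmarking against a broker):

```java
// Toy cost model for the two strategies above. The constants are
// illustrative assumptions, not measurements: reconnecting to a broker
// (TCP + metadata fetch) should cost far more than an in-place seek on
// an already-connected consumer.
class PoolStrategyCost {
    static final int CONNECT_COST = 100; // assumed reconnect cost
    static final int SEEK_COST = 1;      // assumed seek cost

    // Leverage the pooled object but reset (seek) the consumer.
    static int resetPooledConsumer() {
        return SEEK_COST;
    }

    // Invalidate the pooled object and start from scratch.
    static int invalidateAndStartFromScratch() {
        return CONNECT_COST + SEEK_COST;
    }
}
```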
    
    This patch keeps most of the current behavior, except in two spots, I 
guess. I already commented on one spot to explain why I changed the behavior, 
and I'll comment on the other for the same reason.

