Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/22138
@koeninger
I'm not sure I got your point correctly. This patch is based on some
assumptions, so please correct me if I'm missing something. The assumptions are:
1. No two consumers actually work against the same cache key at the same
time. The cache key contains the topic partition as well as the group id. Even
if a query does a self-join and therefore reads the same topic from two
different sources, I think the group ids should be different.
2. In the normal case the offsets are continuous, which is why the cache
should help. In the retry case this patch invalidates the cache, matching the
current behavior, so reading starts from scratch.
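
To make assumption 1 concrete, here is a minimal sketch of the cache-key idea: the key combines the consumer group id with the topic partition, so two sources reading the same topic under different group ids map to different pooled consumers and never collide. The class and field names below are illustrative, not the actual Spark internals.

```java
import java.util.Objects;

// Hypothetical cache key: group id + topic partition. Two sources doing a
// self-join on the same topic get different group ids, hence different keys.
final class CacheKey {
    final String groupId;
    final String topic;
    final int partition;

    CacheKey(String groupId, String topic, int partition) {
        this.groupId = groupId;
        this.topic = topic;
        this.partition = partition;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) return false;
        CacheKey k = (CacheKey) o;
        return partition == k.partition
            && groupId.equals(k.groupId)
            && topic.equals(k.topic);
    }

    @Override
    public int hashCode() {
        return Objects.hash(groupId, topic, partition);
    }
}
```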
(Btw, I'm curious which is more expensive: `leveraging the pooled object
but resetting the Kafka consumer` vs. `invalidating the pooled object and
starting from scratch`. The latter feels safer, but if a reset only needs an
extra seek instead of reconnecting to Kafka, the reset path could be improved
and the former would be cheaper. That feels out of scope for this PR, though.)
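
A toy model (not the real Kafka client or Spark code) of the two recovery paths being weighed above: resetting the pooled consumer with a seek versus invalidating it and reconnecting from scratch. The cost values are purely illustrative assumptions; which path is actually cheaper is exactly the open question.

```java
// Toy pooled consumer that only tracks which recovery path was taken.
class PooledConsumer {
    long position = 0;
    int seeks = 0;       // cheap path: reuse the connection, move the offset
    int reconnects = 0;  // expensive path: tear down and rebuild the consumer

    // Option A: keep the pooled object and just reposition it.
    void resetBySeek(long offset) {
        seeks++;
        position = offset;
    }

    // Option B: invalidate the pooled object and start from scratch,
    // paying a full reconnect before repositioning.
    void invalidateAndRecreate(long offset) {
        reconnects++;
        position = offset;
    }
}
```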
This patch keeps most of the current behavior, except in two spots, I
think. I already commented on one spot explaining why I changed the behavior,
and I'll leave a similar comment on the other.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]