Github user HeartSaVioR commented on the issue:
https://github.com/apache/spark/pull/22138
@koeninger
I'm not sure I follow — are you saying that a single executor handles multiple
queries (multiple jobs) concurrently? I honestly hadn't noticed that. If that is
going to be a problem, we should add something (could we get the query id at
that point?) to the cache key to differentiate consumers. If we want to avoid
extra seeking due to different offsets, consumers should not be reused across
multiple queries, and that's just a matter of the cache key.
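To illustrate the cache-key idea, something like the sketch below could work — note this `CacheKey` class and its fields are hypothetical stand-ins for illustration, not the actual key type used in the connector:

```java
import java.util.Objects;

// Hypothetical cache key: including queryId means consumers are never
// shared across queries, so each query keeps its own fetch position.
final class CacheKey {
    final String queryId;   // assumed to be obtainable at consumer-acquire time
    final String topic;
    final int partition;

    CacheKey(String queryId, String topic, int partition) {
        this.queryId = queryId;
        this.topic = topic;
        this.partition = partition;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) return false;
        CacheKey k = (CacheKey) o;
        return partition == k.partition
            && queryId.equals(k.queryId)
            && topic.equals(k.topic);
    }

    @Override
    public int hashCode() {
        return Objects.hash(queryId, topic, partition);
    }
}
```

With a key like this, two queries reading the same topic-partition would map to distinct pool entries instead of fighting over one consumer's position.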
If you are thinking about sharing consumers across multiple queries in order
to reuse the connection to Kafka, I think extra seeking is unavoidable (and I
guess stale fetched data would be the more critical issue, unless we never
reuse it after returning the consumer to the pool). If seeking is a light
operation, we could even go with reusing only the connection (not the position
we already sought to): always resetting the position (and maybe the fetched
data?) when borrowing a consumer from the pool or returning it.
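To make the "reuse only the connection" idea concrete, here is a rough sketch of resetting the position on borrow and dropping it on return. `PooledConsumer` and `ConsumerPool` are made-up stand-ins for illustration, not the real connector or Kafka client classes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for a consumer whose connection we want to reuse but whose
// position we deliberately do not trust across borrowers.
class PooledConsumer {
    long position = -1;   // last sought offset; -1 means "unknown"

    void seek(long offset) { position = offset; }
}

// Minimal pool: the caller's requested offset is always re-sought on
// borrow, so a consumer left behind by another query can't leak its
// old position into the next query.
class ConsumerPool {
    private final Deque<PooledConsumer> idle = new ArrayDeque<>();

    PooledConsumer borrow(long requestedOffset) {
        PooledConsumer c = idle.isEmpty() ? new PooledConsumer() : idle.pop();
        c.seek(requestedOffset);   // always reset position when borrowing
        return c;
    }

    void release(PooledConsumer c) {
        c.position = -1;           // drop position (and fetched data) on return
        idle.push(c);
    }
}
```

The point of the sketch is just that correctness no longer depends on which query used the consumer last — only the TCP connection is shared, never the seek state.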
Btw, the rationale of this patch is not to solve the issue you're referring
to. This patch is also based on #20767, but deals with other improvements
pointed out in the comments: adopting a pool library so we don't reinvent the
wheel, and also enabling metrics for the pool.
I'm not sure the issue you're referring to is a serious one (a show-stopper):
if it were that serious, someone should have handled it once we became aware
of it in March, or at least a relevant JIRA issue should have been filed with
a detailed explanation. I'd like to ask you to handle (or file) the issue,
since you probably know it best.