[ https://issues.apache.org/jira/browse/KAFKA-9953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147635#comment-17147635 ]
Joost van de Wijgerd commented on KAFKA-9953:
---------------------------------------------

Hi [~bchen225242],

I agree that we are not using the recommended usage pattern. The issue, however, is that the pattern we are using is fully functional (we have been running it in production for 9 months now), but because the TransactionManager only supports one GroupCoordinator it keeps 'flipping' between group coordinators, and with the default retry backoff a 100ms time penalty is incurred every time this happens (since we discovered this issue we have set the backoff to 0ms). I have actually patched kafka-clients 2.5.0 with my fix and we are currently running it in production with no issues whatsoever.

As to your point about consumer group rebalancing: correct me if I am wrong, but I think this has no impact on the location of the ConsumerGroupCoordinator on the broker. My fix merely keeps track of which broker hosts a given ConsumerGroupCoordinator, so I don't see how this would be an issue. I do agree with you that if you use the many -> one mapping you cannot/should not use automatic rebalancing. We are indeed using our own assignment strategy, because we want partitions of different topics with the same ordinal to map to the same application instance. If we let Kafka do the allocation, this pattern would indeed not work correctly.
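To make the pattern concrete, here is a simplified sketch of what our polling/commit loop looks like (the class name, topic names, group ids, partition assignment and configs are all illustrative, not our actual code; error handling and abortTransaction are omitted for brevity):

{code:java}
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

public class ManyGroupsOneProducerSketch {

    public static void main(String[] args) {
        // One transactional producer shared by all consumers.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-app-instance-0");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
        producer.initTransactions();

        // Several consumers, each with its own group.id; partitions are assigned
        // manually so that the same ordinal of every topic lands on this instance.
        Map<String, KafkaConsumer<String, String>> consumersByGroup = new LinkedHashMap<>();
        for (String group : Arrays.asList("group-a", "group-b", "group-c")) {
            Properties consumerProps = new Properties();
            consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, group);
            consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
            consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
            consumer.assign(Collections.singletonList(new TopicPartition("topic-" + group, 0)));
            consumersByGroup.put(group, consumer);
        }

        // All consumers are polled by the same thread; each batch is processed in
        // one transaction on the shared producer.
        while (true) {
            for (Map.Entry<String, KafkaConsumer<String, String>> entry : consumersByGroup.entrySet()) {
                ConsumerRecords<String, String> records = entry.getValue().poll(Duration.ofMillis(10));
                if (records.isEmpty()) {
                    continue;
                }
                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
                }
                // The group id differs per consumer, so consecutive transactions may
                // need a different group coordinator for the offset commit.
                producer.sendOffsetsToTransaction(offsets, entry.getKey());
                producer.commitTransaction();
            }
        }
    }
}
{code}

The relevant part is the last two calls: every sendOffsetsToTransaction() targets a different group.id, so when those groups are hosted by different brokers the single cached coordinator in the TransactionManager is what causes the 'flipping' described above.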
I am sticking to my standpoint that implementing this improvement does not hurt the current recommended pattern at all, while it does support the many-to-one pattern in a performant way. I don't think you have to update your documentation for this, unless you want to specifically point this out to your users.

If you decide not to implement this improvement, I would opt to log a WARN message that alerts the developer to this issue so they can fix the problem in an early stage of development. (Currently there is only an INFO message when a new ConsumerGroupCoordinator is found; this was my only clue to finding the problem, and unfortunately it came after we had already built our framework around the many consumers -> one producer concept.)

To answer your question: switching properly to the one-to-one consumer/producer mapping would be a big change for us. Pairing extra producers with our existing consumers should be a lot easier, but we would essentially be using them to implement the map of ConsumerGroupCoordinators, so for me it is then a better option to run with a patched kafka-clients library. In the long run, however, that is not very sustainable either.

Best Regards,

Joost

> support multiple consumerGroupCoordinators in TransactionManager
> -----------------------------------------------------------------
>
>                 Key: KAFKA-9953
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9953
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 2.5.0
>            Reporter: Joost van de Wijgerd
>            Priority: Major
>         Attachments: KAFKA-9953.patch
>
> We are using kafka with a transactional producer and have the following use case: 3 KafkaConsumers (each with their own ConsumerGroup) polled by the same thread, and 1 transactional kafka producer.
> When we add the offsets to the transaction we run into the following problem: the TransactionManager only keeps track of 1 consumerGroupCoordinator, but some consumerGroupCoordinators can be on another node. As a result we constantly see the TransactionManager switching between nodes, which has the overhead of 1 failing _TxnOffsetCommitRequest_ and 1 unnecessary _FindCoordinatorRequest_.
> Also, with _retry.backoff.ms_ set to 100 by default, this causes a pause of 100ms for every other transaction (depending on which KafkaConsumer triggered the transaction, of course).
> If the TransactionManager could keep track of coordinator nodes per consumerGroupId this problem would be solved.
> I already have a patch for this but still need to test it. I will add it to the ticket when that is done.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)