A. Sophie Blee-Goldman created KAFKA-12477:
----------------------------------------------
Summary: Smart rebalancing with dynamic protocol selection
Key: KAFKA-12477
URL: https://issues.apache.org/jira/browse/KAFKA-12477
Project: Kafka
Issue Type: Improvement
Components: consumer
Reporter: A. Sophie Blee-Goldman
Fix For: 3.0.0
Users who want to upgrade their applications and enable the COOPERATIVE
rebalancing protocol in their consumer apps are required to follow a double
rolling bounce upgrade path. The reason for this is laid out in the [Consumer
Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer]
section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing
protocol in its constructor based on the list of supported partition assignors.
The protocol is selected as the highest protocol that is commonly supported by
all assignors in the list, and never changes after that.
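
The selection rule above can be sketched roughly as follows. This is an illustrative model only, not the actual ConsumerCoordinator code: the `ProtocolSelection` class, the `Assignor` type, and the `select` method are hypothetical names made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class ProtocolSelection {
    // Declared lowest to highest, so natural enum ordering gives the "highest" protocol.
    enum RebalanceProtocol { EAGER, COOPERATIVE }

    // Hypothetical stand-in for a partition assignor: a name plus the protocols it supports.
    static final class Assignor {
        final String name;
        final Set<RebalanceProtocol> supported;

        Assignor(String name, RebalanceProtocol first, RebalanceProtocol... rest) {
            this.name = name;
            this.supported = EnumSet.of(first, rest);
        }
    }

    // Picks the highest protocol that every configured assignor supports.
    static RebalanceProtocol select(List<Assignor> assignors) {
        if (assignors.isEmpty()) {
            throw new IllegalArgumentException("at least one assignor is required");
        }
        Set<RebalanceProtocol> common = EnumSet.allOf(RebalanceProtocol.class);
        for (Assignor assignor : assignors) {
            common.retainAll(assignor.supported);
        }
        if (common.isEmpty()) {
            throw new IllegalArgumentException("assignors share no common protocol");
        }
        return Collections.max(common);
    }
}
```

With a list of "cooperative-sticky" (EAGER and COOPERATIVE) and "range" (EAGER only), this sketch returns EAGER; with "cooperative-sticky" alone it returns COOPERATIVE.
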
This is a bit unfortunate because the consumer may keep using an older protocol even
after every member in the group has been updated to support the newer protocol.
After the first rolling bounce of the upgrade, all members will have two
assignors: "cooperative-sticky" and "range" (or sticky, round-robin, etc.). At
this point the EAGER protocol will still be selected due to the presence of the
"range" assignor, but it's the "cooperative-sticky" assignor that will
ultimately be selected for use in rebalances if that assignor is preferred (i.e.,
positioned first in the list). The only reason for the second rolling bounce is
to strip off the "range" assignor and allow the upgraded members to switch over
to COOPERATIVE. We can't allow them to use cooperative rebalancing until
everyone has been upgraded, but once they have, it's safe to do so.
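
For reference, the consumer configuration at each stage of the double rolling bounce might look like the sketch below. The `UpgradeConfigs` class and its method names are hypothetical. The config key and assignor class names are the standard consumer ones, and "range" is used as the example legacy assignor.

```java
import java.util.Properties;

final class UpgradeConfigs {
    static final String STRATEGY_KEY = "partition.assignment.strategy";
    static final String COOPERATIVE_STICKY =
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor";
    static final String RANGE =
        "org.apache.kafka.clients.consumer.RangeAssignor";

    // First bounce: add the cooperative assignor in the preferred (first) position,
    // but keep the old assignor so not-yet-upgraded members can still join the group.
    static Properties afterFirstBounce() {
        Properties props = new Properties();
        props.setProperty(STRATEGY_KEY, COOPERATIVE_STICKY + "," + RANGE);
        return props;
    }

    // Second bounce: strip the old assignor so the consumer selects COOPERATIVE.
    static Properties afterSecondBounce() {
        Properties props = new Properties();
        props.setProperty(STRATEGY_KEY, COOPERATIVE_STICKY);
        return props;
    }
}
```
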
There is already a way for the client to detect that everyone is on the new
bytecode: if the CooperativeStickyAssignor is selected by the group
coordinator, then that means it is supported by all consumers in the group and
therefore everyone must be upgraded.
We may be able to save the second rolling bounce by dynamically updating the
rebalancing protocol inside the ConsumerCoordinator as "the highest protocol
supported by the assignor chosen by the group coordinator". This means we'll
still be using EAGER at the first rebalance, since we of course need to wait
for this initial rebalance to get the response from the group coordinator. But
we should take the hint from the chosen assignor rather than dropping this
information on the floor and sticking with the original protocol.
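
A minimal sketch of the proposed behavior, under the assumption that the coordinator learns the chosen assignor's name from the group coordinator's response after each rebalance. `DynamicProtocolCoordinator` and `onAssignorChosen` are hypothetical names for illustration, not existing Kafka APIs.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class DynamicProtocolCoordinator {
    // Declared lowest to highest, so natural enum ordering gives the "highest" protocol.
    enum RebalanceProtocol { EAGER, COOPERATIVE }

    // Assignor name -> protocols it supports (hypothetical simplification).
    private final Map<String, Set<RebalanceProtocol>> assignors;
    private RebalanceProtocol protocol;

    DynamicProtocolCoordinator(Map<String, Set<RebalanceProtocol>> assignors) {
        if (assignors.isEmpty()) {
            throw new IllegalArgumentException("at least one assignor is required");
        }
        this.assignors = new LinkedHashMap<>(assignors);
        // Current behavior: start with the highest protocol common to all assignors.
        Set<RebalanceProtocol> common = EnumSet.allOf(RebalanceProtocol.class);
        for (Set<RebalanceProtocol> supported : this.assignors.values()) {
            common.retainAll(supported);
        }
        if (common.isEmpty()) {
            throw new IllegalArgumentException("assignors share no common protocol");
        }
        this.protocol = Collections.max(common);
    }

    // Proposed behavior: once the group coordinator has chosen an assignor,
    // switch to the highest protocol that this assignor supports.
    void onAssignorChosen(String assignorName) {
        Set<RebalanceProtocol> supported = assignors.get(assignorName);
        if (supported == null || supported.isEmpty()) {
            throw new IllegalStateException("unknown assignor: " + assignorName);
        }
        this.protocol = Collections.max(supported);
    }

    RebalanceProtocol protocol() {
        return protocol;
    }
}
```

In this model a member configured with both "cooperative-sticky" and "range" starts on EAGER, moves to COOPERATIVE once "cooperative-sticky" is chosen, and falls back to EAGER if "range" is chosen in a later rebalance (for example, because a member that has not been upgraded joins the group).
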
--
This message was sent by Atlassian Jira
(v8.3.4#803005)