Barnabas Maidics created KAFKA-12798:
----------------------------------------

             Summary: Fixing MM2 rebalance timeout issue when source cluster is 
not available
                 Key: KAFKA-12798
                 URL: https://issues.apache.org/jira/browse/KAFKA-12798
             Project: Kafka
          Issue Type: Bug
          Components: mirrormaker, replication
            Reporter: Barnabas Maidics


If the network configuration of a source cluster that takes part in a 
replication flow changes (for instance, the port number changes because TLS is 
enabled or disabled), MirrorMaker2 won't update its internal configuration, 
even after a reconfiguration followed by a restart.

What happens in MirrorMaker2 after a cluster "identity" (i.e. connectivity 
config) changes:
 # MM2 driver (MirrorMaker class) starts up with the new config.
 # DistributedHerder joins a dedicated consumer group that decides which driver 
instance has control over the assignments and the configuration topic.
 # The driver caches the consumer group assignment, which indicates that it is 
the leader of the group.
 # The driver reads the configuration topic (which does not yet contain the 
new config), and starts the MM2 connectors.
 # Since the old config is invalid, the connectors cannot connect to the 
cluster anymore. MirrorSourceConnector tries to query the cluster through the 
admin client, but the queries time out after 2 minutes in total (the startup 
runs 2 tasks affecting the source cluster, each with a 1 minute timeout).
 ## In the meantime, the background heartbeat thread checks the state of the 
herder's consumer membership. The default rebalance timeout is 1 minute. 
Since the herder thread is blocked by the connector query timeouts, it 
cannot call poll on the consumer. The heartbeat thread invalidates the 
consumer membership and triggers the creation of a new consumer.
 # The herder thread finishes the connector startup, and after realizing that 
the configuration has changed, tries to update the config topic.
 ## The config topic can only be updated by the leader herder.
 ## The driver checks the group assignment to see if it is the leader.
 ## The local cache still holds the old assignment, in which the leader is 
the previous consumer, identified by its old ID.
 ## The current consumer ID of the driver does not match the cached leader ID.
 # The driver refuses to update the config topic.
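
The last steps of the sequence above can be illustrated with a minimal sketch. 
It is not the actual DistributedHerder code: the class name, method name and 
member IDs are hypothetical, and the only assumption is that leadership is 
decided by comparing the cached leader ID with the driver's current consumer ID.

```java
import java.util.Objects;

// Hypothetical illustration of the stale-cache check described above.
// None of these names come from the Kafka code base.
final class StaleLeaderCheck {

    // The driver considers itself leader only if the cached leader ID
    // equals the ID of the consumer it currently owns.
    static boolean isLeader(String cachedLeaderId, String currentMemberId) {
        return Objects.equals(cachedLeaderId, currentMemberId);
    }

    public static void main(String[] args) {
        // Assignment cached right after the driver first joined the group.
        String cachedLeaderId = "member-1";

        // Before the membership is invalidated the IDs match.
        System.out.println(isLeader(cachedLeaderId, "member-1")); // true

        // After the heartbeat thread has replaced the consumer, the driver
        // holds a new ID, but the cache was never refreshed.
        System.out.println(isLeader(cachedLeaderId, "member-2")); // false
    }
}
```

In the second case the driver is the only member of the group, yet it refuses 
to write to the config topic because the comparison fails.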

[~durban], thanks for digging deeper into this issue

*The proposed fix for this:*
The rebalance issue can be fixed by decreasing the time that we wait for tasks 
that affect the source cluster at the start of MM2. With a shorter timeout, 
the tasks affecting the source cluster won't block for too long when the Kafka 
config is outdated, so the herder will be able to update the config topic. 
This timeout is now configurable and defaults to 15 seconds (previously 1 
minute).
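
The effect of the shorter timeout can be checked with simple arithmetic. The 
sketch below assumes, as the description above suggests, that the 2 tasks wait 
one after the other and that the rebalance timeout is 1 minute; the class and 
method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper comparing the worst-case startup blocking time
// with the rebalance timeout. Not part of MirrorMaker2.
final class StartupBlockingBudget {

    // Worst case: every task waits for its full timeout, one after another.
    static Duration worstCaseBlocking(int sequentialTasks, Duration perTaskTimeout) {
        return perTaskTimeout.multipliedBy(sequentialTasks);
    }

    // The herder keeps its membership only if it is blocked for less
    // than the rebalance timeout.
    static boolean fitsWithin(Duration blocking, Duration rebalanceTimeout) {
        return blocking.compareTo(rebalanceTimeout) < 0;
    }

    public static void main(String[] args) {
        Duration rebalanceTimeout = Duration.ofMinutes(1);

        Duration before = worstCaseBlocking(2, Duration.ofMinutes(1));
        Duration after = worstCaseBlocking(2, Duration.ofSeconds(15));

        System.out.println(before + " fits: " + fitsWithin(before, rebalanceTimeout)); // PT2M fits: false
        System.out.println(after + " fits: " + fitsWithin(after, rebalanceTimeout));   // PT30S fits: true
    }
}
```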

The number of threads in the scheduler also had to be increased so that other 
tasks won't be blocked.
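
Why the thread count matters can be shown with a plain 
ScheduledExecutorService. This is only a sketch with hypothetical names, not 
the MM2 scheduler itself: one task blocks (standing in for a query against an 
unreachable cluster) and the method reports whether a second, quick task still 
gets to run within 500 ms.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demonstration, not taken from the MirrorMaker2 sources.
final class SchedulerThreadsSketch {

    // Returns true if a quick task completes while another task is blocked.
    static boolean quickTaskRunsWhileAnotherBlocks(int threads) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(threads);
        CountDownLatch blockerStarted = new CountDownLatch(1);
        CountDownLatch releaseBlocker = new CountDownLatch(1);
        CountDownLatch quickDone = new CountDownLatch(1);
        try {
            // Stand-in for a task stuck waiting on the source cluster.
            scheduler.execute(() -> {
                blockerStarted.countDown();
                try {
                    releaseBlocker.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Make sure the blocking task occupies a thread first.
            blockerStarted.await();
            scheduler.execute(quickDone::countDown);
            return quickDone.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            releaseBlocker.countDown();
            scheduler.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("1 thread:  " + quickTaskRunsWhileAnotherBlocks(1));
        System.out.println("2 threads: " + quickTaskRunsWhileAnotherBlocks(2));
    }
}
```

With a single thread the quick task has to wait behind the blocked one; with 
a second thread it can proceed independently.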

*Testing done:* 
 #  configured replication between source->target
 #  checked that the replication was working
 #  changed the source Kafka cluster broker port
 #  restarted Kafka/MirrorMaker2 and produced new messages in the replicated 
topic
 #  without the fix, MM2 kept trying to use the old Kafka configs after the 
restart and, even after a long time, couldn't replicate. After applying the 
fix, the issue was solved and replication worked.

Also tested with the same scenario, but instead of changing the port, SSL was 
turned on for the source Kafka cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
