[
https://issues.apache.org/jira/browse/KAFKA-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870057#comment-17870057
]
Greg Harris commented on KAFKA-17232:
-------------------------------------
Thanks [~frankvicky] for volunteering. Some thoughts I had about this ticket:
* The same bug is also present in MirrorSourceConnector, in the logic that
loads the source and target topic partitions. It just happened to surface in
the MirrorCheckpointConnector first in our case.
* We may want to avoid returning an empty config list after start() but before
the consumer groups have been loaded. Returning an empty config list will shut
down the tasks, when they could keep running unaffected. Instead, we should
throw an exception from taskConfigs() to indicate that we're not ready to
provide configs, and let the framework retry.
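
To make the second point concrete, here is a minimal sketch of the "throw
instead of returning an empty list" idea. It is not the actual connector code
and does not depend on Kafka: CheckpointConnectorSketch, NotReadyException,
onConsumerGroupsLoaded and the "groups" config key are all hypothetical names
used for illustration. In the real connector the exception type would have to
be one that the framework retries, which is an open question here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a "not ready yet, please retry" error; not a Kafka class.
class NotReadyException extends RuntimeException {
    NotReadyException(String message) {
        super(message);
    }
}

// Hypothetical, simplified connector; it does not extend any Kafka Connect class.
class CheckpointConnectorSketch {
    // null means "initial load has not completed"; an empty set means "loaded, no groups".
    private volatile Set<String> knownConsumerGroups = null;

    // Called by the (hypothetical) background load once the groups are known.
    void onConsumerGroupsLoaded(Set<String> groups) {
        knownConsumerGroups = new TreeSet<>(groups);
    }

    List<Map<String, String>> taskConfigs(int maxTasks) {
        Set<String> groups = knownConsumerGroups;
        if (groups == null) {
            // Distinguish "not loaded yet" from "loaded and empty" so that
            // existing tasks are not shut down by an empty config list.
            throw new NotReadyException("Consumer groups have not been loaded yet");
        }
        if (groups.isEmpty() || maxTasks <= 0) {
            return List.of();
        }
        // Spread the groups round-robin over at most maxTasks tasks.
        int numTasks = Math.min(maxTasks, groups.size());
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        int next = 0;
        for (String group : groups) {
            buckets.get(next++ % numTasks).add(group);
        }
        List<Map<String, String>> configs = new ArrayList<>();
        for (List<String> bucket : buckets) {
            // "groups" is a hypothetical config key for this sketch.
            configs.add(Map.of("groups", String.join(",", bucket)));
        }
        return configs;
    }
}
```

The important design choice is the three-way state: "not loaded" throws,
"loaded and empty" returns an empty list, and "loaded" returns real configs.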
> MirrorCheckpointConnector does not generate task configs if initial consumer
> group load times out
> -------------------------------------------------------------------------------------------------
>
> Key: KAFKA-17232
> URL: https://issues.apache.org/jira/browse/KAFKA-17232
> Project: Kafka
> Issue Type: Bug
> Components: mirrormaker
> Affects Versions: 3.9.0
> Reporter: Greg Harris
> Priority: Major
>
> The MirrorCheckpointConnector has two operations that read the source
> consumer groups:
> * loadInitialConsumerGroups
> * refreshConsumerGroups
> loadInitialConsumerGroups blocks the start() method of the connector, while
> refreshConsumerGroups is asynchronous and runs periodically while the
> connector is running.
> loadInitialConsumerGroups may take a long time to execute, and may exceed the
> configured "admin.timeout.ms" used by the Scheduler. This timeout is logged
> and the start() method returns normally. If this happens, the framework will
> generate task configs immediately after start(), before
> loadInitialConsumerGroups can finish, and will generate an empty set of task
> configs:
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L118-L121].
> Later, when loadInitialConsumerGroups completes, it will not request task
> reconfiguration, believing it is the initial load operation.
> Later still, when refreshConsumerGroups completes, it will not request task
> reconfiguration, as the set of consumer groups has not changed since the
> initial load:
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L173-L180]
>
> This leads to a situation where the MirrorCheckpointConnector believes it has
> converged with nothing to update, but actually has consumer groups that are
> not allocated to tasks.
> This happens particularly for large, stable Kafka clusters with many consumer
> groups that are not being actively created or deleted.
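
The sequence in the description above can be reproduced with a small
stand-alone model. This is only a sketch of the described behavior, not the
connector's real code: ReconfigurationBugSketch and all of its members are
hypothetical, and each consumer group is simply counted as one task.

```java
import java.util.Set;

// Hypothetical model of the behavior described in the issue; not Kafka code.
class ReconfigurationBugSketch {
    private Set<String> knownGroups = Set.of();
    private int reconfigurationRequests = 0;

    // The initial load stores the groups but never requests reconfiguration,
    // because it assumes task configs will be generated after it finishes.
    void loadInitialConsumerGroups(Set<String> found) {
        knownGroups = found;
    }

    // The periodic refresh only requests reconfiguration when the set changed.
    void refreshConsumerGroups(Set<String> found) {
        if (!found.equals(knownGroups)) {
            knownGroups = found;
            reconfigurationRequests++;
        }
    }

    // Number of task configs the framework would get right now
    // (one task per group, purely for illustration).
    int taskConfigCount() {
        return knownGroups.size();
    }

    int reconfigurationRequests() {
        return reconfigurationRequests;
    }

    public static void main(String[] args) {
        ReconfigurationBugSketch connector = new ReconfigurationBugSketch();
        // start() returned after the timeout, so configs are generated too early.
        int tasksGenerated = connector.taskConfigCount();
        // The slow initial load finishes later, then a refresh sees the same set.
        connector.loadInitialConsumerGroups(Set.of("group-1", "group-2"));
        connector.refreshConsumerGroups(Set.of("group-1", "group-2"));
        System.out.println("tasks generated: " + tasksGenerated);
        System.out.println("groups known: " + connector.taskConfigCount());
        System.out.println("reconfiguration requests: " + connector.reconfigurationRequests());
    }
}
```

In this model the connector ends up knowing about both groups while zero tasks
were generated and no reconfiguration was ever requested, which matches the
stuck state described in the issue.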