Shichao An created KAFKA-19633:
----------------------------------

             Summary: Kafka Connect connectors sent out zombie records during rebalance
                 Key: KAFKA-19633
                 URL: https://issues.apache.org/jira/browse/KAFKA-19633
             Project: Kafka
          Issue Type: Bug
          Components: connect
    Affects Versions: 3.2.0
            Reporter: Shichao An
Hi, we run Debezium connectors on Kafka Connect. We identified several "zombie" records that were delivered by the connectors during or after a rebalance. Since the downstream consumers require ordering, this issue breaks several guarantees that were built on top of that ordering. Here is an overview of the setup:

* Connector type: Debezium Mongo Connector
* Kafka Connect version: 3.2
* Number of workers: 3-4
* Kafka producer configs: at-least-once settings, acks=all, max in-flight requests=1 (a sketch of these overrides is included at the end)

The following conclusion is based on our investigation:

{quote}When a Kafka Connect worker (part of a connector cluster) is overloaded or degraded, the connector on it may become temporarily unhealthy. The Kafka Connect cluster will rebalance the connector by "moving" it to another worker. When the connector is started on the new worker, the events resume normally without any data loss. Depending on the previously committed offsets, there might be a small number of duplicate events due to replay, but eventually the total ordering is still guaranteed.

However, the producer of the old worker may not have been gracefully shut down. When the old worker recovered, some old events that were already placed in the producer's internal queue got sent out to Kafka before the producer was forcefully closed. This caused the "out-of-band" duplicate events, which we refer to as "ghost duplicates" or "zombie records".{quote}

Can you verify our conclusion, and do you have any recommendation for a potential fix or prevention?
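For reference, a sketch of the producer overrides in our worker configuration that correspond to the settings listed above (these use the standard Connect worker-level `producer.` prefix; the exact values shown are illustrative):

{code}
# Worker config: producer overrides for at-least-once delivery
producer.acks=all
producer.max.in.flight.requests.per.connection=1
{code}

One possible prevention we are evaluating, assuming an upgrade from 3.2, is the exactly-once source support added in Kafka Connect 3.3 (KIP-618), which runs source tasks with transactional producers so that a zombie producer on the old worker is fenced rather than allowed to flush its queued records:

{code}
# Distributed worker config (all workers must first be rolled to "preparing",
# then to "enabled", per the KIP-618 upgrade procedure)
exactly.once.source.support=enabled

# Connector config: fail the connector rather than run without the guarantee
exactly.once.support=required
{code}

We have not verified this mitigation ourselves, so guidance on whether it closes the zombie-producer window described above would be appreciated.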