Shichao An created KAFKA-19633:
----------------------------------

             Summary: Kafka Connect connectors sent out zombie records during 
rebalance
                 Key: KAFKA-19633
                 URL: https://issues.apache.org/jira/browse/KAFKA-19633
             Project: Kafka
          Issue Type: Bug
          Components: connect
    Affects Versions: 3.2.0
            Reporter: Shichao An


Hi, we run Debezium connectors on Kafka Connect. We identified several "zombie" 
records that are delivered by the connectors during or after a rebalance. 
Since the downstream consumers require ordering, this issue breaks several 
things that were built on that ordering guarantee.

Here is an overview of the setup:
 * Connector type: Debezium Mongo Connector
 * Kafka Connect version: 3.2
 * Number of workers: 3-4
 * Kafka producer configs: at-least-once settings, acks=all, 
max.in.flight.requests.per.connection=1
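
For reference, the producer settings above correspond roughly to the following per-connector overrides (assuming the worker's {{connector.client.config.override.policy}} permits overrides; key names are the standard Kafka producer configs):

```properties
# Connector-level producer overrides (effective only when the worker allows
# client config overrides, e.g. connector.client.config.override.policy=All)
producer.override.acks=all
producer.override.max.in.flight.requests.per.connection=1
```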

The following conclusions are based on our investigation:
{quote}When a Kafka Connect worker (part of a connector cluster) is overloaded 
or degraded, the connector on it may become temporarily unhealthy. The Kafka 
Connect cluster will rebalance the connector by "moving" it to another worker. 
When the connector is started on the new worker, the events resume 
normally without any data loss. Depending on the previously committed 
offsets, there might be a small number of duplicate events due to replay, but 
eventually total ordering is still preserved. 

However, the producer on the old worker may not have been gracefully shut down. 
When the old worker recovered, some old events that were already queued in the 
producer's internal buffer were sent out to Kafka before the producer was 
forcefully closed. This caused the "out-of-band" duplicate events, which we 
refer to as "ghost duplicates" or "zombie records".
{quote}
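For what it's worth, this failure mode looks like the zombie-producer problem that Kafka's transactional producers address through epoch fencing: each {{initTransactions()}} call bumps the producer epoch for a given {{transactional.id}}, and any send from an older epoch is rejected with {{ProducerFencedException}}. (Kafka Connect 3.3+ exposes this for source connectors via the {{exactly.once.source.support}} worker config, per KIP-618.) A minimal toy sketch of the fencing idea (not actual Kafka code; {{Broker}}, {{init_producer}}, and {{append}} are illustrative names):

```python
# Toy model of epoch-based producer fencing: after a rebalanced task
# re-registers, sends carrying the old worker's stale epoch are rejected.
class Broker:
    def __init__(self):
        self.epochs = {}  # transactional.id -> latest registered epoch
        self.log = []     # committed records, in order

    def init_producer(self, txn_id):
        # Each registration bumps the epoch, fencing older producers.
        epoch = self.epochs.get(txn_id, -1) + 1
        self.epochs[txn_id] = epoch
        return epoch

    def append(self, txn_id, epoch, record):
        if epoch < self.epochs[txn_id]:
            raise RuntimeError("ProducerFenced: stale epoch")
        self.log.append(record)

broker = Broker()
old_epoch = broker.init_producer("task-0")  # producer on the old worker
new_epoch = broker.init_producer("task-0")  # task restarted on the new worker
broker.append("task-0", new_epoch, "event-after-rebalance")
try:
    broker.append("task-0", old_epoch, "zombie-record")  # queued old send
except RuntimeError:
    pass  # the old worker's buffered send is fenced, not appended
# broker.log -> ["event-after-rebalance"]
```

Under this scheme, the old worker's buffered sends fail instead of producing out-of-band duplicates.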
Can you verify our conclusion, and do you have any recommendations for a 
potential fix or prevention?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
