Tom Thornton created KAFKA-18073:
------------------------------------

             Summary: Data loss when Kafka Connect retriable exception raised
                 Key: KAFKA-18073
                 URL: https://issues.apache.org/jira/browse/KAFKA-18073
             Project: Kafka
          Issue Type: Bug
          Components: connect
            Reporter: Tom Thornton


We are experiencing data loss when Kafka Connect source connector fails to send 
data to Kafka topic after receiving a retriable exception from a schema 
registry (when converting the record).

Kafka cluster & connect details:
 # Kafka Connect version: 3.8.1
 # Kafka Version 3.6.1
 # Cluster size: 16 brokers
 # Number of partitions in Kafka data topic: 32
 # Number of partitions in offsets topic: 
 # Replication Factor: 4
 # Min In Sync Replicas: 2
 # Unclean leader election disabled
 # Uses Apicurio schema registry to convert to Avro
 # Uses default error tolerance i.e., None

Producer configs
 * acks = -1
 * max.in.flight.requests.per.connection = 1
 * batch.size = 524288

Our connector uses the offsets committed by Kafka Connect (does not have any 
internal checkpointing/state storage).

Data loss is noticed on occasion when a timeout is received during the 
[convertTransformedRecod|https://github.com/apache/kafka/blob/3.8.1/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L398]
 call. The timeout is from connecting to the schema registry for the schema. 
Full stack trace 
[here|https://gist.github.com/twthorn/5764b9dae2664ca72e71c1e6ebd4c792]

Logs (dropped record is from task 79):
{code:java}
2024-10-22T01:37:41.059Z - INFO - WorkerSourceTask{id=keyspace1-connector-79} 
Committing offsets for 3 acknowledged messages
2024-10-22T01:37:41.066Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-232} Committing offsets for 1 
acknowledged messages
2024-10-22T01:37:41.540Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-259} Committing offsets for 3 
acknowledged messages
2024-10-22T01:37:41.928Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-273} Committing offsets for 2 
acknowledged messages
2024-10-22T01:37:44Z - info - Handshake message sent
2024-10-22T01:37:51Z - info - Handshake timed out
2024-10-22T01:37:52Z - info - Handshake message sent
2024-10-22T01:37:58Z - info - Handshake timed out
2024-10-22T01:38:04Z - info - Handshake message sent
2024-10-22T01:38:10Z - info - Handshake timed out
2024-10-22T01:38:11.664Z - INFO - [Producer 
clientId=connector-producer-keyspace1-table1-connector-268] Node 1078 
disconnected.
2024-10-22T01:38:15.798Z - INFO - [Producer 
clientId=connector-producer-keyspace1-table1-connector-219] Node 1089 
disconnected.
2024-10-22T01:38:20Z - info - Handshake message sent
2024-10-22T01:38:23.225Z - INFO - [Producer 
clientId=connector-producer-keyspace1-connector-107] Node 1085 disconnected.
2024-10-22T01:38:23.224Z - INFO - [Producer 
clientId=connector-producer-keyspace1-connector-107] Node 1088 disconnected.
2024-10-22T01:38:24.536Z - INFO - [Producer 
clientId=connector-producer-keyspace1-table1-connector-13] Node 1076 
disconnected.
2024-10-22T01:38:27Z - info - Handshake timed out
2024-10-22T01:38:27.491Z - INFO - [Producer 
clientId=connector-producer-keyspace1-table1-connector-237] Node 1082 
disconnected.
2024-10-22T01:38:27.492Z - INFO - [Producer 
clientId=connector-producer-keyspace1-table1-connector-237] Node 1089 
disconnected.
2024-10-22T01:38:27.493Z - INFO - [Producer 
clientId=connector-producer-keyspace1-table1-connector-237] Node 1085 
disconnected.
2024-10-22T01:38:27.832Z - INFO - [Producer 
clientId=connector-producer-keyspace1-table1-connector-237] Node 1088 
disconnected.
2024-10-22T01:38:28.470Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-12} Committing offsets for 2 
acknowledged messages
2024-10-22T01:38:28.549Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-64} Committing offsets for 14 
acknowledged messages
2024-10-22T01:38:36.930Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-53} Committing offsets for 38 
acknowledged messages
2024-10-22T01:38:37.701Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-13} Committing offsets for 2 
acknowledged messages
2024-10-22T01:38:37.838Z - INFO - [AdminClient 
clientId=dbzium-prod-keyspace1-shared-admin] Node 1089 disconnected.
2024-10-22T01:38:38.369Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-25} Committing offsets for 6 
acknowledged messages
2024-10-22T01:38:39.212Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-56} Committing offsets for 21 
acknowledged messages
2024-10-22T01:38:39.597Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-51} Committing offsets for 11 
acknowledged messages
2024-10-22T01:38:39.642Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-219} Committing offsets for 4 
acknowledged messages
2024-10-22T01:38:39.647Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-237} Committing offsets for 1 
acknowledged messages
2024-10-22T01:38:39.681Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-268} Committing offsets for 8 
acknowledged messages
2024-10-22T01:38:41.059Z - INFO - WorkerSourceTask{id=keyspace1-connector-107} 
Committing offsets for 3 acknowledged messages
2024-10-22T01:38:41.540Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-232} Committing offsets for 3 
acknowledged messages
2024-10-22T01:38:41.928Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-259} Committing offsets for 2 
acknowledged messages
2024-10-22T01:38:42.252Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-273} Committing offsets for 1 
acknowledged messages
2024-10-22T01:38:44.913Z - ERROR - Failed to send HTTP request to endpoint: 
http://schema-registry.service.prod-us-east-1-dw1.consul:8080/apis/ccompat/v6/subjects/prod.keyspace1.table1-key/versions?normalize=false
2024-10-22T01:38:52Z - info - Handshake message sent
2024-10-22T01:39:06.790Z - INFO - [Worker 
clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
 groupId=dbzium-prod-keyspace1] Rebalance started
2024-10-22T01:39:07.134Z - INFO - [Worker 
clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
 groupId=dbzium-prod-keyspace1] Successfully joined group with generation 
Generation{generationId=2477, 
memberId='connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083-a174acf5-1b8f-419a-a25a-4c58dab02a8d',
 protocol='sessioned'}
2024-10-22T01:39:07.171Z - INFO - [Worker 
clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
 groupId=dbzium-prod-keyspace1] Successfully synced group in generation 
Generation{generationId=2477, 
memberId='connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083-a174acf5-1b8f-419a-a25a-4c58dab02a8d',
 protocol='sessioned'}
2024-10-22T01:39:07.171Z - INFO - [Worker 
clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
 groupId=dbzium-prod-keyspace1] Joined group at generation 2477 with protocol 
version 2 and got assignment: Assignment{error=0, 
leader='connect-dbzium-prod-keyspace1-b7cfc9bcc-h8gxb.node.prod-us-east-1-vitess1.consul:8083-dbbc75cb-1de9-4907-bf4c-f3f6cca6eb9a',
 
leaderUrl='http://dbzium-prod-keyspace1-b7cfc9bcc-h8gxb.node.prod-us-east-1-vitess1.consul:8083/',
 offset=236838, connectorIds=[], taskIds=[keyspace1-table1-connector-12, 
keyspace1-table1-connector-13, keyspace1-table1-connector-25, 
keyspace1-table1-connector-51, keyspace1-table1-connector-53, 
keyspace1-table1-connector-56, keyspace1-table1-connector-64, 
keyspace1-table1-connector-219, keyspace1-table1-connector-232, 
keyspace1-table1-connector-237, keyspace1-table1-connector-259, 
keyspace1-table1-connector-268, keyspace1-table1-connector-273, 
keyspace1-connector-79, keyspace1-connector-107, keyspace1-connector-285], 
revokedConnectorIds=[], revokedTaskIds=[], delay=180000} with rebalance delay: 
180000
2024-10-22T01:39:07.172Z - INFO - [Worker 
clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
 groupId=dbzium-prod-keyspace1] Finished starting connectors and tasks
2024-10-22T01:39:07.172Z - INFO - [Worker 
clientId=connect-dbzium-prod-keyspace1-b7cfc9bcc-8z59m.node.prod-us-east-1-vitess1.consul:8083,
 groupId=dbzium-prod-keyspace1] Starting connectors and tasks using config 
offset 236838
2024-10-22T01:39:12Z - info - Handshake message received
2024-10-22T01:39:12Z - info - Handshake message sent
2024-10-22T01:39:27.534Z - INFO - [Producer 
clientId=connector-producer-keyspace1-connector-107] Node 1074 disconnected.
2024-10-22T01:39:27.535Z - INFO - [Producer 
clientId=connector-producer-keyspace1-connector-107] Node 1077 disconnected.
2024-10-22T01:39:28.550Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-12} Committing offsets for 4 
acknowledged messages
2024-10-22T01:39:28.622Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-64} Committing offsets for 48 
acknowledged messages
2024-10-22T01:39:37.102Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-53} Committing offsets for 30 
acknowledged messages
2024-10-22T01:39:38.194Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-13} Committing offsets for 3 
acknowledged messages
2024-10-22T01:39:38.556Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-25} Committing offsets for 14 
acknowledged messages
2024-10-22T01:39:39.598Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-56} Committing offsets for 15 
acknowledged messages
2024-10-22T01:39:40.043Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-51} Committing offsets for 8 
acknowledged messages
2024-10-22T01:39:40.086Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-219} Committing offsets for 6 
acknowledged messages
2024-10-22T01:39:40.096Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-237} Committing offsets for 1 
acknowledged messages
2024-10-22T01:39:40.100Z - INFO - 
WorkerSourceTask{id=keyspace1-table1-connector-268} Committing offsets for 8 
acknowledged messages
2024-10-22T01:39:41.392Z - INFO - WorkerSourceTask{id=keyspace1-connector-79} 
Committing offsets for 12 acknowledged messages{code}
Does Kafka Connect commit offsets for a batch even if some of the records 
failed? Or must the whole batch be successful to commit offsets?

It is odd there is no restart, just a logged error that should have been 
retried.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to