[
https://issues.apache.org/jira/browse/KAFKA-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Greg Harris updated KAFKA-15339:
--------------------------------
Component/s: KafkaConnect
(was: connect)
> Transient I/O error happening in appending records could lead to the halt of
> whole cluster
> ------------------------------------------------------------------------------------------
>
> Key: KAFKA-15339
> URL: https://issues.apache.org/jira/browse/KAFKA-15339
> Project: Kafka
> Issue Type: Improvement
> Components: KafkaConnect, producer
> Affects Versions: 3.5.0
> Reporter: Haoze Wu
> Priority: Major
>
> We are running an integration test that starts an Embedded Connect
> Cluster on the active 3.5 branch. Because of a transient disk error, we
> may encounter an IOException while appending records to one topic, as
> shown in the stack trace:
> {code:java}
> [2023-08-13 16:53:51,016] ERROR Error while appending records to connect-config-topic-connect-cluster-0 in dir /tmp/EmbeddedKafkaCluster8003464883598783225 (org.apache.kafka.storage.internals.log.LogDirFailureChannel:61)
> java.io.IOException:
>     at org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:92)
>     at org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188)
>     at kafka.log.LogSegment.append(LogSegment.scala:161)
>     at kafka.log.LocalLog.append(LocalLog.scala:436)
>     at kafka.log.UnifiedLog.append(UnifiedLog.scala:853)
>     at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:664)
>     at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1281)
>     at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1269)
>     at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:977)
>     at scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>     at scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>     at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>     at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:965)
>     at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:623)
>     at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:680)
>     at kafka.server.KafkaApis.handle(KafkaApis.scala:180)
>     at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:76)
>     at java.lang.Thread.run(Thread.java:748) {code}
> However, merely because appending records to one partition failed, the
> fetchers for all the other partitions are removed, the broker is shut
> down, and finally the embedded Connect cluster is killed as a whole.
> {code:java}
> [2023-08-13 17:35:37,966] WARN Stopping serving logs in dir /tmp/EmbeddedKafkaCluster6777164631574762227 (kafka.log.LogManager:70)
> [2023-08-13 17:35:37,968] ERROR Shutdown broker because all log dirs in /tmp/EmbeddedKafkaCluster6777164631574762227 have failed (kafka.log.LogManager:143)
> [2023-08-13 17:35:37,968] WARN Abrupt service halt with code 1 and message null (org.apache.kafka.connect.util.clusters.EmbeddedConnectCluster:130)
> [2023-08-13 17:35:37,968] ERROR [LogDirFailureHandler]: Error due to (kafka.server.ReplicaManager$LogDirFailureHandler:135)
> org.apache.kafka.connect.util.clusters.UngracefulShutdownException: Abrupt service halt with code 1 and message null {code}
> I am wondering if we could add a configurable retry around the root
> cause to tolerate transient I/O faults, so that if a retry succeeds, the
> embedded Connect cluster can keep operating.
> Any comments and suggestions would be appreciated.
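The proposed configurable retry could be sketched roughly as below. This is a minimal illustration only, not existing Kafka code: the class `RetryableAppend`, the functional interface `Append`, and the parameters `maxAttempts` and `backoffMs` are all hypothetical names, and where such a retry would actually hook into the `LogSegment.append` path is an open design question.

```java
import java.io.IOException;

// Hypothetical sketch of a bounded retry around a log-append operation.
// None of these names exist in Kafka; they only illustrate the idea of
// tolerating a transient IOException before marking the log dir offline.
public class RetryableAppend {

    @FunctionalInterface
    public interface Append {
        void run() throws IOException;
    }

    // Try the append up to maxAttempts times, sleeping backoffMs between
    // attempts; rethrow the last IOException only after every attempt fails
    // (at which point the existing log-dir failure handling would take over).
    public static void appendWithRetry(Append append, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                append.run();
                return; // success: the transient fault was tolerated
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                }
            }
        }
        throw last;
    }
}
```

With something like this in place, an append that fails once or twice due to a transient disk error would succeed on a later attempt instead of immediately failing the whole log directory.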
--
This message was sent by Atlassian Jira
(v8.20.10#820010)