[ https://issues.apache.org/jira/browse/KAFKA-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17755812#comment-17755812 ]
Sagar Rao commented on KAFKA-15339: ----------------------------------- Thanks for reporting this bug [~functioner]. As you correctly pointed out, there seems to be no retry mechanism while trying to append files to the topic which is backed by a file. From what I can see, whenever it fails, `LocalLog#maybeHandleIOException` is invoked which marks the given directory as offline which I believe triggers the rest of the errors that you see on the Connect cluster. While the suggestion you provided is a valid one i.e to retry in such cases, I couldn't find that pattern being used anywhere in this part of the code. Having said that, have you noticed this error being thrown for versions < 3.5? The reason I ask that is that in v3.5, a migration of this class to the storage module was done. Here's the [link|https://github.com/apache/kafka/pull/13041] of the PR. AFAICS, there is not much difference and we should have experienced the same behaviour but was just curious. LMK. Or else, we can look at adding the retry mechanism > Transient I/O error happening in appending records could lead to the halt of > whole cluster > ------------------------------------------------------------------------------------------ > > Key: KAFKA-15339 > URL: https://issues.apache.org/jira/browse/KAFKA-15339 > Project: Kafka > Issue Type: Improvement > Components: connect, producer > Affects Versions: 3.5.0 > Reporter: Haoze Wu > Priority: Major > > We are running an integration test in which we start an Embedded Connect > Cluster in the active 3.5 branch. However, because of transient disk error, > we may encounter an IOException during appending records to one topic. As > shown in the stack trace: > {code:java} > [2023-08-13 16:53:51,016] ERROR Error while appending records to > connect-config-topic-connect-cluster-0 in dir > /tmp/EmbeddedKafkaCluster8003464883598783225 > (org.apache.kafka.storage.internals.log.LogDirFailureChannel:61) > java.io.IOException: > at > org.apache.kafka.common.record.MemoryRecords.writeFullyTo(MemoryRecords.java:92) > at > org.apache.kafka.common.record.FileRecords.append(FileRecords.java:188) > at kafka.log.LogSegment.append(LogSegment.scala:161) > at kafka.log.LocalLog.append(LocalLog.scala:436) > at kafka.log.UnifiedLog.append(UnifiedLog.scala:853) > at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:664) > at > kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1281) > at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1269) > at > kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:977) > at > scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28) > at > scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27) > at scala.collection.mutable.HashMap.map(HashMap.scala:35) > at > kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:965) > at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:623) > at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:680) > at kafka.server.KafkaApis.handle(KafkaApis.scala:180) > at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:76) > at java.lang.Thread.run(Thread.java:748) {code} > However, just because of failing to append the records to one partition. The > fetcher for all the other partitions are removed, broker shutdown, and > finally embedded connect cluster killed as whole. > {code:java} > [2023-08-13 17:35:37,966] WARN Stopping serving logs in dir > /tmp/EmbeddedKafkaCluster6777164631574762227 (kafka.log.LogManager:70) > [2023-08-13 17:35:37,968] ERROR Shutdown broker because all log dirs in > /tmp/EmbeddedKafkaCluster6777164631574762227 have failed > (kafka.log.LogManager:143) > [2023-08-13 17:35:37,968] WARN Abrupt service halt with code 1 and message > null (org.apache.kafka.connect.util.clusters.EmbeddedConnectCluster:130) > [2023-08-13 17:35:37,968] ERROR [LogDirFailureHandler]: Error due to > (kafka.server.ReplicaManager$LogDirFailureHandler:135) > org.apache.kafka.connect.util.clusters.UngracefulShutdownException: Abrupt > service halt with code 1 and message null {code} > I am wondering if we could add configurable retry around the root cause to > tolerate the possible I/O faults so that if the retry is successful, the > embedded connect cluster could still operate. > Any comments and suggestions would be appreciated. -- This message was sent by Atlassian Jira (v8.20.10#820010)