Jerome Morel created KAFKA-16997:
------------------------------------
Summary: do not stop kafka when issue to delete a partition folder
Key: KAFKA-16997
URL: https://issues.apache.org/jira/browse/KAFKA-16997
Project: Kafka
Issue Type: Improvement
Components: core
Affects Versions: 3.6.2
Reporter: Jerome Morel
Context: In our project we create many partitions. Even after we delete the
segments, the partitions themselves remain, and we ended up with so many
partitions that Kafka crashed because of the number of open files. We
therefore want to delete those partitions regularly, but Kafka stops while we
do so.
The issue: after some investigation we found that the deletion process
sometimes logs a warning when it cannot delete a log file:
{code:java}
[2024-06-17 15:52:39,590] WARN Failed atomic move of /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex to /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex.deleted retrying with a non-atomic move (org.apache.kafka.common.utils.Utils)
java.nio.file.NoSuchFileException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex -> /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete/00000000000000000010.timeindex.deleted
    at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
    at java.base/sun.nio.fs.UnixCopyFile.move(UnixCopyFile.java:416)
    at java.base/sun.nio.fs.UnixFileSystemProvider.move(UnixFileSystemProvider.java:266)
    at java.base/java.nio.file.Files.move(Files.java:1432)
    at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:980)
    at org.apache.kafka.storage.internals.log.LazyIndex$IndexFile.renameTo(LazyIndex.java:80)
    at org.apache.kafka.storage.internals.log.LazyIndex.renameTo(LazyIndex.java:202)
    at org.apache.kafka.storage.internals.log.LogSegment.changeFileSuffixes(LogSegment.java:666)
    at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$1(LocalLog.scala:912)
    at kafka.log.LocalLog$.$anonfun$deleteSegmentFiles$1$adapted(LocalLog.scala:910)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at kafka.log.LocalLog$.deleteSegmentFiles(LocalLog.scala:910)
    at kafka.log.LocalLog.removeAndDeleteSegments(LocalLog.scala:289)
{code}
In that case Kafka just continues. However, when it fails to delete a folder,
it marks the replica as failed and then stops Kafka if it is the only replica
available (which is our case):
{code:java}
[2024-06-17 15:52:39,637] ERROR Error while deleting dir for 69747657-f49d-453f-9fa2-4d4369199699-0 in dir /tmp/kafka-logs-mnt/kafka-no-docker (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.nio.file.DirectoryNotEmptyException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete
    at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246)
    at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
    at java.base/java.nio.file.Files.delete(Files.java:1152)
    at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:923)
    at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:901)
    at java.base/java.nio.file.Files.walkFileTree(Files.java:2828)
    at java.base/java.nio.file.Files.walkFileTree(Files.java:2882)
    at org.apache.kafka.common.utils.Utils.delete(Utils.java:901)
    at kafka.log.LocalLog.$anonfun$deleteEmptyDir$2(LocalLog.scala:243)
    at kafka.log.LocalLog.deleteEmptyDir(LocalLog.scala:709)
    at kafka.log.UnifiedLog.$anonfun$delete$2(UnifiedLog.scala:1734)
    at kafka.log.UnifiedLog.delete(UnifiedLog.scala:1911)
    at kafka.log.LogManager.deleteLogs(LogManager.scala:1152)
    at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:1166)
    at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
[2024-06-17 15:52:39,640] WARN [ReplicaManager broker=0] Stopping serving replicas in dir /tmp/kafka-logs-mnt/kafka-no-docker (kafka.server.ReplicaManager)
[2024-06-17 15:52:39,640] INFO [LocalLog partition=a11f3352-56fc-4d00-bdf8-f5fee33391f6-0, dir=/tmp/kafka-logs-mnt/kafka-no-docker] Deleting segment files LogSegment(baseOffset=0, size=861, lastModifiedTime=0, largestRecordTimestamp=1718632120826) (kafka.log.LocalLog$)
[2024-06-17 15:52:39,641] ERROR Uncaught exception in scheduled task 'delete-file' (org.apache.kafka.server.util.KafkaScheduler)
org.apache.kafka.common.errors.KafkaStorageException: The log dir /tmp/kafka-logs-mnt/kafka-no-docker is already offline due to a previous IO exception.
[2024-06-17 15:52:39,641] ERROR Exception while deleting Log(dir=/tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete, topicId=wohaEWpfTR6HuqDFlcIJYw, topic=69747657-f49d-453f-9fa2-4d4369199699, partition=0, highWatermark=10, lastStableOffset=10, logStartOffset=10, logEndOffset=10) in dir /tmp/kafka-logs-mnt/kafka-no-docker. (kafka.log.LogManager)
org.apache.kafka.common.errors.KafkaStorageException: Error while deleting dir for 69747657-f49d-453f-9fa2-4d4369199699-0 in dir /tmp/kafka-logs-mnt/kafka-no-docker
Caused by: java.nio.file.DirectoryNotEmptyException: /tmp/kafka-logs-mnt/kafka-no-docker/69747657-f49d-453f-9fa2-4d4369199699-0.7b51dad41a77448d8b419c76749f0b2c-delete
    at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246)
    at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105)
    at java.base/java.nio.file.Files.delete(Files.java:1152)
    at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:923)
    at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:901)
    at java.base/java.nio.file.Files.walkFileTree(Files.java:2828)
    at java.base/java.nio.file.Files.walkFileTree(Files.java:2882)
    at org.apache.kafka.common.utils.Utils.delete(Utils.java:901)
    at kafka.log.LocalLog.$anonfun$deleteEmptyDir$2(LocalLog.scala:243)
    at kafka.log.LocalLog.deleteEmptyDir(LocalLog.scala:709)
    at kafka.log.UnifiedLog.$anonfun$delete$2(UnifiedLog.scala:1734)
    at kafka.log.UnifiedLog.delete(UnifiedLog.scala:1911)
    at kafka.log.LogManager.deleteLogs(LogManager.scala:1152)
    at kafka.log.LogManager.$anonfun$deleteLogs$6(LogManager.scala:1166)
    at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
[2024-06-17 15:52:39,642] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions Set
{code}
We tried different versions of Kafka (2.8 and 3.7) and the behavior is the
same.
Is there a reason to only log a warning when a file in the partition cannot be
deleted, but to fail hard when it is the directory itself that cannot be
deleted? Would it be possible to also log a warning when the directory cannot
be deleted and simply proceed?
In our case, after a restart of Kafka everything gets deleted as expected (the
failure is caused by a disk glitch).
Remark: our server does not have local storage, so we use a network disk, and
such glitches may happen often.
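The behavior requested above could look roughly like the following sketch. It
is not Kafka's code: the class TolerantDelete, the method deleteRecursively,
and the parameters maxAttempts and backoffMs are hypothetical names chosen for
illustration, and only standard java.nio.file calls are used. It assumes the
failure is transient (as with our network disk), so it logs a warning and
retries instead of treating the first DirectoryNotEmptyException as fatal.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper for illustration only; not part of Kafka. */
final class TolerantDelete {

    /**
     * Tries to delete dir recursively, up to maxAttempts times.
     * Returns true once the directory is gone, false if it is still
     * not empty after the last attempt. Other IO errors are rethrown.
     */
    static boolean deleteRecursively(Path dir, int maxAttempts, long backoffMs)
            throws IOException, InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                deleteOnce(dir);
                return true;
            } catch (DirectoryNotEmptyException e) {
                // Warn and retry instead of failing the whole log directory.
                System.err.printf("WARN attempt %d/%d: %s is not empty, will retry%n",
                        attempt, maxAttempts, dir);
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs);
                }
            }
        }
        return false;
    }

    private static void deleteOnce(Path dir) throws IOException {
        if (Files.notExists(dir)) {
            return; // nothing left to delete
        }
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                // A file that vanished in the meantime is not an error.
                Files.deleteIfExists(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc)
                    throws IOException {
                if (exc instanceof NoSuchFileException) {
                    return FileVisitResult.CONTINUE;
                }
                throw exc;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                // Throws DirectoryNotEmptyException if entries reappeared.
                Files.deleteIfExists(d);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

For example, TolerantDelete.deleteRecursively(dir, 3, 500L) would try up to
three times, 500 ms apart, and return false rather than throw if the directory
is still not empty, leaving it to the caller to schedule another attempt.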