[ https://issues.apache.org/jira/browse/KAFKA-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096436#comment-17096436 ]
yongmao wang commented on KAFKA-7575:
-------------------------------------

Here is my initial investigation of this issue.

The error occurs when the file replication-offset-checkpoint.tmp is renamed to replication-offset-checkpoint, and it causes the Kafka service to stop. The Kafka logs show that both "AccessDeniedException" and "FileAlreadyExistsException" were thrown.

I reproduced the same exceptions with test code on my Windows debug machine. I ran the same file move code as the Kafka method "atomicMoveWithFallback()" and tried to find out which case generates these exceptions. The error reproduces only when a buffered reader is created with "Files.newBufferedReader()" and is not closed before the file move operation. As the screenshot below shows, in that case the move throws "java.nio.file.AccessDeniedException" first and then "java.nio.file.FileAlreadyExistsException":

!https://rally1.rallydev.com/slm/attachment/372362614736/Screen%20Shot%202020-02-25%20at%206.31.31%20AM.png|width=1256,height=497!

I then reviewed the Kafka source code again and found the ["read()" method in the class "CheckpointFile.scala"|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/checkpoints/CheckpointFile.scala#L127]. It calls "Files.newBufferedReader()" to read the content of the file "replication-offset-checkpoint". So my guess is that another thread may be calling read() while write() is trying to move replication-offset-checkpoint.tmp to replication-offset-checkpoint, which makes the rename fail.

I then used the "Find Usages" function to find out where this read() method is called.
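
For reference, the reproduction described above can be sketched roughly as follows. This is a minimal standalone sketch, not the test code from the screenshot: the class name CheckpointMoveRepro and the helpers moveWithFallback() and attemptMove() are hypothetical, and moveWithFallback() only approximates the "atomic move, then fall back" shape of Kafka's atomicMoveWithFallback(). The outcome with the reader left open depends on the platform; the failure discussed here was observed on Windows.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical reproduction harness; names are illustrative, not taken from Kafka.
final class CheckpointMoveRepro {

    // Assumption: approximates the move logic under test.
    // Try an atomic rename first, then fall back to a replacing move.
    static void moveWithFallback(Path source, Path target) throws IOException {
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException outer) {
            try {
                Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException inner) {
                inner.addSuppressed(outer);
                throw inner;
            }
        }
    }

    // Opens a reader on the target file, optionally closes it, then attempts the move.
    // Returns null if the move succeeded, otherwise the exception that was raised.
    static IOException attemptMove(Path dir, boolean closeReaderFirst) throws IOException {
        Path target = dir.resolve("replication-offset-checkpoint");
        Path tmp = dir.resolve("replication-offset-checkpoint.tmp");
        Files.write(target, "old".getBytes(StandardCharsets.UTF_8));
        Files.write(tmp, "new".getBytes(StandardCharsets.UTF_8));

        BufferedReader reader = Files.newBufferedReader(target);
        try {
            reader.readLine();
            if (closeReaderFirst) {
                reader.close();
            }
            moveWithFallback(tmp, target);
            return null;
        } catch (IOException e) {
            return e;
        } finally {
            reader.close(); // closing an already closed reader has no effect
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("checkpoint-repro");
        System.out.println("reader left open:    " + attemptMove(dir, false));
        System.out.println("reader closed first: " + attemptMove(dir, true));
    }
}
```

With the reader closed before the move, attemptMove() should return null. With the reader left open, the result is whatever the underlying file system allows, so the first line printed by main() is the one to compare against the exceptions in the Kafka logs.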
However, read() is used in many places, so it is very difficult to identify which caller causes this threading problem.

I noticed that in the same class "CheckpointFile.scala", besides the write() method, there is the read() method, which opens a buffered reader, so I suspected a multi-threading problem. But both methods wrap their whole code block in "lock synchronized", which makes this very strange to us. At the moment it is very difficult to find the root cause.

What are your thoughts? Thanks

> 'Error while writing to checkpoint file' Issue
> ----------------------------------------------
>
>                 Key: KAFKA-7575
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7575
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer
>    Affects Versions: 1.1.1
>         Environment: Windows 10, Kafka 1.1.1
>            Reporter: Dasun Nirmitha
>            Priority: Major
>         Attachments: Dry run error.rar
>
>
> I'm currently testing a Java Kafka producer application coded to retrieve a
> db value from a local mysql db and produce to a single topic. Locally I've
> got a Zookeeper server and a Kafka single broker running.
> My issue is I need to produce this from the Kafka producer each second, and
> that works for around 2 hours until broker throws an 'Error while writing to
> checkpoint file' and shuts down. Producing with a 1 minute interval works
> with no issues but unfortunately I need the produce interval to be 1 second.
> I have attached a rar containing screenshots of the Errors thrown from the
> Broker and my application.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)