1996fanrui commented on PR #20689:
URL: https://github.com/apache/flink/pull/20689#issuecomment-1351217015

   Hi @Myasuka, thanks a lot for your review. I want to add some more information here; please take a look when you have time, thanks!
   
   This problem also occurred in our production environment. The shared directory of one Flink job contained more than 1 million files, which exceeded the HDFS limit, so new files could not be written.
   
   However, only about 50k of those files were actually in use; the other 950k should have been cleaned up.
   
   <img width="1670" alt="image" src="https://user-images.githubusercontent.com/38427477/207588272-dda7ba69-c84c-4372-aeb4-c54657b9b956.png">
   
   <img width="1451" alt="image" src="https://user-images.githubusercontent.com/38427477/207589898-7b8f6c1b-8947-4fa1-843a-c7e7103aa755.png">
   
   
   ## I want to restate the root cause:
   
   The async thread is creating the output stream (`FsCheckpointStateOutputStream#flushToFile -> createStream`), and the HDFS response may be slow. At the same time, the task thread calls `FsCheckpointStateOutputStream#close`; since `outputStream` and `statePath` are still null at that point, the stream is not closed and the state path is not cleaned up.
   
   When the async thread finishes, `FileSystemSafetyNet` closes the output stream but does not delete the file, so the file is kept on HDFS forever.
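   To make the race window concrete, here is a minimal sketch of the pattern (simplified names and a placeholder path, not the actual Flink code): `close()` checks fields that `createStream()` has not assigned yet, so it skips both closing and cleanup, and the file created by the pending `createStream()` call leaks.
   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.io.OutputStream;
   
   // Hedged sketch of the race window; field and method names mirror the report,
   // but this is NOT the real Flink implementation.
   class LeakingStateOutputStreamSketch {
   
       private volatile OutputStream outputStream; // still null while createStream() is in flight
       private volatile String statePath;          // still null as well
       private volatile boolean closed;
   
       // Async (checkpointing) thread: lazily creates the stream before flushing.
       void flushToFile() throws IOException {
           if (outputStream == null) {
               createStream(); // slow HDFS call; the remote file is created here
           }
           if (closed) {
               // close() already ran, saw null fields, and skipped cleanup;
               // the file created above is now orphaned on HDFS.
               return;
           }
           // ... write buffered bytes to outputStream ...
       }
   
       // Task thread: called when the checkpoint times out or is cancelled.
       public void close() {
           closed = true;
           if (outputStream != null) { // false during the race window
               // close outputStream and delete the file at statePath
           }
           // With outputStream still null, neither close nor cleanup happens here.
       }
   
       private void createStream() throws IOException {
           // Creates the file on HDFS and only afterwards assigns the fields.
           statePath = "hdfs://.../shared/placeholder";  // hypothetical path
           outputStream = new ByteArrayOutputStream();   // stand-in for the HDFS stream
       }
   }
   ```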
   
   ## How to reproduce?
   
   I added some delay inside `createStream` and turned down the checkpoint timeout; with that, this bug is easy to reproduce, and the leaked files accumulate on HDFS forever. A sketch of the timeout side follows.

