1996fanrui commented on PR #20689:
URL: https://github.com/apache/flink/pull/20689#issuecomment-1351217015

   Hi @Myasuka, thanks a lot for your review. I want to add some more information here; please take a look when you have time, thanks!
   
   This problem also occurred in our production environment. The shared directory of one Flink job contained more than 1 million files, which exceeded the HDFS limit, so new files could not be written.
   
   However, only about 50k of those files were actually in use; the other 950k should have been cleaned up.
   
   <img width="1670" alt="image" src="https://user-images.githubusercontent.com/38427477/207588272-dda7ba69-c84c-4372-aeb4-c54657b9b956.png">
   
   <img width="1451" alt="image" src="https://user-images.githubusercontent.com/38427477/207589898-7b8f6c1b-8947-4fa1-843a-c7e7103aa755.png">
   
   
   ## I want to restate the root cause:
   
   The async thread is creating the output stream (`FsCheckpointStateOutputStream#flushToFile -> createStream`), and the HDFS response may be slow. At the same time, the task thread calls `FsCheckpointStateOutputStream#close`; since `outputStream` and `statePath` are still null at that point, the stream is not closed and the state path is not cleaned up.
   
   When the async thread finishes, `FileSystemSafetyNet` closes the output stream but does not delete the file, so the file is kept on HDFS forever.
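   To make the race window concrete, here is a minimal sketch of the pattern (simplified names and a placeholder path, not the actual Flink code): `close()` checks fields that `createStream()` has not assigned yet, so it skips both closing and cleanup, and the file created by the pending `createStream()` call leaks.
   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.io.OutputStream;
   
   // Hedged sketch of the race window; field and method names mirror the report,
   // but this is NOT the real Flink implementation.
   class LeakingStateOutputStreamSketch {
   
       private volatile OutputStream outputStream; // still null while createStream() is in flight
       private volatile String statePath;          // still null as well
       private volatile boolean closed;
   
       // Async (checkpointing) thread: lazily creates the stream before flushing.
       void flushToFile() throws IOException {
           if (outputStream == null) {
               createStream(); // slow HDFS call; the remote file is created here
           }
           if (closed) {
               // close() already ran, saw null fields, and skipped cleanup;
               // the file created above is now orphaned on HDFS.
               return;
           }
           // ... write buffered bytes to outputStream ...
       }
   
       // Task thread: called when the checkpoint times out or is cancelled.
       public void close() {
           closed = true;
           if (outputStream != null) { // false during the race window
               // close outputStream and delete the file at statePath
           }
           // With outputStream still null, neither close nor cleanup happens here.
       }
   
       private void createStream() throws IOException {
           // Creates the file on HDFS and only afterwards assigns the fields.
           statePath = "hdfs://.../shared/placeholder";  // hypothetical path
           outputStream = new ByteArrayOutputStream();   // stand-in for the HDFS stream
       }
   }
   ```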
   
   ## How to reproduce?
   
   I added some delay inside `createStream` and turned down the checkpoint timeout; with that, this bug is easy to reproduce, and the leaked files accumulate on HDFS forever. A sketch of the timeout side follows.

