[
https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953299#comment-16953299
]
Julius Michaelis commented on FLINK-13808:
------------------------------------------
As for something else that may be easier to implement than a cancellation
message: why don't the taskmanagers also cancel checkpoints based on timeouts?
(Of course, that doesn't catch other cancellation reasons, but…)
> Checkpoints expired by timeout may leak RocksDB files
> -----------------------------------------------------
>
> Key: FLINK-13808
> URL: https://issues.apache.org/jira/browse/FLINK-13808
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / State Backends
> Affects Versions: 1.8.0, 1.8.1
> Environment: So far only reliably reproducible on a 4-node cluster
> with parallelism ≥ 100. But do try
> https://github.com/jcaesar/flink-rocksdb-file-leak
> Reporter: Julius Michaelis
> Priority: Minor
>
> A RocksDB state backend with HDFS checkpoints, with or without local
> recovery, may leak files in {{io.tmp.dirs}} on checkpoint expiry by timeout.
> If the size of a checkpoint crosses what can be transferred during one
> checkpoint timeout, checkpoints will continue to fail forever. If this is
> combined with a quick rollover of SST files (e.g. due to a high density of
> writes), this may quickly exhaust available disk space (or memory, as /tmp is
> the default location).
> As a workaround, the jobmanager's REST API can be frequently queried for
> failed checkpoints, and associated files deleted accordingly.
> I've tried investing the cause a little bit, but I'm stuck:
> * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before
> completing.}} and similar gets printed, so
> * [{{abortExpired}} is
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547-L549],
> so
> * [{{dispose}} is
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L416],
> so
> * [{{cancelCaller}} is
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488],
> so
> * [the canceler is
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497]
> ([through one more
> layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
> so
> * [{{cleanup}} is
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95],
> (possibly [not from
> {{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]),
> so
> * [{{cleanupProvidedResources}} is
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162]
> (this is the indirection that made me give up), so
> * [this trace
> log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372]
> should be printed, but it isn't.
> I have some time to further investigate, but I'd appreciate help on finding
> out where in this chain things go wrong.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)