[jira] [Commented] (FLINK-13808) Checkpoints expired by timeout may leak RocksDB files

Congxian Qiu(klion26) (Jira) Thu, 05 Sep 2019 19:25:34 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923874#comment-16923874
 ]


Congxian Qiu(klion26) commented on FLINK-13808:
-----------------------------------------------

After analyzing this with [~Caesar], we found that the issue is caused by an 
IO/network problem.

The local directory is deleted only in the following scenarios:
 * the snapshot failed
 * the snapshot succeeded (when local recovery is disabled)
 * a newer completed checkpoint has been received (when local recovery is 
enabled)

In this case, the snapshot was still ongoing (uploading the SST files) when the 
leaked files were observed, so none of the scenarios above had been reached and 
the local directory was not deleted.
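
The rules above can be summarized as a small decision function. This is only a 
sketch of my reading of the scenarios; {{LocalDirCleanupSketch}}, 
{{SnapshotStatus}} and {{shouldDeleteLocalDir}} are hypothetical names for 
illustration, not Flink classes:

```java
/** Hypothetical model of the cleanup scenarios; not part of Flink. */
final class LocalDirCleanupSketch {

    /** Assumed, simplified states of a snapshot on the TM side. */
    enum SnapshotStatus { ONGOING, FAILED, SUCCEEDED }

    /**
     * Returns true if the local snapshot directory would be deleted
     * under the three scenarios listed above.
     */
    static boolean shouldDeleteLocalDir(
            SnapshotStatus status,
            boolean localRecoveryEnabled,
            boolean newerCheckpointCompleted) {
        switch (status) {
            case FAILED:
                // Scenario 1: a failed snapshot is always cleaned up.
                return true;
            case SUCCEEDED:
                // Scenario 2: without local recovery, delete right away.
                // Scenario 3: with local recovery, wait for a newer
                // completed checkpoint.
                return !localRecoveryEnabled || newerCheckpointCompleted;
            default:
                // ONGOING: no scenario applies, so the files stay on disk.
                return false;
        }
    }
}
```

An expired checkpoint whose upload is still running stays in the {{ONGOING}} 
branch, which matches the leak observed here.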

I think FLINK-8871 would help with this issue.
Aside from FLINK-8871, I want to propose the following improvements:
 * keep only {{maxConcurrentCheckpoint}} snapshots on the TM side: if 
{{maxConcurrentCheckpoint}} is 2 and the current checkpoint is 5, we cancel 
all checkpoints before 4 (the complete/cancel RPC message may arrive late)
 * add debug/trace logging to track the steps of the snapshot on the TM side, 
so users can see which step a snapshot is currently in
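
A minimal Java sketch of the first proposal, assuming checkpoint IDs are 
registered on the TM in increasing order. {{SnapshotRetentionSketch}}, 
{{register}}, {{finished}} and the cancel actions are hypothetical names for 
illustration, not existing Flink code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical TM-side tracker for ongoing snapshots; not part of Flink. */
final class SnapshotRetentionSketch {

    private final int maxConcurrentCheckpoints;

    // checkpoint id -> action that cancels the snapshot still running for it
    private final NavigableMap<Long, Runnable> ongoing = new TreeMap<>();

    SnapshotRetentionSketch(int maxConcurrentCheckpoints) {
        if (maxConcurrentCheckpoints < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxConcurrentCheckpoints = maxConcurrentCheckpoints;
    }

    /**
     * Registers a new snapshot and cancels every ongoing snapshot that falls
     * outside the retention window. Returns the cancelled checkpoint ids.
     */
    synchronized List<Long> register(long checkpointId, Runnable cancelAction) {
        ongoing.put(checkpointId, cancelAction);

        // With a limit of 2 and current checkpoint 5, the oldest kept id is 4.
        long oldestToKeep = checkpointId - maxConcurrentCheckpoints + 1;

        // View of all entries with id strictly below the oldest kept id.
        NavigableMap<Long, Runnable> stale = ongoing.headMap(oldestToKeep, false);
        List<Long> cancelled = new ArrayList<>();
        for (Map.Entry<Long, Runnable> entry : stale.entrySet()) {
            entry.getValue().run();
            cancelled.add(entry.getKey());
        }
        stale.clear(); // clearing the view also removes the entries from 'ongoing'
        return cancelled;
    }

    /** Called when a snapshot completes or fails on its own. */
    synchronized void finished(long checkpointId) {
        ongoing.remove(checkpointId);
    }
}
```

With a limit of 2, registering checkpoint 5 while 3 and 4 are still ongoing 
cancels 3 and keeps 4 and 5, even if the complete/cancel RPC for 3 never 
arrived.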

[~StephanEwen] [~carp84] What do you think about these two improvements? If 
they are OK, I'll file issues and contribute them.

> Checkpoints expired by timeout may leak RocksDB files
> -----------------------------------------------------
>
>                 Key: FLINK-13808
>                 URL: https://issues.apache.org/jira/browse/FLINK-13808
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.8.0, 1.8.1
>         Environment: So far only reliably reproducible on a 4-node cluster 
> with parallelism ≥ 100. But do try 
> https://github.com/jcaesar/flink-rocksdb-file-leak
>            Reporter: Julius Michaelis
>            Priority: Minor
>
> A RocksDB state backend with HDFS checkpoints, with or without local 
> recovery, may leak files in {{io.tmp.dirs}} on checkpoint expiry by timeout.
> If the size of a checkpoint crosses what can be transferred during one 
> checkpoint timeout, checkpoints will continue to fail forever. If this is 
> combined with a quick rollover of SST files (e.g. due to a high density of 
> writes), this may quickly exhaust available disk space (or memory, as /tmp is 
> the default location).
> As a workaround, the jobmanager's REST API can be frequently queried for 
> failed checkpoints, and associated files deleted accordingly.
> I've tried investigating the cause a little bit, but I'm stuck:
>  * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before 
> completing.}} and similar gets printed, so
>  * [{{abortExpired}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547-L549],
>  so
>  * [{{dispose}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L416],
>  so
>  * [{{cancelCaller}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488],
>  so
>  * [the canceler is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497]
>  ([through one more 
> layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
>  so
>  * [{{cleanup}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95],
>  (possibly [not from 
> {{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]),
>  so
>  * [{{cleanupProvidedResources}} is 
> invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162]
>  (this is the indirection that made me give up), so
>  * [this trace 
> log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372]
>  should be printed, but it isn't.
> I have some time to further investigate, but I'd appreciate help on finding 
> out where in this chain things go wrong.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
