[ 
https://issues.apache.org/jira/browse/FLINK-10930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708978#comment-16708978
 ] 

Yun Tang commented on FLINK-10930:
----------------------------------

Speaking of eagerly and properly discarding useless state, this is a real pain 
point for us: we can have 200TB+ of state in our production jobs, so even 1% 
of useless state left behind occupies 2TB of logical space in our DFS. I agree 
that cleaning up exclusive data is much easier with the current checkpoint 
directory layout. However, I think the current layout does not help with 
cleaning up shared state, which occupies most of the space in each checkpoint; 
we still have to list the files under the shared state directory. We have 
implemented a similar cleanup mechanism: it introduces a new 
{{listStatusIterator}} API to reduce the pressure on the DFS and the memory 
footprint in the JM, looks up each file in a refactored snapshot registry, and 
then discards useless state handles asynchronously.
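
The cleanup steps described above (iterate over the shared directory lazily, check each file against the registry, delete unreferenced files asynchronously) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation mentioned in this comment: {{SharedStateCleaner}} and its method names are hypothetical, the registry is reduced to a plain set of referenced file names, and {{java.nio.file.DirectoryStream}} stands in for an iterator-style DFS listing such as {{listStatusIterator}}.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical helper for illustration only; not part of Flink. */
final class SharedStateCleaner {

    /**
     * Walks the listing one entry at a time and keeps the paths whose file
     * name is not referenced by the (simplified) registry.
     */
    static List<Path> selectUnreferenced(Iterable<Path> listing, Set<String> referenced) {
        List<Path> orphans = new ArrayList<>();
        for (Path file : listing) {
            if (!referenced.contains(file.getFileName().toString())) {
                orphans.add(file);
            }
        }
        return orphans;
    }

    /**
     * Lists sharedDir through a DirectoryStream, which yields entries
     * incrementally instead of materializing the whole listing first.
     */
    static List<Path> findOrphans(Path sharedDir, Set<String> referenced) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(sharedDir)) {
            return selectUnreferenced(stream, referenced);
        }
    }

    /** Deletes the given files on the executor and reports how many were removed. */
    static CompletableFuture<Integer> discardAsync(List<Path> orphans, Executor executor) {
        return CompletableFuture.supplyAsync(() -> {
            int deleted = 0;
            for (Path file : orphans) {
                try {
                    if (Files.deleteIfExists(file)) {
                        deleted++;
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return deleted;
        }, executor);
    }
}
```

A caller would pass the result of {{findOrphans}} to {{discardAsync}} together with an I/O executor, so that deletion does not block the thread that owns the registry. A real implementation would also have to avoid deleting files that are still being uploaded for a pending checkpoint, which this sketch does not attempt.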

For now, I would prefer to close this issue following the above discussion 
with [~StephanEwen]. We should focus on how to handle checkpoint exceptions 
better and fail on the JM's side when certain conditions are met.

> Refactor checkpoint directory layout
> ------------------------------------
>
>                 Key: FLINK-10930
>                 URL: https://issues.apache.org/jira/browse/FLINK-10930
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.8.0
>            Reporter: Yun Tang
>            Assignee: Yun Tang
>            Priority: Major
>             Fix For: 1.8.0
>
>
> The current checkpoint directory layout was introduced in FLINK-8531, with 
> three different scopes for tasks:
>  * *EXCLUSIVE* is for state that belongs to one checkpoint only: metadata 
> and operator state files.
>  * *SHARED* is for state that is possibly part of multiple checkpoints.
>  * *TASKOWNED* is for state that must never be dropped by the JobManager.
> {code:java}
> /user-defined-dir/{job-id}
>                   |
>                   +-- shared/
>                   +-- taskowned/
>                   +-- chk-1/      // metadata and operator-state files
>                   +-- chk-2/
>                     ...{code}
> If we retain just one completed checkpoint, only one exclusive directory, 
> i.e. the {{chk-id}} checkpoint directory, should be left. 
> However, as FLINK-10855 explains, the directories of failed/expired 
> checkpoints are also left behind. This is really confusing for users who [use 
> externalized checkpoints to resume 
> jobs|https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/state/checkpoints.html#resuming-from-a-retained-checkpoint],
>  not to mention the checkpoint directory resource leak. 
>  As far as I know, as long as the {{chk-id}} checkpoint directory still 
> contains the operator state files, there is no way to clean up the useless 
> {{chk-id}} checkpoint directory gracefully. Once the job manager disposes of 
> a failed/expired checkpoint, the target {{chk-id}} checkpoint directory is 
> deleted by the JM. However, this directory can also be created by tasks that 
> have not yet reported to the JM. When the {{checkpoint coordinator}} receives 
> the late reports from those tasks for the expired checkpoint, it discards the 
> useless handles. However, if the JM also tries to delete the empty parent 
> folder, which is already unsupported since FLINK-8540, another task uploading 
> operator state files would hit an exception because the parent directory of 
> its write target has just been removed. 
> Currently, we treat a task checkpoint failure as a task failure, so the whole 
> job fails over, which is not what we want.
> From what I see, I plan to split the *EXCLUSIVE* directory into two kinds of 
> exclusive directories: one is still the set of {{chk-id}} checkpoint 
> directories, each containing only its exclusive {{meta data}}; the other is a 
> single directory named {{exclusive}} that contains the operator state files. 
> Since operator state files are exclusive to one specific checkpoint, we could 
> also add the {{checkpoint-id}} to their file names so that users can easily 
> clean them up.
>  The refactored directory layout should be:
> {code:java}
> /user-defined-dir/{job-id}
>                   |
>                   +-- shared/
>                   +-- taskowned/
>                   +-- exclusive/    // operator state files
>                   +-- chk-1/        // metadata
>                   +-- chk-2/
>                     ...{code}
>  
> This new directory layout would not affect users who use externalized 
> checkpoints to resume jobs, since they would still just pass the 
> {{/user-defined-dir/job-id/chk-id}} path to resume a job.
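
The description above proposes embedding the {{checkpoint-id}} in the names of operator state files so that stale files in {{exclusive/}} are easy to identify. A minimal sketch of that idea follows. The issue does not specify a naming scheme, so the {{chk-<id>-<suffix>}} pattern, the class {{ExclusiveStateCleaner}}, and its method names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.OptionalLong;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Flink. */
final class ExclusiveStateCleaner {

    // Assumed naming scheme: "chk-<checkpoint-id>-<arbitrary suffix>".
    private static final Pattern NAME = Pattern.compile("chk-(\\d{1,18})-.+");

    /** Extracts the checkpoint id from a file name, if the name follows the assumed scheme. */
    static OptionalLong checkpointIdOf(String fileName) {
        Matcher m = NAME.matcher(fileName);
        return m.matches()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    /**
     * Returns the file names that belong to a checkpoint which is no longer retained.
     * Names that do not follow the assumed scheme are never selected.
     */
    static List<String> selectStale(Collection<String> fileNames, Set<Long> retainedIds) {
        List<String> stale = new ArrayList<>();
        for (String name : fileNames) {
            OptionalLong id = checkpointIdOf(name);
            if (id.isPresent() && !retainedIds.contains(id.getAsLong())) {
                stale.add(name);
            }
        }
        return stale;
    }
}
```

With only checkpoint 2 retained, a listing containing {{chk-1-aaa}}, {{chk-2-bbb}} and {{unrelated.tmp}} would select just {{chk-1-aaa}}. Files of a checkpoint that is still in progress would need to be treated as retained, which the caller has to ensure.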



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
