[jira] [Commented] (FLINK-13633) Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of high-availability storage

Yang Wang (Jira) Tue, 03 Sep 2019 01:12:23 -0700


    [ https://issues.apache.org/jira/browse/FLINK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921239#comment-16921239 ]


Yang Wang commented on FLINK-13633:
-----------------------------------

[~azagrebin]

I have submitted a PR. Could you please help review it?

Thanks.

> Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of  
> high-availability storage
> -------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-13633
>                 URL: https://issues.apache.org/jira/browse/FLINK-13633
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: Yang Wang
>            Priority: Major
>
> Currently, when high availability is enabled, the HA storage directory is 
> structured as shown below. The submittedJobGraph and completedCheckpoint 
> directories are stored directly under the HA storage path. This is fine when 
> the Flink cluster finishes normally. However, when the Yarn application fails 
> or is killed, the submittedJobGraph and completedCheckpoint files remain there 
> forever, and we cannot even tell which Flink cluster (Yarn application) they 
> belong to. So I suggest moving them into an application (cluster-id) 
> subdirectory. External tools could then be used to clean up these residual files.
> Also, we need to do a best-effort clean-up before the Flink cluster finishes. 
> Current HA storage directory structure:
> {code:java}
> └── <high-availability.storageDir>
>     ├── submittedJobGraph
>     │   ├── <jobgraph1> (randomly named)
>     │   └── <jobgraph2> (randomly named)
>     ├── completedCheckpoint
>     │   ├── <checkpoint1> (randomly named)
>     │   ├── <checkpoint2> (randomly named)
>     │   └── <checkpoint3> (randomly named)
>     └── <high-availability.cluster-id>
>         └── blob
>             └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
> {code}
>  
> New HA storage directory structure:
> {code:java}
> └── <high-availability.storageDir>
>     └── <high-availability.cluster-id>
>         ├── submittedJobGraph
>         │   ├── <jobgraph1> (randomly named)
>         │   └── <jobgraph2> (randomly named)
>         ├── completedCheckpoint
>         │   ├── <checkpoint1> (randomly named)
>         │   ├── <checkpoint2> (randomly named)
>         │   └── <checkpoint3> (randomly named)
>         └── blob
>             └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
> {code}
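
To illustrate the kind of external clean-up tool the description mentions, here is a minimal sketch in Python. It is not part of Flink or of the PR. It assumes the proposed layout (one subdirectory per cluster-id directly under the storage directory), a storage directory reachable as a local path (real deployments typically use HDFS or S3, which would need a different client), and a set of still-active cluster ids obtained elsewhere, for example from the Yarn ResourceManager. The function names and the `application_000x` ids are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def find_stale_cluster_dirs(storage_dir, active_cluster_ids):
    """Return cluster-id subdirectories that no active cluster claims."""
    root = Path(storage_dir)
    return sorted(
        child for child in root.iterdir()
        if child.is_dir() and child.name not in active_cluster_ids
    )


def clean_up(storage_dir, active_cluster_ids, dry_run=True):
    """Delete stale cluster-id subdirectories; with dry_run set, only report them."""
    stale = find_stale_cluster_dirs(storage_dir, active_cluster_ids)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return [path.name for path in stale]


def build_demo_storage():
    """Create a throwaway local tree that mimics the proposed layout (demo only)."""
    root = Path(tempfile.mkdtemp(prefix="ha-storage-"))
    for cluster_id in ("application_0001", "application_0002"):
        for sub in ("submittedJobGraph", "completedCheckpoint", "blob"):
            (root / cluster_id / sub).mkdir(parents=True)
    return root


if __name__ == "__main__":
    demo_root = build_demo_storage()
    # Assume only application_0002 is still running.
    print(clean_up(demo_root, {"application_0002"}, dry_run=False))
    shutil.rmtree(demo_root)
```

Such a tool is only safe with the new layout. Under the current layout, the top-level submittedJobGraph and completedCheckpoint directories are shared by all clusters, so their contents cannot be attributed to a cluster-id.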



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
