[jira] [Updated] (FLINK-13633) Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of high-availability storage

TisonKun (JIRA) Wed, 07 Aug 2019 04:17:08 -0700


     [ https://issues.apache.org/jira/browse/FLINK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


TisonKun updated FLINK-13633:
-----------------------------
    Description: 
Currently, when high availability is enabled, the HA storage directory is 
laid out as shown below. The submittedJobGraph and completedCheckpoint are 
stored directly under the HA storage path. This is fine when the Flink 
cluster finishes normally. However, when the Yarn application fails or is 
killed, the submittedJobGraph and completedCheckpoint stay there forever, 
and we cannot even tell which Flink cluster (Yarn application) they belong to. 
So I suggest moving them into the application subdirectory. External tools 
could then be used to clean up these residual files.
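
As a rough illustration of what such an external tool might look like, here 
is a minimal sketch. It is not part of Flink: the class name HaResidueCleaner, 
the method cleanResidualClusters and the idea of passing in the set of still 
active cluster ids are all assumptions made for this example. It uses 
java.nio.file as a stand-in for whatever file system actually backs the HA 
storage directory, and it assumes the per-cluster layout proposed in this 
issue. Under the current layout it would also delete the shared 
submittedJobGraph and completedCheckpoint directories, so it must not be run 
against it.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical external clean-up tool; a sketch only, not Flink code. */
public class HaResidueCleaner {

    /**
     * Deletes every direct subdirectory of haRoot whose name is not in
     * activeClusterIds and returns the sorted names of the removed directories.
     * Assumes each direct subdirectory of haRoot is named after a cluster id.
     */
    public static List<String> cleanResidualClusters(Path haRoot, Set<String> activeClusterIds)
            throws IOException {
        // Collect the candidates first so nothing is deleted while the listing is open.
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(haRoot)) {
            for (Path child : children) {
                String name = child.getFileName().toString();
                if (Files.isDirectory(child) && !activeClusterIds.contains(name)) {
                    candidates.add(child);
                }
            }
        }

        List<String> removed = new ArrayList<>();
        for (Path dir : candidates) {
            deleteRecursively(dir);
            removed.add(dir.getFileName().toString());
        }
        Collections.sort(removed);
        return removed;
    }

    private static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

How the set of active cluster ids is obtained (for example from the Yarn 
ResourceManager) is deliberately left out of the sketch.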

Also, we need to do a best-effort clean-up before the Flink cluster finishes. 

Current HA storage directory structure:
{code:java}
└── <high-availability.storageDir>
    ├── submittedJobGraph
    │   ├── <jobgraph1> (randomly named)
    │   └── <jobgraph2> (randomly named)
    ├── completedCheckpoint
    │   ├── <checkpoint1> (randomly named)
    │   ├── <checkpoint2> (randomly named)
    │   └── <checkpoint3> (randomly named)
    └── <high-availability.cluster-id>
        └── blob
            └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
{code}
 

Proposed HA storage directory structure:
{code:java}
└── <high-availability.storageDir>
    └── <high-availability.cluster-id>
        ├── submittedJobGraph
        │   ├── <jobgraph1> (randomly named)
        │   └── <jobgraph2> (randomly named)
        ├── completedCheckpoint
        │   ├── <checkpoint1> (randomly named)
        │   ├── <checkpoint2> (randomly named)
        │   └── <checkpoint3> (randomly named)
        └── blob
            └── <blob1> (named as [no_job|job_<job-id>]/blob_<blob-key>)
{code}
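
To make the difference between the two layouts concrete, the following sketch 
computes the directories in both schemes. The class HaStorageLayout and its 
method names are hypothetical and exist only for this illustration. The 
directory names submittedJobGraph, completedCheckpoint and blob are taken from 
the trees above. This is not how Flink resolves these paths internally.

```java
import java.nio.file.Path;

/** Hypothetical helper contrasting the current and the proposed HA layout. */
public class HaStorageLayout {

    // Current layout: job graphs live directly under the HA storage directory.
    public static Path currentJobGraphDir(Path storageDir) {
        return storageDir.resolve("submittedJobGraph");
    }

    // Current layout: completed checkpoints live directly under the HA storage directory.
    public static Path currentCheckpointDir(Path storageDir) {
        return storageDir.resolve("completedCheckpoint");
    }

    // Proposed layout: job graphs are scoped by the cluster id.
    public static Path proposedJobGraphDir(Path storageDir, String clusterId) {
        return storageDir.resolve(clusterId).resolve("submittedJobGraph");
    }

    // Proposed layout: completed checkpoints are scoped by the cluster id.
    public static Path proposedCheckpointDir(Path storageDir, String clusterId) {
        return storageDir.resolve(clusterId).resolve("completedCheckpoint");
    }

    // Blobs are already scoped by the cluster id in both layouts.
    public static Path blobDir(Path storageDir, String clusterId) {
        return storageDir.resolve(clusterId).resolve("blob");
    }
}
```

With the proposed layout, everything that belongs to one cluster shares the 
single parent directory storageDir/clusterId, so removing that one directory 
removes all residual files of a failed or killed application.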

  was:
Currently, when high availability is enabled, the HA storage directory is 
laid out as shown below. The submittedJobGraph and completedCheckpoint are 
stored directly under the HA storage path. This is fine when the Flink 
cluster finishes normally. However, when the Yarn application fails or is 
killed, the submittedJobGraph and completedCheckpoint stay there forever, 
and we cannot even tell which Flink cluster (Yarn application) they belong to. 
So I suggest moving them into the application subdirectory. External tools 
could then be used to clean up these residual files.

Also, we need to do a best-effort clean-up before the Flink cluster finishes. 

 

Current ha storage directory structure
{code:java}
└── /tmp/flink/ha
    ├── submittedJobGraphxxxx
    ├── completedCheckpointxxxx
    ├── application_xxxx_xxxx
    │   ├── blob
{code}
 

The new ha storage directory structure
{code:java}
└── /tmp/flink/ha
    ├── application_xxxx_xxxx
    │   ├── blob
    │   ├── submittedJobGraphxxxx
    │   ├── completedCheckpointxxxx
{code}
 

 


> Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of  
> high-availability storage
> -------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-13633
>                 URL: https://issues.apache.org/jira/browse/FLINK-13633
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Yang Wang
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
