[ 
https://issues.apache.org/jira/browse/FLINK-34557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tanliang updated FLINK-34557:
-----------------------------
    Description: 
In Flink 1.16.2 we submit all jobs to YARN in application mode. However, in 
several situations the HA Znode and some files on HDFS are not deleted after 
the job stops. They should be cleaned up once the job ends, otherwise the 
leftovers gradually occupy resources. These are the situations I have 
encountered:
 # After the job is submitted to the cluster, a conflicting or missing jar 
causes YARN to restart it several times until it finally fails. Afterwards the 
Znode still exists, and files named after the corresponding application id 
remain under the '/.flink' and '/flink/recovery' directories on HDFS;
 # When the job is killed with the yarn kill command, it ends immediately with 
final state KILLED, and the same residue as in case 1 is left behind;
 # When the JobManager container loses its connection to ZooKeeper (the cause 
of the disconnection is out of scope here), the job fails and is restarted by 
YARN. After the last disconnection the job finally ends, and the same residue 
as above appears;

!image-2024-03-01-15-38-48-396.png|width=877,height=171!

!image-2024-03-01-15-39-13-953.png|width=882,height=174!

 

!image-2024-03-01-15-39-39-524.png|width=1001,height=67!
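For reference, the leftover locations described above can be derived from the YARN application id. A minimal sketch (the exact paths are assumptions based on the directories named in this report and on default Flink HA settings; adjust for your cluster):

```python
def residue_paths(app_id):
    """Candidate locations to inspect for residue after an
    application-mode job ends abnormally. Paths follow the
    directories named in this report; your cluster's staging
    directory and HA root may be configured differently."""
    return {
        # staging directory with the shipped jars and config
        "hdfs_staging": f"/.flink/{app_id}",
        # HA recovery data (blobs, checkpoint metadata)
        "hdfs_ha_storage": f"/flink/recovery/{app_id}",
        # HA znode, assuming the default ZooKeeper root '/flink'
        "zk_znode": f"/flink/{app_id}",
    }
```

Listing these paths after the application reaches a terminal state shows whether the cleanup actually ran.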

 
*Add:*
After consulting the community and other colleagues, we found that the issue 
of Znodes not being deleted had been raised before; it was addressed by adding 
the closeAndCleanupAllData method, which deletes the HA data uniformly when a 
highly available cluster shuts down. However, in the situations above, file 
and data residue still remains. In particular, in the yarn kill case, Flink 
already warns in the console logs at submission time that HDFS files may be 
left behind; I do not understand why the community kept this behavior instead 
of improving it. In any case, we believe Znode residue should not exist: 
regardless of the job's final status, the HA data must be cleaned up once the 
job stops.
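Until the cleanup is handled by Flink itself, the residue has to be removed by hand. A hedged sketch of the commands an operator might run after confirming the application has really finished (the paths and the ZooKeeper root are assumptions, as above; the function only builds the command strings and does not execute anything):

```python
def cleanup_commands(app_id, zk_root="/flink"):
    """Build (but do not run) manual cleanup commands for one
    finished application. Verify each path exists and belongs to
    the finished application before executing anything."""
    return [
        # remove the staging directory with the shipped jars/config
        f"hdfs dfs -rm -r -skipTrash /.flink/{app_id}",
        # remove the HA recovery data for this application
        f"hdfs dfs -rm -r -skipTrash /flink/recovery/{app_id}",
        # delete the leftover HA znode (zkCli.sh, ZooKeeper 3.5+)
        f"zkCli.sh deleteall {zk_root}/{app_id}",
    ]
```

Keeping this as a dry-run generator rather than calling the tools directly makes it easy to review the exact paths before anything is deleted.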


> When the Flink task ends in application mode, there may be issues with the 
> Znode and HDFS files not being deleted
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-34557
>                 URL: https://issues.apache.org/jira/browse/FLINK-34557
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN, Runtime / Task
>    Affects Versions: 1.17.0, 1.16.2
>            Reporter: tanliang
>            Priority: Major
>         Attachments: image-2024-03-01-15-38-48-396.png, 
> image-2024-03-01-15-39-13-953.png, image-2024-03-01-15-39-39-524.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
