[ 
https://issues.apache.org/jira/browse/FLINK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yumeng Zhang updated FLINK-12183:
---------------------------------
    Description: 
The per-job Yarn cluster doesn't stop after a running job is cancelled if the 
job has restarted many times (for example, 1000 times) in a short period.

The bug is in the archiveExecutionGraph() phase, which runs before 
removeJobAndRegisterTerminationFuture(). The CompletableFuture thread exits 
unexpectedly with a NullPointerException during archiveExecutionGraph(). This 
is hard to spot because only IOException is caught there. In 
SubtaskExecutionAttemptDetailsHandler and 
SubtaskExecutionAttemptAccumulatorsHandler, archiveJsonWithPath() builds JSON 
for the prior execution attempts, but its for loop starts at index 0, and that 
index may refer to an attempt that has already been dropped. By default, 
looking up a dropped prior execution attempt returns null 
(AccessExecution attempt = subtask.getPriorExecutionAttempt(x)).

  was:
The per-job Yarn cluster doesn't release resources after a running job is 
cancelled if the job has restarted many times (for example, 1000 times) in a 
short period.

The bug is in the archiveExecutionGraph() phase, which runs before 
removeJobAndRegisterTerminationFuture(). The CompletableFuture thread exits 
unexpectedly with a NullPointerException during archiveExecutionGraph(). This 
is hard to spot because only IOException is caught there. In 
SubtaskExecutionAttemptDetailsHandler and 
SubtaskExecutionAttemptAccumulatorsHandler, archiveJsonWithPath() builds JSON 
for the prior execution attempts, but its for loop starts at index 0, and that 
index may refer to an attempt that has already been dropped. By default, 
looking up a dropped prior execution attempt returns null 
(AccessExecution attempt = subtask.getPriorExecutionAttempt(x)).


> Job Cluster doesn't stop after cancel a running job in per-job Yarn mode
> ------------------------------------------------------------------------
>
>                 Key: FLINK-12183
>                 URL: https://issues.apache.org/jira/browse/FLINK-12183
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Yumeng Zhang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The per-job Yarn cluster doesn't stop after a running job is cancelled if 
> the job has restarted many times (for example, 1000 times) in a short period.
> The bug is in the archiveExecutionGraph() phase, which runs before 
> removeJobAndRegisterTerminationFuture(). The CompletableFuture thread exits 
> unexpectedly with a NullPointerException during archiveExecutionGraph(). 
> This is hard to spot because only IOException is caught there. In 
> SubtaskExecutionAttemptDetailsHandler and 
> SubtaskExecutionAttemptAccumulatorsHandler, archiveJsonWithPath() builds 
> JSON for the prior execution attempts, but its for loop starts at index 0, 
> and that index may refer to an attempt that has already been dropped. By 
> default, looking up a dropped prior execution attempt returns null 
> (AccessExecution attempt = subtask.getPriorExecutionAttempt(x)).
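
The failure mode described above can be illustrated with a minimal, standalone Java sketch. This is not Flink code: the Subtask and Attempt classes, archiveUnsafe(), archiveSafe(), and archiveAsync() are hypothetical stand-ins, and the null check is only one possible guard, not necessarily the fix adopted for this issue. The sketch assumes that dropped attempts are represented as null and that the cleanup step runs after archiving inside the same asynchronous task.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicBoolean;

final class PriorAttemptArchivingSketch {

    /** Hypothetical stand-in for an execution attempt. */
    static final class Attempt {
        private final int number;
        Attempt(int number) { this.number = number; }
        int getAttemptNumber() { return number; }
    }

    /** Hypothetical subtask whose oldest attempts may have been dropped (null). */
    static final class Subtask {
        private final Attempt[] priorAttempts;
        Subtask(Attempt... priorAttempts) { this.priorAttempts = priorAttempts; }
        int currentAttemptNumber() { return priorAttempts.length; }
        Attempt getPriorExecutionAttempt(int x) { return priorAttempts[x]; }
    }

    /** Mirrors the reported pattern: loop from 0 and dereference without a null check. */
    static List<String> archiveUnsafe(Subtask subtask) throws IOException {
        List<String> archived = new ArrayList<>();
        for (int x = 0; x < subtask.currentAttemptNumber(); x++) {
            Attempt attempt = subtask.getPriorExecutionAttempt(x);
            // Throws NullPointerException when attempt x was dropped.
            archived.add("attempt-" + attempt.getAttemptNumber());
        }
        return archived;
    }

    /** Same loop, but dropped attempts are skipped. */
    static List<String> archiveSafe(Subtask subtask) {
        List<String> archived = new ArrayList<>();
        for (int x = 0; x < subtask.currentAttemptNumber(); x++) {
            Attempt attempt = subtask.getPriorExecutionAttempt(x);
            if (attempt != null) {
                archived.add("attempt-" + attempt.getAttemptNumber());
            }
        }
        return archived;
    }

    /** Archives asynchronously and catches only IOException, as in the report. */
    static CompletableFuture<Void> archiveAsync(Subtask subtask, Runnable cleanup) {
        return CompletableFuture.runAsync(() -> {
            try {
                archiveUnsafe(subtask);
            } catch (IOException e) {
                System.err.println("Archiving failed: " + e);
            }
            cleanup.run(); // Skipped when an unchecked exception escapes above.
        });
    }

    public static void main(String[] args) {
        // Attempts 0 and 1 were dropped; attempts 2 and 3 are still available.
        Subtask subtask = new Subtask(null, null, new Attempt(2), new Attempt(3));

        System.out.println(archiveSafe(subtask)); // prints [attempt-2, attempt-3]

        AtomicBoolean cleanedUp = new AtomicBoolean(false);
        CompletableFuture<Void> future = archiveAsync(subtask, () -> cleanedUp.set(true));
        try {
            future.join();
        } catch (CompletionException e) {
            // The NullPointerException only surfaces if someone inspects the future.
            System.out.println("Async archiving failed with: " + e.getCause().getClass().getName());
        }
        System.out.println("Cleanup ran: " + cleanedUp.get());
    }
}
```

In this sketch the NullPointerException bypasses the IOException handler, the future completes exceptionally, and the cleanup step never runs. That matches the symptom reported here, where the cluster keeps running after the job is cancelled.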



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
