[
https://issues.apache.org/jira/browse/FLINK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827901#comment-16827901
]
lamber-ken commented on FLINK-12183:
------------------------------------
[~Yumeng], hi, what a coincidence! We hit the same problem. I'm sorry for
creating a duplicate issue. When I created FLINK-12247, I had only checked the
latest SubtaskExecutionAttemptDetailsHandler and
SubtaskExecutionAttemptAccumulatorsHandler on GitHub's master branch and found
that this problem exists there too.
*First,* this problem has been bothering us for a long time; it has existed
from Flink 1.3.2 through Flink 1.6.3. As you said, it's hard to find, so I
also used a UML diagram to describe the flow in
https://issues.apache.org/jira/browse/FLINK-12219.
*Second,* at the beginning I solved this problem in a way similar to your
patch, but I don't think it's ideal, because it breaks the interface and
[ExecutionVertex.java|https://github.com/apache/flink/pull/8163/files#diff-52349a7928cbb1217a0704390cedbee3]
has no need to implement it. BTW, {color:#660e7a}priorExecutions{color} is an
EvictingBoundedList, so whether an element exists can only be judged by its
index. So simply skipping null values seems more appropriate to me.
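To make the null-skipping idea concrete, here is a minimal sketch. The class, method, and data below are hypothetical stand-ins (the real code lives in the subtask handlers and uses AccessExecution/EvictingBoundedList); the point is only that evicted slots come back as null and should be skipped rather than dereferenced:

```java
import java.util.ArrayList;
import java.util.List;

public class PriorAttemptArchiver {

    // Hypothetical stand-in for iterating getPriorExecutionAttempt(i):
    // an EvictingBoundedList returns null for indices that were evicted,
    // so archiving code must tolerate null entries instead of hitting an NPE.
    static List<String> archivePriorAttempts(String[] priorExecutions) {
        List<String> archived = new ArrayList<>();
        for (int i = 0; i < priorExecutions.length; i++) {
            String attempt = priorExecutions[i]; // may be null if evicted
            if (attempt == null) {
                continue; // skip evicted entries rather than dereferencing null
            }
            archived.add("attempt-" + i + ":" + attempt);
        }
        return archived;
    }

    public static void main(String[] args) {
        // indices 0 and 1 were evicted by the bounded list; only 2 and 3 remain
        String[] prior = {null, null, "FAILED", "CANCELED"};
        System.out.println(archivePriorAttempts(prior));
    }
}
```

This keeps the existing interface untouched: only the consumer of the list changes, not ExecutionVertex.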
*Third,* to prevent an unexpected RuntimeException, we should move
`jobTerminationFuture.complete` into a finally block:
{code:java}
protected void jobReachedGloballyTerminalState(ArchivedExecutionGraph archivedExecutionGraph) {
    try {
        super.jobReachedGloballyTerminalState(archivedExecutionGraph);
    } catch (Exception e) {
        log.error("jobReachedGloballyTerminalState exception", e);
    } finally {
        if (executionMode == ClusterEntrypoint.ExecutionMode.DETACHED) {
            // shut down since we don't have to wait for the execution result retrieval
            jobTerminationFuture.complete(ApplicationStatus.fromJobStatus(archivedExecutionGraph.getState()));
        }
    }
}
{code}
> Job Cluster doesn't stop after cancel a running job in per-job Yarn mode
> ------------------------------------------------------------------------
>
> Key: FLINK-12183
> URL: https://issues.apache.org/jira/browse/FLINK-12183
> Project: Flink
> Issue Type: Bug
> Components: Runtime / REST
> Affects Versions: 1.6.4, 1.7.2, 1.8.0
> Reporter: Yumeng Zhang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The per-job Yarn cluster doesn't stop after canceling a running job if the
> job restarted many times (e.g. 1000 times) in a short time.
> The bug is in the archiveExecutionGraph() phase, before
> removeJobAndRegisterTerminationFuture() is executed. The CompletableFuture
> thread exits unexpectedly with a NullPointerException in the
> archiveExecutionGraph() phase. It's hard to find because only IOException is
> caught there. In SubtaskExecutionAttemptDetailsHandler and
> SubtaskExecutionAttemptAccumulatorsHandler, when the archiveJsonWithPath()
> method is called, it constructs some JSON information about prior execution
> attempts, but the index starts from 0, which might be a dropped index for
> the for loop. By default, it returns null when trying to get the prior
> execution attempt (AccessExecution attempt =
> subtask.getPriorExecutionAttempt(x)).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)