[ https://issues.apache.org/jira/browse/FLINK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827901#comment-16827901 ]

lamber-ken edited comment on FLINK-12183 at 4/28/19 10:29 AM:
--------------------------------------------------------------

[~Yumeng], hi, what a coincidence! We hit the same problem, and I'm sorry for 
creating a duplicate issue. When I created FLINK-12247, I only checked the 
latest SubtaskExecutionAttemptDetailsHandler and 
SubtaskExecutionAttemptAccumulatorsHandler in GitHub's master branch and found 
that this problem also exists there.

*First,* this problem has been bothering us for a long time; it has existed 
from flink-1.3.2 all the way to flink-1.6.3. As you say, it's hard to find, so 
I also used UML to describe the flow in 
https://issues.apache.org/jira/browse/FLINK-12219.

*Second,* at the beginning I solved this problem in the same way as your patch, 
but I don't think it's ideal, because it breaks the interface and 
[ExecutionVertex.java|https://github.com/apache/flink/pull/8163/files#diff-52349a7928cbb1217a0704390cedbee3]
 has no need to implement it. BTW, {color:#660e7a}priorExecutions{color} is an 
EvictingBoundedList, so checking by index whether an element still exists is 
very efficient. Generally, we only allow a streaming job to attempt a few 
times, not up to 1000 times, so simply skipping null values seems more 
appropriate to me.
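To illustrate the "skip null" idea: a minimal sketch, not Flink's actual handler code. The list below just stands in for priorExecutions (an EvictingBoundedList), where an evicted attempt reads back as null, and the loop skips it instead of dereferencing it and throwing a NullPointerException:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipEvictedAttempts {

    // Stand-in for iterating subtask.getPriorExecutionAttempt(i): an evicted
    // attempt comes back as null rather than a valid AccessExecution.
    static List<String> collectAttempts(List<String> priorExecutions) {
        List<String> archived = new ArrayList<>();
        for (int i = 0; i < priorExecutions.size(); i++) {
            String attempt = priorExecutions.get(i);
            if (attempt == null) {
                // attempt was evicted from the bounded list: skip it
                continue;
            }
            archived.add(attempt);
        }
        return archived;
    }

    public static void main(String[] args) {
        // attempt 0 was evicted; attempt 1 is still retained
        List<String> prior = Arrays.asList(null, "attempt-1");
        System.out.println(collectAttempts(prior)); // prints [attempt-1]
    }
}
{code}

This keeps the null-handling local to the handler loop, so no interface change is needed.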

*Third,* to prevent an unexpected RuntimeException, we should move 
`jobTerminationFuture.complete` into a finally block:

 
{code:java}
protected void jobReachedGloballyTerminalState(ArchivedExecutionGraph archivedExecutionGraph) {
    try {
        super.jobReachedGloballyTerminalState(archivedExecutionGraph);
    } catch (Exception e) {
        log.error("jobReachedGloballyTerminalState exception", e);
    } finally {
        if (executionMode == ClusterEntrypoint.ExecutionMode.DETACHED) {
            // shut down since we don't have to wait for the execution result retrieval
            jobTerminationFuture.complete(ApplicationStatus.fromJobStatus(archivedExecutionGraph.getState()));
        }
    }
}
{code}
 

 


> Job Cluster doesn't stop after cancel a running job in per-job Yarn mode
> ------------------------------------------------------------------------
>
>                 Key: FLINK-12183
>                 URL: https://issues.apache.org/jira/browse/FLINK-12183
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Yumeng Zhang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The per-job Yarn cluster doesn't stop after canceling a running job if the 
> job has restarted many times, e.g. 1000 times, in a short time.
> The bug is in the archiveExecutionGraph() phase, before 
> removeJobAndRegisterTerminationFuture() is executed. The CompletableFuture 
> thread exits unexpectedly with a NullPointerException in the 
> archiveExecutionGraph() phase. It's hard to find because only IOException is 
> caught there. In SubtaskExecutionAttemptDetailsHandler and 
> SubtaskExecutionAttemptAccumulatorsHandler, the archiveJsonWithPath() method 
> constructs some JSON information about prior execution attempts, but the for 
> loop starts at index 0, which might be an index that has already been 
> dropped. By default, getting such a prior execution attempt (AccessExecution 
> attempt = subtask.getPriorExecutionAttempt(x)) returns null.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
