[ https://issues.apache.org/jira/browse/FLINK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827982#comment-16827982 ]

Yumeng Zhang commented on FLINK-12183:
--------------------------------------

lamber-ken, I agree with you: it's not a good idea to break the interface. But as I 
said, simply skipping null values seems a little odd, because it hides why the null 
values are there in the first place. 
Regarding the number of attempts: in general we should not let a streaming job retry 
many times. But consider this case: if a streaming job hits a bad record it cannot 
handle, and checkpointing is enabled but no restart strategy is configured, it will 
try to recover again and again until someone notices. The number of attempts depends 
on how soon we notice, so we cannot guarantee it stays at just a few. Checking the 
elements one by one in EvictingBoundedList may be efficient enough, but I think the 
best way to solve this null pointer problem is to keep the interface intact and, at 
the same time, compute the correct start index.
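To make the start-index point concrete, here is a minimal sketch (not Flink's actual EvictingBoundedList; `BoundedHistory`, `firstRetainedIndex`, and the attempt strings are all illustrative stand-ins) of a bounded ring buffer that returns null for evicted indices. Iterating prior attempts from 0, as the handlers currently do, dereferences those nulls; starting from the first retained index avoids the NPE without changing the interface:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for an evicting bounded list: keeps only the most
// recent `capacity` elements, but get(i) still accepts any historical
// index i < size(), returning null for entries that were evicted --
// mirroring the behavior described in this ticket.
class BoundedHistory<T> {
    private final Object[] ring;
    private int count; // total elements ever added

    BoundedHistory(int capacity) {
        this.ring = new Object[capacity];
    }

    void add(T value) {
        ring[count % ring.length] = value;
        count++;
    }

    int size() {
        return count;
    }

    // Oldest index still retained; everything below it was evicted.
    int firstRetainedIndex() {
        return Math.max(0, count - ring.length);
    }

    @SuppressWarnings("unchecked")
    T get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException(String.valueOf(index));
        }
        // Evicted entries come back as null instead of throwing.
        return index < firstRetainedIndex() ? null : (T) ring[index % ring.length];
    }
}

public class PriorAttemptsDemo {
    public static void main(String[] args) {
        BoundedHistory<String> attempts = new BoundedHistory<>(16);
        for (int i = 0; i < 1000; i++) {
            attempts.add("attempt-" + i);
        }

        // Buggy pattern: looping from index 0 hits evicted entries, so
        // attempts.get(0) is null and any dereference of it would NPE.

        // Fixed pattern: start from the first retained index instead.
        List<String> archived = new ArrayList<>();
        for (int i = attempts.firstRetainedIndex(); i < attempts.size(); i++) {
            archived.add(attempts.get(i)); // never null in this range
        }
        System.out.println(archived.size());  // 16
        System.out.println(archived.get(0));  // attempt-984
    }
}
```

With 1000 attempts and capacity 16, indices 0–983 are evicted; the corrected loop archives only the 16 retained attempts and never sees a null.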

> Job Cluster doesn't stop after cancel a running job in per-job Yarn mode
> ------------------------------------------------------------------------
>
>                 Key: FLINK-12183
>                 URL: https://issues.apache.org/jira/browse/FLINK-12183
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Yumeng Zhang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The per-job YARN cluster doesn't stop after cancelling a running job if the job 
> has restarted many times (e.g. 1000 times) in a short period.
> The bug is in the archiveExecutionGraph() phase, before 
> removeJobAndRegisterTerminationFuture() is executed: the CompletableFuture 
> thread exits unexpectedly with a NullPointerException during 
> archiveExecutionGraph(). This is hard to spot because only IOException is 
> caught there. In SubtaskExecutionAttemptDetailsHandler and 
> SubtaskExecutionAttemptAccumulatorsHandler, when archiveJsonWithPath() is 
> called, it constructs JSON information about prior execution attempts, but the 
> loop index starts from 0, which may refer to attempts that have already been 
> evicted. By default, fetching such a prior execution attempt returns null 
> (AccessExecution attempt = subtask.getPriorExecutionAttempt(x)).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
