[
https://issues.apache.org/jira/browse/FLINK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827901#comment-16827901
]
lamber-ken commented on FLINK-12183:
------------------------------------
[~Yumeng], hi, what a coincidence! We hit the same problem. I'm sorry for
creating a duplicate issue. When I created FLINK-12247, I had only checked the
latest SubtaskExecutionAttemptDetailsHandler and
SubtaskExecutionAttemptAccumulatorsHandler on GitHub's master branch and found
that this problem exists there too.
*First,* this problem has been bothering us for a long time; it has existed
from Flink 1.3.2 through Flink 1.6.3. As you said, it's hard to find, so I
also used a UML diagram to describe the flow in
https://issues.apache.org/jira/browse/FLINK-12219.
*Second,* at the beginning I solved this problem in a way similar to your
patch, but I don't think it's ideal, because it breaks the interface and
[ExecutionVertex.java|https://github.com/apache/flink/pull/8163/files#diff-52349a7928cbb1217a0704390cedbee3]
has no need to implement it. BTW, {color:#660e7a}priorExecutions{color} is an
EvictingBoundedList, so whether an element exists can only be judged by its
index. So simply skipping null values seems more appropriate to me.
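To make the null-skipping idea concrete, here is a minimal sketch. The class, method, and data below are hypothetical stand-ins (the real code lives in the subtask handlers and uses AccessExecution/EvictingBoundedList); the point is only that evicted slots come back as null and should be skipped rather than dereferenced:

```java
import java.util.ArrayList;
import java.util.List;

public class PriorAttemptArchiver {

    // Hypothetical stand-in for iterating getPriorExecutionAttempt(i):
    // an EvictingBoundedList returns null for indices that were evicted,
    // so archiving code must tolerate null entries instead of hitting an NPE.
    static List<String> archivePriorAttempts(String[] priorExecutions) {
        List<String> archived = new ArrayList<>();
        for (int i = 0; i < priorExecutions.length; i++) {
            String attempt = priorExecutions[i]; // may be null if evicted
            if (attempt == null) {
                continue; // skip evicted entries rather than dereferencing null
            }
            archived.add("attempt-" + i + ":" + attempt);
        }
        return archived;
    }

    public static void main(String[] args) {
        // indices 0 and 1 were evicted by the bounded list; only 2 and 3 remain
        String[] prior = {null, null, "FAILED", "CANCELED"};
        System.out.println(archivePriorAttempts(prior));
    }
}
```

This keeps the existing interface untouched: only the consumer of the list changes, not ExecutionVertex.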
*Third,* to prevent an unexpected RuntimeException, we should move
`jobTerminationFuture.complete` into a finally block:
{code:java}
protected void jobReachedGloballyTerminalState(ArchivedExecutionGraph archivedExecutionGraph) {
    try {
        super.jobReachedGloballyTerminalState(archivedExecutionGraph);
    } catch (Exception e) {
        log.error("jobReachedGloballyTerminalState exception", e);
    } finally {
        if (executionMode == ClusterEntrypoint.ExecutionMode.DETACHED) {
            // shut down since we don't have to wait for the execution result retrieval
            jobTerminationFuture.complete(ApplicationStatus.fromJobStatus(archivedExecutionGraph.getState()));
        }
    }
}
{code}
> Job Cluster doesn't stop after cancel a running job in per-job Yarn mode
> ------------------------------------------------------------------------
>
> Key: FLINK-12183
> URL: https://issues.apache.org/jira/browse/FLINK-12183
> Project: Flink
> Issue Type: Bug
> Components: Runtime / REST
> Affects Versions: 1.6.4, 1.7.2, 1.8.0
> Reporter: Yumeng Zhang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The per-job Yarn cluster doesn't stop after canceling a running job if the
> job restarted many times (e.g. 1000 times) in a short time.
> The bug is in the archiveExecutionGraph() phase, before
> removeJobAndRegisterTerminationFuture() is executed. The CompletableFuture
> thread exits unexpectedly with a NullPointerException in the
> archiveExecutionGraph() phase. It's hard to find because only IOException is
> caught there. In SubtaskExecutionAttemptDetailsHandler and
> SubtaskExecutionAttemptAccumulatorsHandler, when the archiveJsonWithPath()
> method is called, it constructs some JSON information about prior execution
> attempts, but the index starts from 0, which might be a dropped index for
> the for loop. By default, it returns null when trying to get the prior
> execution attempt (AccessExecution attempt =
> subtask.getPriorExecutionAttempt(x)).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)