[
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849473#comment-16849473
]
Gary Yao commented on FLINK-12302:
----------------------------------
Hi [~lamber-ken],
Thanks for the update. If I understand correctly, after the job transitions
into a globally terminal state ({{FAILED}}), you kill the ApplicationMaster
(AM). Because you kill the AM, the application cannot be de-registered from
YARN. The new AM, which is brought up by YARN, sees in the
{{RunningJobsRegistry}} that the job is already in a terminal state, and we run
into the _"jobFinishedByOther()"_ code path [1]. Because the new AM currently
cannot know whether the job finished successfully or failed, we chose
{{UNKNOWN}} as Flink's internal application status, which in turn is mapped to
YARN's UNDEFINED final application status. Note that there are also other
places in the code where we use Flink's {{UNKNOWN}} application status [3]. The
bottom line is that I think your fix is not enough to consistently set the
application status. If your patch was applied, I think even successfully
finished jobs, could show up as {{FAILED}}.
I wonder how severe this issue is for you, and how often it occurs? It seems to
me that the AM has to be killed in a very specific moment in time to reproduce
the behavior. Please correct me if I am wrong.
Moreover, I wonder if using {{FinalApplicationStatus.UNDEFINED}} is wrong, per
se. The Javadoc for {{FinalApplicationStatus.UNDEFINED}} reads:
{noformat}
Undefined state when either the application has not yet finished
{noformat}
The term _"either"_ implies that there is an alternative meaning, i.e., it
looks like the original author of the Javadoc forgot to finish the sentence.
Also the fact that it can be set by the user, implies that it is a valid final
status.
[1]
[https://github.com/apache/flink/blob/58987dd16c7e8af36e935e811f716d2f843de5ca/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunner.java#L243]
[2]
[https://github.com/apache/flink/blob/58987dd16c7e8af36e935e811f716d2f843de5ca/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L494]
[3]
[https://github.com/apache/flink/blob/58987dd16c7e8af36e935e811f716d2f843de5ca/flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java#L226-L229]
> Fixed the wrong finalStatus of yarn application when application finished
> -------------------------------------------------------------------------
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / YARN
> Affects Versions: 1.8.0
> Reporter: lamber-ken
> Assignee: lamber-ken
> Priority: Minor
> Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml,
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png,
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log,
> jobmanager-2.log, screenshot-1.png, screenshot-2.png,
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> flink job(flink-1.6.3) failed in per-job yarn cluste mode, the
> resourcemanager of yarn rerun the job.
> when the job failed again, the application while finish, but the finalStatus
> is +UNDEFINED,+ It's better to show state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)