[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

Gary Yao (JIRA) Tue, 28 May 2019 01:33:08 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849473#comment-16849473
 ]


Gary Yao commented on FLINK-12302:
----------------------------------

Hi [~lamber-ken],

Thanks for the update. If I understand correctly, after the job transitions 
into a globally terminal state ({{FAILED}}), you kill the ApplicationMaster 
(AM). Because you kill the AM, the application cannot be de-registered from 
YARN. The new AM, which is brought up by YARN, sees in the 
{{RunningJobsRegistry}} that the job is already in a terminal state, and we run 
into the _"jobFinishedByOther()"_ code path [1]. Because the new AM currently 
cannot know whether the job finished successfully or failed, we chose 
{{UNKNOWN}} as Flink's internal application status, which in turn is mapped to 
YARN's {{UNDEFINED}} final application status [2]. Note that there are also other 
places in the code where we use Flink's {{UNKNOWN}} application status [3]. The 
bottom line is that I think your fix is not enough to consistently set the 
application status. If your patch were applied, I think even successfully 
finished jobs could show up as {{FAILED}}.
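
The chain described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the code at [1] or [2]: the enums and method names below are hypothetical stand-ins, and the {{CANCELED}} to {{KILLED}} mapping is an assumption made only to complete the example. The point is that a restarted AM learns only that the job is done, not how it ended, so the only honest status it can report is {{UNKNOWN}}, which then maps to {{UNDEFINED}}.

```java
import java.util.Optional;

// Hypothetical stand-ins; these are NOT Flink's or YARN's actual classes.
final class FinalStatusSketch {

    /** Stand-in for Flink's internal application status. */
    enum FlinkStatus { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    /** Stand-in for YARN's final application status. */
    enum YarnFinalStatus { UNDEFINED, SUCCEEDED, FAILED, KILLED }

    /** Stand-in for what a job registry records: no success or failure detail. */
    enum RegistryState { PENDING, RUNNING, DONE }

    /**
     * What a restarted AM can conclude from the registry alone.
     * DONE says the job reached a terminal state, but not which one,
     * so the result is UNKNOWN. For other states the AM would run the
     * job itself, so no status is available yet.
     */
    static Optional<FlinkStatus> statusAfterAmRestart(RegistryState state) {
        if (state == RegistryState.DONE) {
            return Optional.of(FlinkStatus.UNKNOWN);
        }
        return Optional.empty();
    }

    /** Assumed mapping from the internal status to the YARN-facing one. */
    static YarnFinalStatus toYarnStatus(FlinkStatus status) {
        switch (status) {
            case SUCCEEDED:
                return YarnFinalStatus.SUCCEEDED;
            case FAILED:
                return YarnFinalStatus.FAILED;
            case CANCELED:
                return YarnFinalStatus.KILLED; // assumption for this sketch
            default:
                return YarnFinalStatus.UNDEFINED; // covers UNKNOWN
        }
    }

    public static void main(String[] args) {
        FlinkStatus status = statusAfterAmRestart(RegistryState.DONE).get();
        System.out.println(status + " -> " + toYarnStatus(status));
    }
}
```

In this sketch, changing the default branch to return {{FAILED}} would also relabel a job that finished successfully before the AM was killed, because the restarted AM sees the same {{DONE}} state in both cases.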

How severe is this issue for you, and how often does it occur? It seems to 
me that the AM has to be killed at a very specific moment to reproduce 
the behavior. Please correct me if I am wrong.

Moreover, I wonder if using {{FinalApplicationStatus.UNDEFINED}} is wrong, per 
se. The Javadoc for {{FinalApplicationStatus.UNDEFINED}} reads:
{noformat}
Undefined state when either the application has not yet finished
{noformat}
The term _"either"_ implies that there is an alternative meaning, i.e., it 
looks like the original author of the Javadoc forgot to finish the sentence. 
Also, the fact that it can be set by the user implies that it is a valid final 
status.

[1] 
[https://github.com/apache/flink/blob/58987dd16c7e8af36e935e811f716d2f843de5ca/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobManagerRunner.java#L243]
 
[2] 
[https://github.com/apache/flink/blob/58987dd16c7e8af36e935e811f716d2f843de5ca/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L494]
[3] 
[https://github.com/apache/flink/blob/58987dd16c7e8af36e935e811f716d2f843de5ca/flink-runtime/src/main/java/org/apache/flink/runtime/entrypoint/ClusterEntrypoint.java#L226-L229]

> Fixed the wrong finalStatus of yarn application when application finished
> -------------------------------------------------------------------------
>
>                 Key: FLINK-12302
>                 URL: https://issues.apache.org/jira/browse/FLINK-12302
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.8.0
>            Reporter: lamber-ken
>            Assignee: lamber-ken
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>         Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> YARN ResourceManager reran the job.
> When the job fails again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It would be better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
