[
https://issues.apache.org/jira/browse/FLINK-33483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784007#comment-17784007
]
Xin Chen edited comment on FLINK-33483 at 11/8/23 12:06 PM:
------------------------------------------------------------
But in another scenario from production practice, UNDEFINED also appears. The JobManager (jm) log can
be found in the attached file [^container_e15_1693914709123_8498_01_000001_8042], but
I have not been able to fully reproduce this scenario. The key information in the log is:
{code:java}
15:00:57.657 State change: SUSPENDED
Connection to ZooKeeper suspended, waiting for reconnection.
15:00:54.754 org.apache.flink.util.FlinkException: ResourceManager
leader changed to new address null
15:00:54.759 Job DataDistribution$ (281592085ed7f391ab59b83a53c40db3)
switched from state RUNNING to RESTARTING.
15:00:54.771 Job DataDistribution$ (281592085ed7f391ab59b83a53c40db3)
switched from state RESTARTING to SUSPENDED.
org.apache.flink.util.FlinkException: JobManager is no longer the leader.
Unable to canonicalize address zookeeper:2181 because it's not resolvable.
15:00:55.694 closing socket connection and attempting reconnect
15:00:57.657 State change: RECONNECTED
15:00:57.739 Connection to ZooKeeper was reconnected. Leader retrieval can be
restarted.
15:00:57.740 Connection to ZooKeeper was reconnected. Leader election can be
restarted.
15:00:57.741 Job 281592085ed7f391ab59b83a53c40db3 was not finished by
JobManager.
15:00:57.742 Shutting down cluster because job not finished
15:00:57.742 Shutting YarnJobClusterEntrypoint down with application status
UNKNOWN. Diagnostics null.
{code}
From the logs, it can be seen that the ZooKeeper connection was lost for a few
seconds. During the disconnection, the rm (ResourceManager) was affected and
the Flink job was suspended while the client attempted to reconnect to ZooKeeper.
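For context, here is a minimal sketch of the kind of mapping that turns an internal UNKNOWN application status into YARN's UNDEFINED final status. The enum and method names below are illustrative only, not Flink's actual classes:

```java
// Hypothetical sketch (illustrative names, not Flink's real source): an
// internal application status with an UNKNOWN member, and its translation
// to the final status that the YARN web UI displays.
public class StatusMappingSketch {

    // Mirrors the idea of an internal application status at shutdown.
    enum AppStatus { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    // Mirrors YARN's FinalApplicationStatus values.
    enum YarnFinalStatus { SUCCEEDED, FAILED, KILLED, UNDEFINED }

    static YarnFinalStatus toYarnStatus(AppStatus status) {
        switch (status) {
            case SUCCEEDED: return YarnFinalStatus.SUCCEEDED;
            case FAILED:    return YarnFinalStatus.FAILED;
            case CANCELED:  return YarnFinalStatus.KILLED;
            default:
                // UNKNOWN falls through here: the cluster entrypoint shut
                // down without knowing the job's outcome, so YARN shows
                // final status UNDEFINED.
                return YarnFinalStatus.UNDEFINED;
        }
    }

    public static void main(String[] args) {
        // The shutdown in the attached log used application status UNKNOWN.
        System.out.println(toYarnStatus(AppStatus.UNKNOWN)); // prints UNDEFINED
    }
}
```

Under this reading, the "Shutting YarnJobClusterEntrypoint down with application status UNKNOWN" line in the log is exactly the path that ends in UNDEFINED on the YARN page.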
> Why is “UNDEFINED” defined in the Flink task status?
> ----------------------------------------------------
>
> Key: FLINK-33483
> URL: https://issues.apache.org/jira/browse/FLINK-33483
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / RPC, Runtime / Task
> Affects Versions: 1.12.2
> Reporter: Xin Chen
> Priority: Major
> Attachments: container_e15_1693914709123_8498_01_000001_8042
>
>
> In Flink on YARN mode, if an unknown status appears in the Flink log, the
> jm (JobManager) reports the task status as undefined. The YARN page then
> displays the state as FINISHED, but the final status as *UNDEFINED*. From a
> business perspective, it is unclear whether the task failed or succeeded,
> and whether it should be retried, which has a real impact. Why was
> UNDEFINED designed in? Usually this situation occurs because of zk
> (ZooKeeper) disconnection, a jm abnormality, etc. Since an abnormality is
> present, why not use FAILED?
>
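The dilemma the description raises can be sketched from the client side. This is a hypothetical example with illustrative names only, not a real Flink or YARN API: with FAILED or SUCCEEDED the resubmit decision is mechanical, while UNDEFINED leaves the caller stuck.

```java
// Hypothetical sketch (illustrative names, not a real Flink/YARN API) of a
// client deciding whether to resubmit a job based on its final status.
public class RetryDecisionSketch {

    enum YarnFinalStatus { SUCCEEDED, FAILED, KILLED, UNDEFINED }

    static boolean shouldResubmit(YarnFinalStatus finalStatus) {
        switch (finalStatus) {
            case SUCCEEDED: return false; // done, nothing to do
            case FAILED:
            case KILLED:    return true;  // clearly needs another attempt
            default:
                // UNDEFINED: the job may have succeeded or failed. Blindly
                // resubmitting risks duplicate output, so the caller is
                // forced into out-of-band checks; this is the pain point
                // the description raises.
                throw new IllegalStateException(
                        "Final status UNDEFINED: outcome unknown");
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldResubmit(YarnFinalStatus.FAILED)); // prints true
    }
}
```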
--
This message was sent by Atlassian Jira
(v8.20.10#820010)