[
https://issues.apache.org/jira/browse/FLINK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931738#comment-17931738
]
Alan Zhang edited comment on FLINK-36673 at 3/2/25 1:41 AM:
------------------------------------------------------------
> Checkpointing is enabled as a very last step when building the execution
> graph. If the job fails before that (e.g. when registering a source or a
> sink), the Flink runtime will return the "Checkpointing has not been enabled"
> exception.
[~gyfora] I observed exactly the same problem that [~sap1ens] described here,
using operator 1.10 and Flink 1.16, and I think this explanation makes sense.
I have a Flink application that consumes data from a Kafka topic. I
mistakenly used the wrong topic name, so the Avro schema could not be found:
{code:java}
org.apache.flink.client.program.ProgramInvocationException: The main method
caused an error: Error fetching avro schema for topic Samza-PageViewEvent1
at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98)
{code}
It seems the JM never reached the logic that enables checkpointing, because
this error occurred before that point. The Flink job is marked as "failed"[1],
and the JM REST API returns the exception "Checkpointing has not been
enabled".
In this case, rolling back by updating the image to a last-known-good version
doesn't work, because the SnapshotObserver blocks it by throwing the
exception "ReconciliationException: Could not observe latest savepoint
information" (full stacktrace is attached). Also, the job status of the
FlinkDeployment in this case should be FAILED instead of RECONCILING[3].
[1]
!Screenshot 2025-02-28 at 4.15.26 PM.png|width=481,height=207!
[2]
!Screenshot 2025-02-28 at 8.51.37 PM.png|width=679,height=201!
[3]
!Screenshot 2025-02-28 at 8.55.36 PM.png|width=618,height=386!
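To make the expected behavior concrete, here is a minimal, self-contained sketch of an observer that treats "Checkpointing has not been enabled" as "no checkpoint yet" instead of a fatal error. It is not the operator's actual code: the CheckpointClient interface, the LenientCheckpointObserver class, and the local ReconciliationException are hypothetical stand-ins. Only the two message strings are taken from this thread.

```java
import java.util.Optional;

public class LenientCheckpointObserver {

    /** Hypothetical stand-in for the operator's ReconciliationException. */
    static class ReconciliationException extends RuntimeException {
        ReconciliationException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for the REST call that fetches the latest checkpoint path. */
    interface CheckpointClient {
        Optional<String> fetchLatestCheckpointPath() throws Exception;
    }

    // Message reported in this thread when the job failed before checkpointing was set up.
    static final String NOT_ENABLED = "Checkpointing has not been enabled";

    /** Walks the cause chain and reports whether any message contains NOT_ENABLED. */
    static boolean isCheckpointingNotEnabled(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            String msg = cur.getMessage();
            if (msg != null && msg.contains(NOT_ENABLED)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Returns the latest checkpoint path if one exists. A "not enabled" error is
     * mapped to Optional.empty(); any other failure is still surfaced.
     */
    static Optional<String> observeLatestCheckpoint(CheckpointClient client) {
        try {
            return client.fetchLatestCheckpointPath();
        } catch (Exception e) {
            if (isCheckpointingNotEnabled(e)) {
                // Job failed before checkpointing was enabled: nothing to observe.
                return Optional.empty();
            }
            throw new ReconciliationException(
                    "Could not observe latest savepoint information", e);
        }
    }

    public static void main(String[] args) {
        // Simulates a job that failed during startup, before checkpointing was enabled.
        CheckpointClient failedEarly = () -> {
            throw new IllegalStateException(NOT_ENABLED);
        };
        System.out.println(observeLatestCheckpoint(failedEarly).isPresent());
    }
}
```

With handling along these lines, a deployment that failed during startup would report no checkpoint, and the rollback to the last-known-good image could proceed instead of being blocked.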
> Operator is not properly handling failed deployments without savepoints
> -----------------------------------------------------------------------
>
> Key: FLINK-36673
> URL: https://issues.apache.org/jira/browse/FLINK-36673
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Reporter: Yaroslav Tkachenko
> Priority: Major
> Attachments: Screenshot 2025-02-28 at 4.15.26 PM.png, Screenshot
> 2025-02-28 at 8.51.37 PM.png, Screenshot 2025-02-28 at 8.55.36 PM.png,
> stacktrace.txt
>
>
> I noticed an issue after upgrading Flink Kubernetes Operator from 1.9 to 1.10.
> When I deploy a FlinkDeployment that fails during the startup, I get a
> "ReconciliationException: Could not observe latest savepoint information"
> (full stacktrace is attached).
> I think the issue was introduced here:
> [https://github.com/apache/flink-kubernetes-operator/pull/871].
> *AbstractFlinkService.getLastCheckpoint* now throws a
> *ReconciliationException* when a savepoint is not available, and
> *SnapshotObserver.observeLatestCheckpoint* doesn't handle it properly. I
> think having no savepoint is completely normal in some situations (e.g. a
> brand new job).