Rasmus Bilgram created FLINK-38106:
--------------------------------------
Summary: Job gets indefinitely stuck with "Job Not Found" events
Key: FLINK-38106
URL: https://issues.apache.org/jira/browse/FLINK-38106
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.12.1
Reporter: Rasmus Bilgram
We are running Flink jobs using the last-state upgradeMode. When upgrading a
job with a different job graph, the job ends up in an undesirable state where
we only see "Job Not Found" events, there is no HA metadata, and restoring is
only possible from the latest savepoint.
From the logs, Flink does not allow changing the job graph when restoring from
a checkpoint; such an upgrade is only possible using upgradeMode: savepoint,
and we have used that to reproduce the issue.
Steps:
1. Upgrade a job with a job graph change using last-state upgradeMode.
2. The job manager pod fails with "Caused by: java.lang.IllegalStateException:
There is no operator for the state [id]" and restarts.
3. When the job manager starts, /overview returns an empty list of jobs to the
operator.
4. The operator sets the status to RECONCILING; since it is not FAILED, no
redeployments are attempted.
5. Operator starts producing "Job Not Found" events
6. We observed that the HA metadata is also missing
7. Job is stuck until we manually restore from savepoint
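
For illustration only, the symptoms in steps 3 to 6 can be written as a small
check. This is a rough sketch, not the operator's actual logic: the function
name and its parameters are hypothetical, and the {"jobs": [...]} response
shape is an assumption about what the job manager's jobs overview returns.

```python
from typing import Any, Mapping


def looks_stuck(jobs_overview: Mapping[str, Any],
                job_state: str,
                ha_metadata_present: bool) -> bool:
    """Return True when all symptoms from the steps above hold.

    jobs_overview: parsed JSON from the job manager's jobs overview
        (assumed shape: {"jobs": [...]}).
    job_state: the status the operator reports for the job (step 4).
    ha_metadata_present: whether HA metadata could be found (step 6).
    """
    # Step 3: the job manager reports no jobs at all.
    no_jobs = len(jobs_overview.get("jobs", [])) == 0
    # Step 4: the status is not FAILED, so no redeployment is attempted.
    not_failed = job_state != "FAILED"
    # Step 6: no HA metadata is left to restore from.
    return no_jobs and not_failed and not ha_metadata_present
```

With an empty job list, a RECONCILING status, and no HA metadata, the check
returns True, which corresponds to the state that needed a manual restore from
a savepoint in step 7.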
We are a little concerned that this could also be caused by other issues, maybe
an OOM on the job manager; in that case a FAILED state is perhaps better, so
that retry mechanisms are triggered.
Alternatively, it would be great if a rollback from the latest checkpoint (with
the former job graph) were possible. We tried the rollback mechanism, but it
complained about missing HA metadata.
It seems similar to: https://issues.apache.org/jira/browse/FLINK-32631