[
https://issues.apache.org/jira/browse/FLINK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luca Castelli updated FLINK-37320:
----------------------------------
Description:
Hello,
I believe I've found bugs within the observation logic for both finite
streaming and batch jobs. This is a follow-up to:
[https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
*For finite streaming jobs:*
# The job finishes successfully and the job status changes to FINISHED
# TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the JM
deployments and clears HA configmap data
# On the next loop, the observer sees MISSING JM and changes the job status
from FINISHED to RECONCILING
The job had reached a terminal state. It shouldn't have been set back to
RECONCILING.
This leads to an operator error later, when a recovery attempt is triggered. The
recovery is triggered because the JM is MISSING, the status is RECONCILING, the
spec shows RUNNING, and HA is enabled. The recovery fails with
validateHaMetadataExists throwing UpgradeFailureException.
At that point the deployment gets stuck in a loop with status RECONCILING and
UpgradeFailureException thrown on each cycle. I've attached operator logs
showing this.
I think the fix would be to wrap
[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
in an if-statement that checks that the job is not in a terminal state. Happy to
discuss and/or put up the two-line PR.
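To make the proposed guard concrete, here is a minimal, self-contained sketch. It is not the operator's code: the class, the enums, and the observe method are hypothetical stand-ins for the observer logic at the linked line, and the real check would presumably rely on whatever terminal-state test the operator already has (for example something like Flink's JobStatus.isGloballyTerminalState(), which is an assumption on my part).

```java
public class TerminalStateGuardSketch {

    // Hypothetical stand-in for the job status tracked by the operator.
    enum JobStatus {
        RUNNING(false),
        RECONCILING(false),
        FINISHED(true),
        FAILED(true),
        CANCELED(true);

        private final boolean terminal;

        JobStatus(boolean terminal) {
            this.terminal = terminal;
        }

        boolean isTerminal() {
            return terminal;
        }
    }

    // Hypothetical stand-in for the observed JobManager deployment status.
    enum JmStatus {
        READY,
        MISSING
    }

    // Returns the status the observer should record for this loop.
    // A MISSING JM only moves the job to RECONCILING when the job has
    // not already reached a terminal state.
    static JobStatus observe(JobStatus current, JmStatus jm) {
        if (jm == JmStatus.MISSING && !current.isTerminal()) {
            return JobStatus.RECONCILING;
        }
        return current;
    }

    public static void main(String[] args) {
        // FINISHED job whose JM was removed by the TTL cleanup: status is kept.
        System.out.println(observe(JobStatus.FINISHED, JmStatus.MISSING));
        // Running job whose JM disappeared: still goes to RECONCILING.
        System.out.println(observe(JobStatus.RUNNING, JmStatus.MISSING));
    }
}
```

With this guard, a FINISHED job stays FINISHED after the JM deployment is cleaned up, so the later recovery attempt and the validateHaMetadataExists failure are never reached for it.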
*For batch jobs:*
The root error:
{code:java}
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [org.apache.flink.runtime.rest.handler.RestHandlerException: Checkpointing has not been enabled.
	at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.createCheckpointingStatistics(CheckpointingStatisticsHandler.java:107)
{code}
# In
[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L570]
was:
Hello,
I believe I've found bugs within the observation logic for both finite
streaming and batch jobs. This is a follow-up to:
[https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
*For finite streaming jobs:*
# The job finishes successfully and the job status changes to FINISHED
# TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the JM
deployments and clears HA configmap data
# On the next loop, the observer sees MISSING JM and changes the job status
from FINISHED to RECONCILING
The job had reached a terminal state. It shouldn't have been set back to
RECONCILING.
This leads to an operator error later, when a recovery attempt is triggered. The
recovery is triggered because the JM is MISSING, the status is RECONCILING, the
spec shows RUNNING, and HA is enabled. The recovery fails with
validateHaMetadataExists throwing UpgradeFailureException.
At that point the deployment gets stuck in a loop with status RECONCILING and
UpgradeFailureException thrown on each cycle. I've attached operator logs
showing this.
I think the fix would be to wrap
[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
in an if-statement that checks that the job is not in a terminal state. Happy to
discuss and/or put up the two-line PR.
*For batch jobs:*
# Batch jobs don't use checkpointing
> FINISHED jobs incorrectly being set to RECONCILING
> --------------------------------------------------
>
> Key: FLINK-37320
> URL: https://issues.apache.org/jira/browse/FLINK-37320
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.10.0
> Environment: I've attached the flinkdeployment CR and operator-config
> that I used to reproduce this locally.
> Reporter: Luca Castelli
> Priority: Minor
> Attachments: operator-config.yaml, operator-log-batch-job.log,
> operator-log-finite-streaming-job.log, test-batch-job.yaml,
> test-finite-streaming-job.yaml
>
>
> Hello,
> I believe I've found bugs within the observation logic for both finite
> streaming and batch jobs. This is a follow-up to:
> [https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
> *For finite streaming jobs:*
> # The job finishes successfully and the job status changes to FINISHED
> # TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the
> JM deployments and clears HA configmap data
> # On the next loop, the observer sees MISSING JM and changes the job status
> from FINISHED to RECONCILING
> The job had reached a terminal state. It shouldn't have been set back to
> RECONCILING.
> This leads to an operator error later, when a recovery attempt is triggered.
> The recovery is triggered because the JM is MISSING, the status is
> RECONCILING, the spec shows RUNNING, and HA is enabled. The recovery fails with
> validateHaMetadataExists throwing UpgradeFailureException.
> At that point the deployment gets stuck in a loop with status RECONCILING and
> UpgradeFailureException thrown on each cycle. I've attached operator logs
> showing this.
> I think the fix would be to wrap
> [https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
> in an if-statement that checks that the job is not in a terminal state. Happy to
> discuss and/or put up the two-line PR.
> *For batch jobs:*
> The root error:
> {code:java}
> Caused by: org.apache.flink.runtime.rest.util.RestClientException: [org.apache.flink.runtime.rest.handler.RestHandlerException: Checkpointing has not been enabled.
> 	at org.apache.flink.runtime.rest.handler.job.checkpoints.CheckpointingStatisticsHandler.createCheckpointingStatistics(CheckpointingStatisticsHandler.java:107)
> {code}
>
> # In
> [https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L570]