liad shachoach created FLINK-29620:
--------------------------------------
Summary: Flink deployment stuck in UPGRADING state when changing configuration
Key: FLINK-29620
URL: https://issues.apache.org/jira/browse/FLINK-29620
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.14.2
Environment: AWS EKS v1.21
Operator version: 1.1.0
Reporter: liad shachoach
When I update the configuration of a Flink deployment, I observe one of two
scenarios:
Success:
This happens when the job has not started yet, i.e. if I change the
configuration quickly enough:
{code:java}
2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][load-streaming/validator-process-124] Job is not running but HA metadata is available for last state restore, ready for upgrade
2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Deleting JobManager deployment while preserving HA metadata.
2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO ][load-streaming/validator-process-124] Generating new config
{code}
In this scenario I see that the job manager and task manager pods are
terminated and then recreated.
Failure:
This happens when I let the job start (waiting at least 30-60 seconds) and
then change the configuration:
{code:java}
2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Upgrading/Restarting running job, suspending first...
2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][load-streaming/validator-process-124] Job is in running state, ready for upgrade with SAVEPOINT
2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Suspending job with savepoint.
2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Job successfully suspended with savepoint s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2.
2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s)
2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s)
2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s)
2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s)
2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s)
2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s)
2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s)
2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s)
2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)
2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Cluster shutdown completed.
2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] End of reconciliation
2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO ][load-streaming/validator-process-124] Starting reconciliation
2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN ][load-streaming/validator-process-124] Running deployment generation 3 doesn't match upgrade target generation 4.
2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][load-streaming/validator-process-124] Detected spec change, starting reconciliation.
2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService [INFO ][load-streaming/validator-process-124] Deploying application cluster
2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils [INFO ][load-streaming/validator-process-124] Job graph in ConfigMap validator-process-124-dispatcher-leader is deleted
2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][load-streaming/validator-process-124] Submitting application in 'Application Mode'.
2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][load-streaming/validator-process-124] The derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN ][load-streaming/validator-process-124] Attempt count: 0, last attempt: false
2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher [ERROR][load-streaming/validator-process-124] Error during event processing ExecutionScope{ resource id: ResourceID{name='validator-process-124', namespace='load-streaming'}, version: 1116792084} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
    at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201)
    at io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153)
    at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83)
    at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59)
    at io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster validator-process-124 already exists.
    at org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181)
    at org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
    at org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115)
    ... 13 more
{code}
In this scenario I see that the job manager pod is restarted (not recreated),
the task manager pods are not updated, and the Flink config maps are not
updated. The Flink deployment state changes to UPGRADING and the above
exception is repeated.
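To check whether an operator log shows the same pattern as the failure above, a small Python sketch like the following might help. It is not part of the operator: `summarize` and `UpgradeAttempt` are hypothetical names, and the regular expressions assume the exact message wording seen in the logs in this report. Whitespace is normalized first so that hard-wrapped log lines still match.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Patterns assume the message wording shown in the operator logs of this report.
WAIT_RE = re.compile(r"Waiting for cluster shutdown\.\.\. \((\d+)s\)")
EXISTS_RE = re.compile(r"The Flink cluster (\S+) already exists")


@dataclass
class UpgradeAttempt:
    """Summary of one upgrade attempt as seen in an operator log excerpt."""
    longest_wait_s: int                     # largest "(Ns)" value in the shutdown wait messages
    shutdown_reported: bool                 # "Cluster shutdown completed." was logged
    already_exists_cluster: Optional[str]   # cluster id from the "already exists" error, if any


def summarize(log_text: str) -> UpgradeAttempt:
    # Collapse all whitespace so messages wrapped across lines still match.
    text = " ".join(log_text.split())
    waits = [int(seconds) for seconds in WAIT_RE.findall(text)]
    exists = EXISTS_RE.search(text)
    return UpgradeAttempt(
        longest_wait_s=max(waits, default=0),
        shutdown_reported="Cluster shutdown completed" in text,
        already_exists_cluster=exists.group(1) if exists else None,
    )


if __name__ == "__main__":
    # Excerpt taken from the failure log in this report.
    excerpt = """
    2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils [INFO
    ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
    2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils [INFO
    ][load-streaming/validator-process-124] Cluster shutdown completed.
    Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The
    Flink cluster validator-process-124 already exists.
    """
    print(summarize(excerpt))
```

A result where `shutdown_reported` is true and `already_exists_cluster` is also set corresponds to the failure scenario: the operator logged a completed shutdown, yet the subsequent deployment found the cluster still present.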
Flink deployment spec:
{code:java}
flinkVersion: v1_14
job:
  allowNonRestoredState: true
  args: ...
  entryClass: ...
  jarURI: ...
  parallelism: x
  savepointTriggerNonce: 0
  state: running
  upgradeMode: savepoint
jobManager:
  podTemplate:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nodeType
                operator: In
                values:
                - someValue
      containers:
      - name: flink-main-container
        resources:
          limits:
            cpu: "1"
            memory: 1.6Gi
          requests:
            cpu: "0.2"
            memory: 1Gi
      tolerations:
      - effect: NoSchedule
        key: someValue
        value: "true"
  replicas: 1
podTemplate:
  apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
      prometheus.io/path: /metrics
      prometheus.io/port: "9260"
      prometheus.io/scrape: "true"
    labels:
      app.kubernetes.io/instance: flink-validator-process-124
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: apache-flink
      app.kubernetes.io/version: test
      helm.sh/chart: apache-flink-1.0.0
  spec:
    containers: []
    imagePullSecrets: []
serviceAccount: validator-process-124
taskManager:
  podTemplate:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        configmap.reloader.stakater.com/reload: flink-config-validator-process-124,pod-template-validator-process-124
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nodeType
                operator: In
                values:
                - someValue
      containers:
      - name: flink-main-container
        resources:
          limits:
            cpu: "1"
            memory: 3.6Gi
          requests:
            cpu: "0.2"
            memory: 3Gi
      tolerations:
      - effect: NoSchedule
        key: someValue
        value: "true"
{code}
Please let me know if more details are required.