[ 
https://issues.apache.org/jira/browse/FLINK-33108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766355#comment-17766355
 ] 

Márton Balassi edited comment on FLINK-33108 at 9/18/23 12:25 PM:
------------------------------------------------------------------

The total resource usage in the clusterInfo of the status is updated by the 
operator in the reconcile loop. We rely on this for application mode 
metrics.

I suspect the cause is that the two session jobs are adding resources 
(taskmanagers) concurrently, and the resulting status updates conflict. One 
possible solution is to simply remove these totals from the session cluster 
and keep them only at the individual job level. Otherwise we need to handle 
the concurrency of these updates.
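
To illustrate what handling the concurrency could look like, here is a minimal 
sketch of a versioned status update that re-reads and retries on conflict. All 
class and method names are hypothetical and are not the operator's actual API; 
the in-memory store only stands in for the API server's version check, and the 
cpu values mirror the 0.25 -> 0.5 change seen in the log.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; none of these types exist in the operator code base.
final class StatusConflictSketch {

    /** Immutable snapshot of the relevant status field plus a version counter. */
    static final class Status {
        final long version;
        final double totalCpu;

        Status(long version, double totalCpu) {
            this.version = version;
            this.totalCpu = totalCpu;
        }
    }

    /** Stand-in for the API server: a write succeeds only against the latest snapshot. */
    static final class StatusStore {
        private final AtomicReference<Status> current =
                new AtomicReference<>(new Status(1, 0.25));

        Status read() {
            return current.get();
        }

        /** Returns false (a conflict) if the status changed after `seen` was read. */
        boolean replaceIfUnchanged(Status seen, double newTotalCpu) {
            Status next = new Status(seen.version + 1, newTotalCpu);
            // compareAndSet compares references, so a stale snapshot is rejected.
            return current.compareAndSet(seen, next);
        }
    }

    /** Adds cpu to the total, re-reading and retrying when another writer won the race. */
    static int addCpuWithRetry(StatusStore store, double cpu) {
        int attempts = 0;
        while (true) {
            attempts++;
            Status seen = store.read();
            if (store.replaceIfUnchanged(seen, seen.totalCpu + cpu)) {
                return attempts;
            }
        }
    }
}
```

A writer holding a stale snapshot gets a rejected update instead of silently 
overwriting the newer totals, and the retry loop recomputes from the latest 
state.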


was (Author: mbalassi):
That value change reflects the taskmanager having been successfully added to 
the test application, which raises the total resource utilization. We should 
make the session cluster reconciliation resilient to this. I do not yet fully 
understand why this error occurs.

These fields in the status of the CR are only modified by the operator itself.

> Error during error status handling
> ----------------------------------
>
>                 Key: FLINK-33108
>                 URL: https://issues.apache.org/jira/browse/FLINK-33108
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Gabor Somogyi
>            Priority: Major
>
> e2e_ci (v1_13, flink, native, test_multi_sessionjob.sh) failed with the 
> following issue:
> {code:java}
> Error: 2023-09-18 08:26:41,813 i.j.o.p.e.ReconciliationDispatcher 
> [ERROR][flink/session-cluster-1] Error during error status handling.
> org.apache.flink.kubernetes.operator.exception.StatusConflictException: 
> Status have been modified externally in version 1374 Previous: 
> {"jobStatus":{"jobName":null,"jobId":null,"state":null,"startTime":null,"updateTime":null,"savepointInfo":{"lastSavepoint":null,"triggerId":null,"triggerTimestamp":null,"triggerType":null,"formatType":null,"savepointHistory":[],"lastPeriodicSavepointTimestamp":0},"checkpointInfo":{"lastCheckpoint":null,"triggerId":null,"triggerTimestamp":null,"triggerType":null,"formatType":null,"lastPeriodicCheckpointTimestamp":0}},"error":null,"lifecycleState":"STABLE","clusterInfo":{"total-cpu":"0.25","flink-version":"1.13.6","flink-revision":"b2ca390
>  @ 
> 2022-02-03T14:54:22+01:00","total-memory":"1073741824"},"jobManagerDeploymentStatus":"READY","reconciliationStatus":{"reconciliationTimestamp":1695025410957,"lastReconciledSpec":"{\"spec\":{\"job\":null,\"restartNonce\":null,\"flinkConfiguration\":{\"high-availability\":\"org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory\",\"high-availability.storageDir\":\"file:///opt/flink/volume/flink-ha\",\"state.checkpoints.dir\":\"file:///opt/flink/volume/flink-cp\",\"state.savepoints.dir\":\"file:///opt/flink/volume/flink-sp\",\"taskmanager.numberOfTaskSlots\":\"2\"},\"image\":\"flink:1.13\",\"imagePullPolicy\":null,\"serviceAccount\":\"flink\",\"flinkVersion\":\"v1_13\",\"ingress\":{\"template\":\"/{{namespace}}/{{name}}(/|$)(.*)\",\"className\":\"nginx\",\"annotations\":{\"nginx.ingress.kubernetes.io/rewrite-target\":\"/$2\"}},\"podTemplate\":{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"metadata\":{\"name\":\"pod-template\"},\"spec\":{\"containers\":[{\"name\":\"flink-main-container\",\"resources\":{\"limits\":{\"ephemeral-storage\":\"2048Mi\"},\"requests\":{\"ephemeral-storage\":\"2048Mi\"}},\"volumeMounts\":[{\"mountPath\":\"/opt/flink/volume\",\"name\":\"flink-volume\"}]}],\"volumes\":[{\"name\":\"flink-volume\",\"persistentVolumeClaim\":{\"claimName\":\"session-cluster-1-pvc\"}}]}},\"jobManager\":{\"resource\":{\"cpu\":0.25,\"memory\":\"1024m\",\"ephemeralStorage\":null},\"replicas\":1,\"podTemplate\":null},\"taskManager\":{\"resource\":{\"cpu\":0.25,\"memory\":\"1024m\",\"ephemeralStorage\":null},\"replicas\":null,\"podTemplate\":null},\"logConfiguration\":null,\"mode\":\"native\"},\"resource_metadata\":{\"apiVersion\":\"flink.apache.org/v1beta1\",\"metadata\":{\"generation\":2},\"firstDeployment\":true}}","lastStableSpec":"{\"spec\":{\"job\":null,\"restartNonce\":null,\"flinkConfiguration\":{\"high-availability\":\"org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory\",\"high-availability.storageDir\":\"file:///opt/flink/volum
e/flink-ha\",\"state.checkpoints.dir\":\"file:///opt/flink/volume/flink-cp\",\"state.savepoints.dir\":\"file:///opt/flink/volume/flink-sp\",\"taskmanager.numberOfTaskSlots\":\"2\"},\"image\":\"flink:1.13\",\"imagePullPolicy\":null,\"serviceAccount\":\"flink\",\"flinkVersion\":\"v1_13\",\"ingress\":{\"template\":\"/{{namespace}}/{{name}}(/|$)(.*)\",\"className\":\"nginx\",\"annotations\":{\"nginx.ingress.kubernetes.io/rewrite-target\":\"/$2\"}},\"podTemplate\":{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"metadata\":{\"name\":\"pod-template\"},\"spec\":{\"containers\":[{\"name\":\"flink-main-container\",\"resources\":{\"limits\":{\"ephemeral-storage\":\"2048Mi\"},\"requests\":{\"ephemeral-storage\":\"2048Mi\"}},\"volumeMounts\":[{\"mountPath\":\"/opt/flink/volume\",\"name\":\"flink-volume\"}]}],\"volumes\":[{\"name\":\"flink-volume\",\"persistentVolumeClaim\":{\"claimName\":\"session-cluster-1-pvc\"}}]}},\"jobManager\":{\"resource\":{\"cpu\":0.25,\"memory\":\"1024m\",\"ephemeralStorage\":null},\"replicas\":1,\"podTemplate\":null},\"taskManager\":{\"resource\":{\"cpu\":0.25,\"memory\":\"1024m\",\"ephemeralStorage\":null},\"replicas\":null,\"podTemplate\":null},\"logConfiguration\":null,\"mode\":\"native\"},\"resource_metadata\":{\"apiVersion\":\"flink.apache.org/v1beta1\",\"metadata\":{\"generation\":2},\"firstDeployment\":true}}","state":"DEPLOYED"},"taskManager":null}
>  Latest: 
> {"jobStatus":{"jobName":null,"jobId":null,"state":null,"startTime":null,"updateTime":null,"savepointInfo":{"lastSavepoint":null,"triggerId":null,"triggerTimestamp":null,"triggerType":null,"formatType":null,"savepointHistory":[],"lastPeriodicSavepointTimestamp":0},"checkpointInfo":{"lastCheckpoint":null,"triggerId":null,"triggerTimestamp":null,"triggerType":null,"formatType":null,"lastPeriodicCheckpointTimestamp":0}},"error":null,"lifecycleState":"STABLE","clusterInfo":{"flink-revision":"b2ca390
>  @ 
> 2022-02-03T14:54:22+01:00","flink-version":"1.13.6","total-cpu":"0.5","total-memory":"2147483648"},"jobManagerDeploymentStatus":"READY","reconciliationStatus":{"reconciliationTimestamp":1695025410957,"lastReconciledSpec":"{\"spec\":{\"job\":null,\"restartNonce\":null,\"flinkConfiguration\":{\"high-availability\":\"org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory\",\"high-availability.storageDir\":\"file:///opt/flink/volume/flink-ha\",\"state.checkpoints.dir\":\"file:///opt/flink/volume/flink-cp\",\"state.savepoints.dir\":\"file:///opt/flink/volume/flink-sp\",\"taskmanager.numberOfTaskSlots\":\"2\"},\"image\":\"flink:1.13\",\"imagePullPolicy\":null,\"serviceAccount\":\"flink\",\"flinkVersion\":\"v1_13\",\"ingress\":{\"template\":\"/{{namespace}}/{{name}}(/|$)(.*)\",\"className\":\"nginx\",\"annotations\":{\"nginx.ingress.kubernetes.io/rewrite-target\":\"/$2\"}},\"podTemplate\":{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"metadata\":{\"name\":\"pod-template\"},\"spec\":{\"containers\":[{\"name\":\"flink-main-container\",\"resources\":{\"limits\":{\"ephemeral-storage\":\"2048Mi\"},\"requests\":{\"ephemeral-storage\":\"2048Mi\"}},\"volumeMounts\":[{\"mountPath\":\"/opt/flink/volume\",\"name\":\"flink-volume\"}]}],\"volumes\":[{\"name\":\"flink-volume\",\"persistentVolumeClaim\":{\"claimName\":\"session-cluster-1-pvc\"}}]}},\"jobManager\":{\"resource\":{\"cpu\":0.25,\"memory\":\"1024m\",\"ephemeralStorage\":null},\"replicas\":1,\"podTemplate\":null},\"taskManager\":{\"resource\":{\"cpu\":0.25,\"memory\":\"1024m\",\"ephemeralStorage\":null},\"replicas\":null,\"podTemplate\":null},\"logConfiguration\":null,\"mode\":\"native\"},\"resource_metadata\":{\"apiVersion\":\"flink.apache.org/v1beta1\",\"metadata\":{\"generation\":2},\"firstDeployment\":true}}","lastStableSpec":"{\"spec\":{\"job\":null,\"restartNonce\":null,\"flinkConfiguration\":{\"high-availability\":\"org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory\",\"high-availabi
lity.storageDir\":\"file:///opt/flink/volume/flink-ha\",\"state.checkpoints.dir\":\"file:///opt/flink/volume/flink-cp\",\"state.savepoints.dir\":\"file:///opt/flink/volume/flink-sp\",\"taskmanager.numberOfTaskSlots\":\"2\"},\"image\":\"flink:1.13\",\"imagePullPolicy\":null,\"serviceAccount\":\"flink\",\"flinkVersion\":\"v1_13\",\"ingress\":{\"template\":\"/{{namespace}}/{{name}}(/|$)(.*)\",\"className\":\"nginx\",\"annotations\":{\"nginx.ingress.kubernetes.io/rewrite-target\":\"/$2\"}},\"podTemplate\":{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"metadata\":{\"name\":\"pod-template\"},\"spec\":{\"containers\":[{\"name\":\"flink-main-container\",\"resources\":{\"limits\":{\"ephemeral-storage\":\"2048Mi\"},\"requests\":{\"ephemeral-storage\":\"2048Mi\"}},\"volumeMounts\":[{\"mountPath\":\"/opt/flink/volume\",\"name\":\"flink-volume\"}]}],\"volumes\":[{\"name\":\"flink-volume\",\"persistentVolumeClaim\":{\"claimName\":\"session-cluster-1-pvc\"}}]}},\"jobManager\":{\"resource\":{\"cpu\":0.25,\"memory\":\"1024m\",\"ephemeralStorage\":null},\"replicas\":1,\"podTemplate\":null},\"taskManager\":{\"resource\":{\"cpu\":0.25,\"memory\":\"1024m\",\"ephemeralStorage\":null},\"replicas\":null,\"podTemplate\":null},\"logConfiguration\":null,\"mode\":\"native\"},\"resource_metadata\":{\"apiVersion\":\"flink.apache.org/v1beta1\",\"metadata\":{\"generation\":2},\"firstDeployment\":true}}","state":"DEPLOYED"},"taskManager":null}
> {code}
> Link: 
> https://github.com/apache/flink-kubernetes-operator/actions/runs/6219937225/job/16879006709?pr=676#logs



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
