[
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718663#comment-17718663
]
SwathiChandrashekar edited comment on FLINK-31135 at 5/2/23 5:35 PM:
---------------------------------------------------------------------
[~zhihaochen] , it looks like Flink was hitting an inconsistent-state exception while writing the checkpoint metadata to S3. As a result, it logged the following error for every checkpoint and asked for manual cleanup, so those checkpoints were never considered for automatic cleanup.
{code}
{"@timestamp":"2023-04-26T23:53:30.704Z","ecs.version":"1.2.0","log.level":"WARN","message":"An error occurred while writing checkpoint 11217 to the underlying metadata store. Flink was not able to determine whether the metadata was successfully persisted. The corresponding state located at 's3://eureka-flink-data-prod/parked-logs-ingestion-16818796-0c2923/checkpoints/07bfdfef145a87c2071965081aaff548/chk-11217' won't be discarded and needs to be cleaned up manually.","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
{code}
Link: [https://github.com/apache/flink/blob/release-1.15/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1389]
Because this error occurred for every checkpoint due to the state inconsistency, a large number of checkpoints accumulated in S3, and the metadata tracked for them eventually pushed the ConfigMap size past the 1 MB limit.
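A quick way to confirm this failure mode is to compare a ConfigMap's serialized size against the 1048576-byte limit that appears in the error. A minimal sketch, assuming the ConfigMap has been dumped to a file (e.g. via `kubectl get configmap <job>-config-map -o yaml > dump_cm.yaml`; the names here are placeholders, not from this cluster):

```shell
# Kubernetes rejects objects whose serialized form exceeds 1048576 bytes (1 MiB);
# that is the "Too long: must have at most 1048576 bytes" error in this report.
check_cm_size() {
  local limit=1048576
  # Arithmetic expansion normalizes wc output (BSD wc pads with spaces).
  local size=$(( $(wc -c < "$1") ))
  if [ "$size" -gt "$limit" ]; then
    echo "too-large:$size"
  else
    echo "ok:$size"
  fi
}
```

For example, `check_cm_size dump_cm.yaml` would show whether the attached dump is already over the limit.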
> ConfigMap DataSize went > 1 MB and cluster stopped working
> ----------------------------------------------------------
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.2.0
> Reporter: Sriram Ganesh
> Priority: Major
> Attachments: dump_cm.yaml, image-2023-04-19-09-48-19-089.png,
> jobmanager_log.txt
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. Flink jobs
> failed with the error below. It seems the ConfigMap size went beyond 1 MB
> (the default limit).
> Since the cluster is managed by the operator and the ConfigMaps are not updated with any
> manual intervention, I suspect it could be an operator issue.
>
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: PUT at:
> https://<IP>/api/v1/namespaces/<NS>/configmaps/<job>-config-map. Message:
> ConfigMap "<job>-config-map" is invalid: []: Too long: must have at most
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422,
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must
> have at most 1048576 bytes, reason=FieldValueTooLong,
> additionalProperties={})], group=null, kind=ConfigMap, name=<job>-config-map,
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status,
> message=ConfigMap "<job>-config-map" is invalid: []: Too long: must have at
> most 1048576 bytes, metadata=ListMeta(_continue=null,
> remainingItemCount=null, resourceVersion=null, selfLink=null,
> additionalProperties={}), reason=Invalid, status=Failure,
> additionalProperties={}).
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
> ~[?:?]
> ... 3 more {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)