[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718781#comment-17718781 ]
Zhihao Chen commented on FLINK-31135:
-------------------------------------
hey [~Swathi Chandrashekar], thank you for looking into it.
{quote}This error was populated for all the checkpoints due to state
inconsistency which resulted in storing lot of checkpoints in S3, which
eventually caused the size of the configMap > 1MB
{quote}
I don't think that's the case. Instead, none of the checkpoint records in the
CM were ever cleaned up. Once the CM reaches the 1MB size limitation, the
error "Flink was not able to determine whether the metadata was successfully
persisted." starts to happen.
I have another Flink job running here as an example.
Configmap:
[^parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml]
The checkpoint IDs run consecutively as "checkpointID-0000000000000000001",
"checkpointID-0000000000000000002", ... "checkpointID-0000000000000001040".
Worth noting the IDs cover "1" through "1040" without a single gap, i.e.
nothing was ever removed. The configmap has reached the 1MB size limitation.
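To double-check the "no gaps" observation, a small sketch along the same lines (again with placeholder names; the "checkpointID-" key prefix is taken from the attached YAML):
{code:java}
// Sketch: count the checkpoint records in the HA ConfigMap and print the
// lowest/highest ID, which shows whether older entries were ever pruned.
// Assumptions: fabric8 kubernetes-client 6.x; namespace/name are placeholders;
// the "checkpointID-" key prefix is taken from the attached YAML.
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import java.util.Map;

public class CheckpointRecordCount {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            ConfigMap cm = client.configMaps()
                    .inNamespace("my-namespace")      // placeholder
                    .withName("my-job-config-map")    // placeholder
                    .get();
            Map<String, String> data = cm.getData() == null ? Map.of() : cm.getData();
            long[] ids = data.keySet().stream()
                    .filter(k -> k.startsWith("checkpointID-"))
                    .mapToLong(k -> Long.parseLong(k.substring("checkpointID-".length())))
                    .sorted()
                    .toArray();
            if (ids.length == 0) {
                System.out.println("no checkpoint records in the CM");
                return;
            }
            // With working cleanup you would expect roughly
            // state.checkpoints.num-retained entries, not every ID since job start.
            System.out.printf("%d checkpoint records, IDs %d..%d%n",
                    ids.length, ids[0], ids[ids.length - 1]);
        }
    }
}
{code}
Against the CM attached above this would report all 1040 records with IDs 1..1040.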
The "Flink was not able to determine whether the metadata was successfully
persisted." actually happens when the CM attached the record "1040". Please see
the logs below. The bottom one is first error log, which complains about the
record "1041". I think that makes sense as it's not recorded in the CM, hence
Flink can't determine if the metadata was successfully persisted.
!image-2023-05-03-13-47-51-440.png|width=1579,height=861!
The Flink dashboard log also supports this assumption.
!image-2023-05-03-13-51-21-685.png|width=1473,height=783!
My guess is that Flink never cleaned up any of the checkpoint records in the
CM at all in our case. Normally the completed checkpoint store removes
subsumed checkpoints and keeps only state.checkpoints.num-retained entries,
so seeing every ID from "1" to "1040" still present suggests that cleanup
path never ran.
> ConfigMap DataSize went > 1 MB and cluster stopped working
> ----------------------------------------------------------
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.2.0
> Reporter: Sriram Ganesh
> Priority: Major
> Attachments: dump_cm.yaml, image-2023-04-19-09-48-19-089.png,
> image-2023-05-03-13-47-51-440.png, image-2023-05-03-13-50-54-783.png,
> image-2023-05-03-13-51-21-685.png, jobmanager_log.txt,
> parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. Flink
> jobs failed with the below error. It seems the config map size went beyond 1 MB
> (the default limit).
> Since it is managed by the operator and config maps are not updated with any
> manual intervention, I suspect it could be an operator issue.
>
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: PUT at:
> https://<IP>/api/v1/namespaces/<NS>/configmaps/<job>-config-map. Message:
> ConfigMap "<job>-config-map" is invalid: []: Too long: must have at most
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422,
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must
> have at most 1048576 bytes, reason=FieldValueTooLong,
> additionalProperties={})], group=null, kind=ConfigMap, name=<job>-config-map,
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status,
> message=ConfigMap "<job>-config-map" is invalid: []: Too long: must have at
> most 1048576 bytes, metadata=ListMeta(_continue=null,
> remainingItemCount=null, resourceVersion=null, selfLink=null,
> additionalProperties={}), reason=Invalid, status=Failure,
> additionalProperties={}).
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41) ~[flink-dist-1.15.2.jar:1.15.2]
> at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325) ~[flink-dist-1.15.2.jar:1.15.2]
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
> ... 3 more {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)