[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718781#comment-17718781 ]
Zhihao Chen commented on FLINK-31135:
-------------------------------------
hey [~Swathi Chandrashekar], thank you for looking into it.
{quote}This error was populated for all the checkpoints due to state
inconsistency which resulted in storing lot of checkpoints in S3, which
eventually caused the size of the configMap > 1MB
{quote}
I don't think that's the case. Instead, none of the checkpoint records in the
CM were ever cleaned up. Once the CM reaches the 1MB size limitation, the
error "Flink was not able to determine whether the metadata was successfully
persisted." starts to happen.
I have another Flink job running here as an example.
Configmap:
[^parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml]
The checkpoint IDs run consecutively as "checkpointID-0000000000000000001",
"checkpointID-0000000000000000002", ... "checkpointID-0000000000000001040".
Worth noting the IDs cover "1" through "1040" without a single gap, i.e.
nothing was ever removed. The configmap has reached the 1MB size limitation.
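To double-check the "no gaps" observation, a small sketch along the same lines (again with placeholder names; the "checkpointID-" key prefix is taken from the attached YAML):
{code:java}
// Sketch: count the checkpoint records in the HA ConfigMap and print the
// lowest/highest ID, which shows whether older entries were ever pruned.
// Assumptions: fabric8 kubernetes-client 6.x; namespace/name are placeholders;
// the "checkpointID-" key prefix is taken from the attached YAML.
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import java.util.Map;

public class CheckpointRecordCount {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            ConfigMap cm = client.configMaps()
                    .inNamespace("my-namespace")      // placeholder
                    .withName("my-job-config-map")    // placeholder
                    .get();
            Map<String, String> data = cm.getData() == null ? Map.of() : cm.getData();
            long[] ids = data.keySet().stream()
                    .filter(k -> k.startsWith("checkpointID-"))
                    .mapToLong(k -> Long.parseLong(k.substring("checkpointID-".length())))
                    .sorted()
                    .toArray();
            if (ids.length == 0) {
                System.out.println("no checkpoint records in the CM");
                return;
            }
            // With working cleanup you would expect roughly
            // state.checkpoints.num-retained entries, not every ID since job start.
            System.out.printf("%d checkpoint records, IDs %d..%d%n",
                    ids.length, ids[0], ids[ids.length - 1]);
        }
    }
}
{code}
Against the CM attached above this would report all 1040 records with IDs 1..1040.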
The "Flink was not able to determine whether the metadata was successfully
persisted." actually happens when the CM attached the record "1040". Please see
the logs below. The bottom one is first error log, which complains about the
record "1041". I think that makes sense as it's not recorded in the CM, hence
Flink can't determine if the metadata was successfully persisted.
!image-2023-05-03-13-47-51-440.png|width=1579,height=861!
The Flink dashboard log also supports this assumption.
!image-2023-05-03-13-51-21-685.png|width=1473,height=783!
My guess is that Flink never cleaned up any of the checkpoint records in the
CM at all in our case. Normally the completed checkpoint store removes
subsumed checkpoints and keeps only state.checkpoints.num-retained entries,
so seeing every ID from "1" to "1040" still present suggests that cleanup
path never ran.
> ConfigMap DataSize went > 1 MB and cluster stopped working
> ----------------------------------------------------------
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.2.0
> Reporter: Sriram Ganesh
> Priority: Major
> Attachments: dump_cm.yaml, image-2023-04-19-09-48-19-089.png,
> image-2023-05-03-13-47-51-440.png, image-2023-05-03-13-50-54-783.png,
> image-2023-05-03-13-51-21-685.png, jobmanager_log.txt,
> parked-logs-ingestion-644b80-3494e4c01b82eb7a75a76080974b41cd-config-map.yaml
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. Flink
> jobs failed with the below error. It seems the config map size went beyond 1 MB
> (the default limit).
> Since it is managed by the operator and config maps are not updated with any
> manual intervention, I suspect it could be an operator issue.
>
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: PUT at:
> https://<IP>/api/v1/namespaces/<NS>/configmaps/<job>-config-map. Message:
> ConfigMap "<job>-config-map" is invalid: []: Too long: must have at most
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422,
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must
> have at most 1048576 bytes, reason=FieldValueTooLong,
> additionalProperties={})], group=null, kind=ConfigMap, name=<job>-config-map,
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status,
> message=ConfigMap "<job>-config-map" is invalid: []: Too long: must have at
> most 1048576 bytes, metadata=ListMeta(_continue=null,
> remainingItemCount=null, resourceVersion=null, selfLink=null,
> additionalProperties={}), reason=Invalid, status=Failure,
> additionalProperties={}).
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130) ~[flink-dist-1.15.2.jar:1.15.2]
> at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41) ~[flink-dist-1.15.2.jar:1.15.2]
> at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325) ~[flink-dist-1.15.2.jar:1.15.2]
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
> ... 3 more {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)