[
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718663#comment-17718663
]
SwathiChandrashekar edited comment on FLINK-31135 at 5/2/23 5:35 PM:
---------------------------------------------------------------------
[~zhihaochen] , it looks like Flink was hitting an inconsistent-state exception while writing the checkpoint metadata to S3. As a result, it logged the following error for every checkpoint and asked for manual cleanup, so those checkpoints were never considered for automatic cleanup.
{code}
{"@timestamp":"2023-04-26T23:53:30.704Z","ecs.version":"1.2.0","log.level":"WARN","message":"An error occurred while writing checkpoint 11217 to the underlying metadata store. Flink was not able to determine whether the metadata was successfully persisted. The corresponding state located at 's3://eureka-flink-data-prod/parked-logs-ingestion-16818796-0c2923/checkpoints/07bfdfef145a87c2071965081aaff548/chk-11217' won't be discarded and needs to be cleaned up manually.","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
{code}
Link: [https://github.com/apache/flink/blob/release-1.15/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1389]
Because this error occurred for every checkpoint due to the state inconsistency, a large number of checkpoints accumulated in S3, and the metadata tracked for them eventually pushed the ConfigMap size past the 1 MB limit.
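A quick way to confirm this failure mode is to compare a ConfigMap's serialized size against the 1048576-byte limit that appears in the error. A minimal sketch, assuming the ConfigMap has been dumped to a file (e.g. via `kubectl get configmap <job>-config-map -o yaml > dump_cm.yaml`; the names here are placeholders, not from this cluster):

```shell
# Kubernetes rejects objects whose serialized form exceeds 1048576 bytes (1 MiB);
# that is the "Too long: must have at most 1048576 bytes" error in this report.
check_cm_size() {
  local limit=1048576
  # Arithmetic expansion normalizes wc output (BSD wc pads with spaces).
  local size=$(( $(wc -c < "$1") ))
  if [ "$size" -gt "$limit" ]; then
    echo "too-large:$size"
  else
    echo "ok:$size"
  fi
}
```

For example, `check_cm_size dump_cm.yaml` would show whether the attached dump is already over the limit.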
> ConfigMap DataSize went > 1 MB and cluster stopped working
> ----------------------------------------------------------
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.2.0
> Reporter: Sriram Ganesh
> Priority: Major
> Attachments: dump_cm.yaml, image-2023-04-19-09-48-19-089.png,
> jobmanager_log.txt
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. Flink jobs
> failed with the error below. It seems the ConfigMap size went beyond 1 MB
> (the default limit).
> Since the cluster is managed by the operator and the ConfigMaps are not updated with any
> manual intervention, I suspect it could be an operator issue.
>
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: PUT at:
> https://<IP>/api/v1/namespaces/<NS>/configmaps/<job>-config-map. Message:
> ConfigMap "<job>-config-map" is invalid: []: Too long: must have at most
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422,
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must
> have at most 1048576 bytes, reason=FieldValueTooLong,
> additionalProperties={})], group=null, kind=ConfigMap, name=<job>-config-map,
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status,
> message=ConfigMap "<job>-config-map" is invalid: []: Too long: must have at
> most 1048576 bytes, metadata=ListMeta(_continue=null,
> remainingItemCount=null, resourceVersion=null, selfLink=null,
> additionalProperties={}), reason=Invalid, status=Failure,
> additionalProperties={}).
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
> ~[?:?]
> ... 3 more {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)