[
https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718686#comment-17718686
]
SwathiChandrashekar commented on FLINK-31135:
---------------------------------------------
[~zhihaochen], looks like for jobId 07bfdfef145a87c2071965081aaff548, the JM logs
show that it tried to recover the job and to recover checkpoints 1012 - 1015. The
checkpoint pointers were present in the configMap
"parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map".
That configMap had already reached 1 MB, so any checkpoint triggered after that
point failed with the following error:
{"@timestamp":"2023-04-26T23:46:54.190Z","ecs.version":"1.2.0","log.level":"WARN","message":"An error occurred while writing checkpoint 11211 to the underlying metadata store. Flink was not able to determine whether the metadata was successfully persisted. The corresponding state located at 's3://eureka-flink-data-prod/parked-logs-ingestion-16818796-0c2923/checkpoints/07bfdfef145a87c2071965081aaff548/chk-11211' won't be discarded and needs to be cleaned up manually.","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
{"@timestamp":"2023-04-26T23:46:54.248Z","ecs.version":"1.2.0","log.level":"WARN","message":"Error while processing AcknowledgeCheckpoint message","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.jobmaster.JobMaster","error.type":"org.apache.flink.runtime.checkpoint.CheckpointException","error.message":"Could not complete the pending checkpoint 11211. Failure reason: Failure to finalize checkpoint.","error.stack_trace":"org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 11211. Failure reason: Failure to finalize checkpoint.\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.addCompletedCheckpointToStoreAndSubsumeOldest(CheckpointCoordinator.java:1404)\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1249)\n\tat org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1134)\n\tat org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)\n\tat org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: org.apache.flink.runtime.persistence.PossibleInconsistentStateException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.32.228.1/api/v1/namespaces/parked-logs-ingestion-16818796-0c2923/configmaps/parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map. Message: ConfigMap \"parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map\" is invalid: []: Too long: must have at most 1048576 bytes. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have at most 1048576 bytes, reason=FieldValueTooLong
These were the only logs present from the JM, and in the attached log I couldn't
find the entries from when the previous checkpoints were taken. Since this new JM
log shows the jobId being recovered, it is possible that there was a JM restart
and the job was restarted as well.
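One way to see how close an HA configMap is to the 1048576-byte limit, and which entries dominate it, is to dump the configMap (e.g. the attached dump_cm.yaml) and sum the byte sizes of its data entries. A minimal sketch in Python; the sample `data` dict and its key names below are made up for illustration, not taken from the actual dump:

```python
# Sketch: estimate how much of the 1 MiB ConfigMap limit the HA data uses.
# In practice `data` would come from parsing the output of
# `kubectl get configmap <job>-config-map -o yaml` with a YAML library;
# the entries here are hypothetical placeholders.

CONFIGMAP_LIMIT = 1048576  # bytes, the limit reported by the API server above

data = {
    "checkpointID-0000000000000001012": "s3://bucket/checkpoints/chk-1012/_metadata",
    "checkpointID-0000000000000001013": "s3://bucket/checkpoints/chk-1013/_metadata",
    "jobGraph-07bfdfef145a87c2071965081aaff548": "x" * 2048,
}

def configmap_data_size(data: dict) -> int:
    """Total bytes of keys plus values, a rough proxy for what counts toward the limit."""
    return sum(len(k.encode()) + len(v.encode()) for k, v in data.items())

size = configmap_data_size(data)
print(f"data size: {size} bytes ({100 * size / CONFIGMAP_LIMIT:.2f}% of limit)")
# Largest values first, to spot which entries are filling the configMap.
for k, v in sorted(data.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(f"  {k}: {len(v.encode())} bytes")
```

If a large number of checkpoint-pointer entries dominate the size, that would suggest completed checkpoints are not being removed from the HA store, which would fit the recovery of checkpoints 1012 - 1015 described above.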
> ConfigMap DataSize went > 1 MB and cluster stopped working
> ----------------------------------------------------------
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.2.0
> Reporter: Sriram Ganesh
> Priority: Major
> Attachments: dump_cm.yaml, image-2023-04-19-09-48-19-089.png,
> jobmanager_log.txt
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. Flink
> jobs failed with the below error. It seems the config map size went beyond
> 1 MB (the default limit).
> Since the cluster is managed by the operator and the config maps are not
> updated with any manual intervention, I suspect it could be an operator issue.
>
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: PUT at:
> https://<IP>/api/v1/namespaces/<NS>/configmaps/<job>-config-map. Message:
> ConfigMap "<job>-config-map" is invalid: []: Too long: must have at most
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422,
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must
> have at most 1048576 bytes, reason=FieldValueTooLong,
> additionalProperties={})], group=null, kind=ConfigMap, name=<job>-config-map,
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status,
> message=ConfigMap "<job>-config-map" is invalid: []: Too long: must have at
> most 1048576 bytes, metadata=ListMeta(_continue=null,
> remainingItemCount=null, resourceVersion=null, selfLink=null,
> additionalProperties={}), reason=Invalid, status=Failure,
> additionalProperties={}).
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325)
> ~[flink-dist-1.15.2.jar:1.15.2]
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
> ~[?:?]
> ... 3 more {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)