[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718686#comment-17718686 ]

SwathiChandrashekar commented on FLINK-31135:
---------------------------------------------

[~zhihaochen], for jobId 07bfdfef145a87c2071965081aaff548 the JM logs show that the 
job was recovered and that checkpoints 1012 - 1015 were recovered along with it. The 
pointers to these checkpoints were present in the ConfigMap 
"parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map".

That ConfigMap had already reached 1 MB, so every checkpoint triggered after that 
failed with the following error:

{"@timestamp":"2023-04-26T23:46:54.190Z","ecs.version":"1.2.0","log.level":"WARN","message":"An
 error occurred while writing checkpoint 11211 to the underlying metadata 
store. Flink was not able to determine whether the metadata was successfully 
persisted. The corresponding state located at 
's3://eureka-flink-data-prod/parked-logs-ingestion-16818796-0c2923/checkpoints/07bfdfef145a87c2071965081aaff548/{*}chk-11211'
 won't be discarded and needs to be cleaned up 
manually.{*}","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator"}
 
{"@timestamp":"2023-04-26T23:46:54.248Z","ecs.version":"1.2.0","log.level":"WARN","message":"{*}Error
 while processing AcknowledgeCheckpoint{*} 
message","process.thread.name":"jobmanager-io-thread-1","log.logger":"org.apache.flink.runtime.jobmaster.JobMaster","error.type":"org.apache.flink.runtime.checkpoint.CheckpointException","error.message":"Could
 not complete the pending checkpoint 11211. Failure reason: Failure to finalize 
checkpoint.","error.stack_trace":"org.apache.flink.runtime.checkpoint.CheckpointException:
 Could not complete the pending checkpoint 11211. Failure reason: Failure to 
finalize checkpoint.\n\tat 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.addCompletedCheckpointToStoreAndSubsumeOldest(CheckpointCoordinator.java:1404)\n\tat
 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1249)\n\tat
 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1134)\n\tat
 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)\n\tat
 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)\n\tat
 java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: 
org.apache.flink.runtime.persistence.PossibleInconsistentStateException: 
io.fabric8.kubernetes.client.KubernetesClientException: *Failure executing: PUT 
at: 
https://10.32.228.1/api/v1/namespaces/parked-logs-ingestion-16818796-0c2923/configmaps/parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map.
 Message: ConfigMap 
\"parked-logs-ingestion-16818796-0c2923-07bfdfef145a87c2071965081aaff548-config-map\"
 is invalid: []: Too long: must have at most 1048576 bytes. Received status: 
Status(apiVersion=v1, code=422, 
details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have 
at most 1048576 bytes, reason=FieldValueTooLong*

These were the only logs present from the JM; in the attached log I couldn't find 
the entries from when the previous checkpoints were taken. Since this new JM log 
shows the job being recovered, it is possible that the JM restarted and the job was 
restarted along with it.
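
In case it helps with diagnosis, below is a rough sketch (not Flink code; the class 
name and arguments are made up for illustration) of how the size of the HA ConfigMap 
data can be checked with the fabric8 client that already ships in flink-dist, to see 
how close it is to the 1048576-byte limit the API server enforces. Note that the 
limit applies to the whole serialized object, so summing only the data entries gives 
an approximation.

{code:java}
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Illustrative helper: prints the approximate data size of a Flink HA ConfigMap. */
public class ConfigMapSizeCheck {

    // Limit enforced by the Kubernetes API server, matching the error above.
    private static final long CONFIGMAP_LIMIT_BYTES = 1048576L;

    public static void main(String[] args) {
        String namespace = args[0]; // e.g. parked-logs-ingestion-16818796-0c2923
        String name = args[1];      // e.g. <cluster-id>-<job-id>-config-map

        try (KubernetesClient client = new DefaultKubernetesClient()) {
            ConfigMap cm = client.configMaps().inNamespace(namespace).withName(name).get();
            long total = 0L;
            if (cm != null && cm.getData() != null) {
                // Sum the UTF-8 size of every key/value pair in the data section.
                for (Map.Entry<String, String> entry : cm.getData().entrySet()) {
                    total += entry.getKey().getBytes(StandardCharsets.UTF_8).length;
                    total += entry.getValue().getBytes(StandardCharsets.UTF_8).length;
                }
            }
            System.out.printf("ConfigMap %s/%s data is roughly %d of %d bytes%n",
                    namespace, name, total, CONFIGMAP_LIMIT_BYTES);
        }
    }
}
{code}

Running something like this against the job's -config-map right after a recovery 
would show whether the retained checkpoint pointers alone already bring it close to 
the limit.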

 

> ConfigMap DataSize went > 1 MB and cluster stopped working
> ----------------------------------------------------------
>
>                 Key: FLINK-31135
>                 URL: https://issues.apache.org/jira/browse/FLINK-31135
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.2.0
>            Reporter: Sriram Ganesh
>            Priority: Major
>         Attachments: dump_cm.yaml, image-2023-04-19-09-48-19-089.png, 
> jobmanager_log.txt
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2. Flink 
> jobs failed with the below error. It seems the ConfigMap size went beyond 1 MB 
> (the default limit). 
> Since the cluster is managed by the operator and the ConfigMaps are not updated 
> by any manual intervention, I suspect it could be an operator issue. 
>  
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: PUT at: 
> https://<IP>/api/v1/namespaces/<NS>/configmaps/<job>-config-map. Message: 
> ConfigMap "<job>-config-map" is invalid: []: Too long: must have at most 
> 1048576 bytes. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must 
> have at most 1048576 bytes, reason=FieldValueTooLong, 
> additionalProperties={})], group=null, kind=ConfigMap, name=<job>-config-map, 
> retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, 
> message=ConfigMap "<job>-config-map" is invalid: []: Too long: must have at 
> most 1048576 bytes, metadata=ListMeta(_continue=null, 
> remainingItemCount=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325)
>  ~[flink-dist-1.15.2.jar:1.15.2]
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>  ~[?:?]
> ... 3 more {code}



