[ https://issues.apache.org/jira/browse/FLINK-31135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713796#comment-17713796 ]
Zhihao Chen commented on FLINK-31135:
-------------------------------------
Hi [~Swathi Chandrashekar], in my case state.checkpoints.num-retained is always
set to 5 for our Flink jobs, but it looks like that value is not respected.
Please see the snippet below, taken from the FlinkDeployment resource created
via the flink-kubernetes-operator.
{code:yaml}
apiVersion: v1
items:
- apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    creationTimestamp: "2023-04-04T03:02:25Z"
    finalizers:
    - flinkdeployments.flink.apache.org/finalizer
    generation: 2
    labels:
      instanceId: parked-logs-ingestion-16805773-a96408
      jobName: parked-logs-ingestion-16805773
    name: parked-logs-ingestion-16805773-a96408
    namespace: parked-logs-ingestion-16805773-a96408
    resourceVersion: "533476748"
    uid: 182b9c7e-74cc-490b-8045-9fddaa7b8aa9
  spec:
    flinkConfiguration:
      execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
      execution.checkpointing.interval: "60000"
      execution.checkpointing.max-concurrent-checkpoints: "1"
      execution.checkpointing.min-pause: 5s
      execution.checkpointing.mode: EXACTLY_ONCE
      execution.checkpointing.prefer-checkpoint-for-recovery: "true"
      execution.checkpointing.timeout: 60min
      high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
      high-availability.storageDir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/ha
      jobmanager.memory.process.size: 1024m
      metrics.reporter.stsd.factory.class: org.apache.flink.metrics.statsd.StatsDReporterFactory
      metrics.reporter.stsd.host: localhost
      metrics.reporter.stsd.interval: 30 SECONDS
      metrics.reporter.stsd.port: "8125"
      metrics.reporters: stsd
      metrics.scope.jm: jobmanager
      metrics.scope.jm.job: jobmanager.<job_name>
      metrics.scope.operator: taskmanager.<job_name>.<operator_name>
      metrics.scope.task: taskmanager.<job_name>.<task_name>
      metrics.scope.tm: taskmanager
      metrics.scope.tm.job: taskmanager.<job_name>
      metrics.system-resource: "true"
      metrics.system-resource-probing-interval: "30000"
      restart-strategy: fixed-delay
      restart-strategy.fixed-delay.attempts: "2147483647"
      state.backend: hashmap
      state.checkpoint-storage: filesystem
      state.checkpoints.dir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/checkpoints
      state.checkpoints.num-retained: "5"
      state.savepoints.dir: s3://eureka-flink-data-prod/parked-logs-ingestion-16805773-a96408/savepoints
      taskmanager.memory.managed.size: "0"
      taskmanager.memory.network.fraction: "0.1"
      taskmanager.memory.network.max: 1000m
      taskmanager.memory.network.min: 64m
      taskmanager.memory.process.size: 2048m
      taskmanager.numberOfTaskSlots: "10"
      web.cancel.enable: "false"
    flinkVersion: v1_15
{code}
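For what it's worth, this is roughly how one can inspect the HA ConfigMap to
see how many checkpoint pointers are actually retained. It is only a sketch
against the fabric8 client that already ships in flink-dist: the namespace and
ConfigMap name below are placeholders for our deployment, and the
checkpointID- key prefix is an assumption based on the 1.15 Kubernetes HA
store, not something confirmed in this ticket.
{code:java}
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

public class HaConfigMapCheck {
    public static void main(String[] args) {
        // Placeholders: substitute the namespace and HA ConfigMap name of your job.
        String namespace = "parked-logs-ingestion-16805773-a96408";
        String configMapName = "parked-logs-ingestion-16805773-a96408-config-map";

        try (KubernetesClient client = new DefaultKubernetesClient()) {
            ConfigMap cm = client.configMaps()
                    .inNamespace(namespace)
                    .withName(configMapName)
                    .get();

            Map<String, String> data =
                    (cm == null || cm.getData() == null)
                            ? Collections.emptyMap() : cm.getData();

            // Total payload size; the apiserver rejects ConfigMaps above 1048576 bytes.
            int totalBytes = data.values().stream()
                    .mapToInt(v -> v.getBytes(StandardCharsets.UTF_8).length)
                    .sum();

            // One entry per retained checkpoint, keyed with a "checkpointID-"
            // prefix (assumption based on the 1.15 Kubernetes HA store).
            long checkpointEntries = data.keySet().stream()
                    .filter(k -> k.startsWith("checkpointID-"))
                    .count();

            System.out.printf("data: %d bytes, checkpoint entries: %d%n",
                    totalBytes, checkpointEntries);
        }
    }
}
{code}
If more than 5 checkpointID- entries show up there, the retention setting is
clearly not being applied to the HA store.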
I hit the same issue before we switched to the flink-kubernetes-operator. At
that time we were using a Flink standalone deployment on Kubernetes with
state.checkpoints.num-retained set to 5, and saw the same behaviour.
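As a sanity check on the option itself: state.checkpoints.num-retained is
CheckpointingOptions.MAX_RETAINED_CHECKPOINTS in the Flink API, so a minimal,
hypothetical job skeleton to confirm the value actually reaches the job
configuration could look like this (the pipeline body is omitted):
{code:java}
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NumRetainedSanityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Same key as state.checkpoints.num-retained in flinkConfiguration.
        conf.setInteger(CheckpointingOptions.MAX_RETAINED_CHECKPOINTS, 5);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // ... build and execute the actual pipeline here (omitted).

        // Log what actually took effect; with Kubernetes HA each retained
        // checkpoint also becomes an entry in the HA ConfigMap.
        System.out.println("num-retained = "
                + conf.get(CheckpointingOptions.MAX_RETAINED_CHECKPOINTS));
    }
}
{code}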
> ConfigMap DataSize went > 1 MB and cluster stopped working
> ----------------------------------------------------------
>
> Key: FLINK-31135
> URL: https://issues.apache.org/jira/browse/FLINK-31135
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.2.0
> Reporter: Sriram Ganesh
> Priority: Major
> Attachments: dump_cm.yaml
>
>
> I am using the Flink Operator to manage clusters. Flink version: 1.15.2.
> Flink jobs failed with the below error. It seems the ConfigMap size went
> beyond 1 MB (the default limit).
> Since the cluster is managed by the operator and the ConfigMaps are not
> updated by any manual intervention, I suspect it could be an operator issue.
>
> {code:java}
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://<IP>/api/v1/namespaces/<NS>/configmaps/<job>-config-map. Message: ConfigMap "<job>-config-map" is invalid: []: Too long: must have at most 1048576 bytes. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=[], message=Too long: must have at most 1048576 bytes, reason=FieldValueTooLong, additionalProperties={})], group=null, kind=ConfigMap, name=<job>-config-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=ConfigMap "<job>-config-map" is invalid: []: Too long: must have at most 1048576 bytes, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:673) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:560) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:347) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:327) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:781) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:183) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:188) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:130) ~[flink-dist-1.15.2.jar:1.15.2]
>     at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:41) ~[flink-dist-1.15.2.jar:1.15.2]
>     at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:325) ~[flink-dist-1.15.2.jar:1.15.2]
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
>     ... 3 more
> {code}