navinvishy commented on PR #357: URL: https://github.com/apache/flink-kubernetes-operator/pull/357#issuecomment-2475381388
Hi @gyfora, we have a situation where nodes are periodically taken down for maintenance, causing jobs to restart. A few restarts of a job within a few hours often result in increasing consumer lag. It appears that the task-local recovery techniques described [here](https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/ops/state/large_state_tuning/#task-local-recovery) and [here](https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/#enabling-local-recovery-across-pod-restarts) would address this. Since we use the Flink Kubernetes Operator, I landed on this PR while looking for ways to enable this in the operator. Happy to take this forward if necessary.
