[
https://issues.apache.org/jira/browse/FLINK-30315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643909#comment-17643909
]
Peter Vary commented on FLINK-30315:
------------------------------------
The {{ContainerStateWaiting}} contains the message that we want.
The issue is that:
- For {{ErrImagePull}} we have the correct message: {{Failed to pull image
"flink:1.14": rpc error: code = Unknown desc = context deadline exceeded}}
- For {{ImagePullBackOff}} we only have this message: {{Back-off pulling image
"flink:1.14"}} which is not that useful
Based on this, I think we have the following options:
# Throw {{DeploymentFailedException}} at {{ErrImagePull}} and provide the
enhanced message. Cons: This throws an error on the first image pull error,
whereas previously we retried at least once. (I am not sure this matters much,
as we continue to monitor the state of the deployment and act on state
changes anyway.)
# Store the {{ErrImagePull}} message in the state and provide it when
{{ImagePullBackOff}} is reported
I would like to hear your opinions about the options, and I am interested in
any alternatives you have in mind.
If there are no other opinions, I would go for option 1.
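To make the trade-off concrete, the two options can be sketched roughly as follows. This is only an illustration under assumptions, not the operator's actual code: {{ImagePullMessageSketch}}, the simplified {{Waiting}} class (a stand-in for the reason and message carried by {{ContainerStateWaiting}}), the local {{DeploymentFailedException}} stand-in, and the per-deployment map are all hypothetical names. Option 1 is assumed to keep failing on {{ImagePullBackOff}} as it does today.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names are taken from the operator code base. */
public class ImagePullMessageSketch {

    /** Simplified stand-in for the reason/message pair of a waiting container state. */
    public static final class Waiting {
        public final String reason;
        public final String message;

        public Waiting(String reason, String message) {
            this.reason = reason;
            this.message = message;
        }
    }

    /** Local stand-in for the operator's DeploymentFailedException. */
    public static final class DeploymentFailedException extends RuntimeException {
        public final String reason;

        public DeploymentFailedException(String message, String reason) {
            super(message);
            this.reason = reason;
        }
    }

    /** Option 1: fail as soon as ErrImagePull is seen, because its message is the detailed one. */
    public static void checkOption1(Waiting w) {
        if ("ErrImagePull".equals(w.reason) || "ImagePullBackOff".equals(w.reason)) {
            throw new DeploymentFailedException(w.message, w.reason);
        }
    }

    // Option 2 state: last detailed pull error seen, keyed by an assumed deployment name.
    private final Map<String, String> lastPullError = new HashMap<>();

    /** Option 2: remember the ErrImagePull message and attach it when ImagePullBackOff arrives. */
    public void checkOption2(String deployment, Waiting w) {
        if ("ErrImagePull".equals(w.reason)) {
            lastPullError.put(deployment, w.message);
            return; // do not fail yet, so the image pull is retried at least once
        }
        if ("ImagePullBackOff".equals(w.reason)) {
            String detail = lastPullError.get(deployment);
            String msg = detail == null ? w.message : w.message + " (" + detail + ")";
            throw new DeploymentFailedException(msg, w.reason);
        }
    }
}
```

In this sketch option 1 needs no extra state, while option 2 keeps the retry behavior at the cost of tracking the last pull error per deployment.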
> Add more information about image pull failures to the operator log
> ------------------------------------------------------------------
>
> Key: FLINK-30315
> URL: https://issues.apache.org/jira/browse/FLINK-30315
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Peter Vary
> Priority: Major
>
> When there is an image pull error, this is what we see in the operator log:
> {code:java}
> org.apache.flink.kubernetes.operator.exception.DeploymentFailedException: Back-off pulling image "flink:1.14"
>     at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.checkContainerBackoff(AbstractFlinkDeploymentObserver.java:194)
>     at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeJmDeployment(AbstractFlinkDeploymentObserver.java:150)
>     at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:84)
>     at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:55)
>     at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:56)
>     at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:32)
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:113)
>     at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
>     at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
>     at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
>     at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
>     at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
>     at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
>     at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> This is the information we have on the Kubernetes side:
> {code}
> Normal Scheduled 2m19s default-scheduler Successfully assigned default/quickstart-base-86787586cd-lb7j6 to minikube
> Warning Failed 20s kubelet Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded
> *Warning Failed 20s kubelet Error*: ErrImagePull
> Normal BackOff 19s kubelet Back-off pulling image "flink:1.14"
> *Warning Failed 19s kubelet Error*: ImagePullBackOff
> Normal Pulling 7s (x2 over 2m19s) kubelet Pulling image "flink:1.14"
> {code}
> It would be good to add the additional message (in this case {{Failed to pull
> image "flink:1.14": rpc error: code = Unknown desc = context deadline
> exceeded}}) to the message of the {{DeploymentFailedException}} for
> traceability.