[jira] [Commented] (FLINK-30315) Add more information about image pull failures to the operator log

Peter Vary (Jira) Tue, 06 Dec 2022 07:05:34 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-30315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643909#comment-17643909
 ]


Peter Vary commented on FLINK-30315:
------------------------------------

The {{ContainerStateWaiting}} contains the message that we want.
The issue is that:
 - For {{ErrImagePull}} we have the correct message: {{Failed to pull image 
"flink:1.14": rpc error: code = Unknown desc = context deadline exceeded}}
 - For {{ImagePullBackOff}} we only have this message: {{Back-off pulling image 
"flink:1.14"}} which is not that useful

Based on this, I think we have the following options:
 # Throw {{DeploymentFailedException}} at {{ErrImagePull}} and provide the 
enhanced message. Cons: this fails on the first image pull error, whereas 
previously we retried at least once. (I am not sure this matters much, as we 
continue to monitor the state of the deployment and act on state changes 
anyway.)
 # Store the message in the state and include it when we later fail on 
{{ImagePullBackOff}}

I would like to hear your opinions about these options, and I am interested in 
any alternatives you have in mind.



Unless there are differing opinions, I will go with option 1.
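
To make the two options concrete, here is a minimal sketch. It is not the operator's code: {{Waiting}}, {{DeploymentFailed}}, {{checkFailFast}} and {{RememberingChecker}} are hypothetical stand-ins for {{ContainerStateWaiting}}, {{DeploymentFailedException}} and the observer's back-off check, and I am assuming that option 1 keeps failing on {{ImagePullBackOff}} as well.

```java
/** Sketch only: every type here is a hypothetical stand-in, not an operator class. */
final class ImagePullSketch {

    /** Stand-in for a container's waiting state: a reason plus a detail message. */
    static final class Waiting {
        final String reason;
        final String message;

        Waiting(String reason, String message) {
            this.reason = reason;
            this.message = message;
        }
    }

    /** Stand-in for the operator's DeploymentFailedException. */
    static final class DeploymentFailed extends RuntimeException {
        final String reason;

        DeploymentFailed(String reason, String message) {
            super(message);
            this.reason = reason;
        }
    }

    /** Option 1: fail on the first ErrImagePull, which already carries the detailed message. */
    static void checkFailFast(Waiting w) {
        if ("ErrImagePull".equals(w.reason) || "ImagePullBackOff".equals(w.reason)) {
            throw new DeploymentFailed(w.reason, w.message);
        }
    }

    /** Option 2: remember the ErrImagePull detail and attach it once the back-off is reported. */
    static final class RememberingChecker {
        private String lastPullError; // null until an ErrImagePull has been observed

        void check(Waiting w) {
            if ("ErrImagePull".equals(w.reason)) {
                lastPullError = w.message; // keep the detail, do not fail yet
                return;
            }
            if ("ImagePullBackOff".equals(w.reason)) {
                String detail = lastPullError == null ? "" : " (" + lastPullError + ")";
                throw new DeploymentFailed(w.reason, w.message + detail);
            }
        }
    }

    public static void main(String[] args) {
        Waiting err = new Waiting("ErrImagePull",
                "Failed to pull image \"flink:1.14\": rpc error: code = Unknown desc = context deadline exceeded");
        Waiting backOff = new Waiting("ImagePullBackOff", "Back-off pulling image \"flink:1.14\"");

        try {
            checkFailFast(err);
        } catch (DeploymentFailed e) {
            System.out.println("Option 1: " + e.getMessage());
        }

        RememberingChecker checker = new RememberingChecker();
        checker.check(err); // stored only, nothing thrown
        try {
            checker.check(backOff);
        } catch (DeploymentFailed e) {
            System.out.println("Option 2: " + e.getMessage());
        }
    }
}
```

Option 1 needs no extra state, which is the main reason I lean towards it. Option 2 keeps the current retry behaviour, but the stored message has to survive between observe calls.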

> Add more information about image pull failures to the operator log
> ------------------------------------------------------------------
>
>                 Key: FLINK-30315
>                 URL: https://issues.apache.org/jira/browse/FLINK-30315
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Peter Vary
>            Priority: Major
>
> When there is an image pull error, this is what we see in the operator log:
> {code:java}
> org.apache.flink.kubernetes.operator.exception.DeploymentFailedException: Back-off pulling image "flink:1.14"
>       at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.checkContainerBackoff(AbstractFlinkDeploymentObserver.java:194)
>       at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeJmDeployment(AbstractFlinkDeploymentObserver.java:150)
>       at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:84)
>       at org.apache.flink.kubernetes.operator.observer.deployment.AbstractFlinkDeploymentObserver.observeInternal(AbstractFlinkDeploymentObserver.java:55)
>       at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:56)
>       at org.apache.flink.kubernetes.operator.observer.AbstractFlinkResourceObserver.observe(AbstractFlinkResourceObserver.java:32)
>       at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:113)
>       at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
>       at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:136)
>       at io.javaoperatorsdk.operator.processing.Controller$1.execute(Controller.java:94)
>       at org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:80)
>       at io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:93)
>       at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:130)
>       at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:110)
>       at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:81)
>       at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:54)
>       at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:406)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>       at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> This is the information we have on the Kubernetes side:
> {code}
> Normal   Scheduled  2m19s               default-scheduler  Successfully assigned default/quickstart-base-86787586cd-lb7j6 to minikube
> Warning  Failed     20s                 kubelet            Failed to pull image "flink:1.14": rpc error: code = Unknown desc = context deadline exceeded
> Warning  Failed     20s                 kubelet            Error: ErrImagePull
> Normal   BackOff    19s                 kubelet            Back-off pulling image "flink:1.14"
> Warning  Failed     19s                 kubelet            Error: ImagePullBackOff
> Normal   Pulling    7s (x2 over 2m19s)  kubelet            Pulling image "flink:1.14"
> {code}
> It would be good to add the additional message (in this case {{Failed to pull 
> image "flink:1.14": rpc error: code = Unknown desc = context deadline 
> exceeded}}) to the message of the {{DeploymentFailedException}} for 
> traceability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
