[ 
https://issues.apache.org/jira/browse/YUNIKORN-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen updated YUNIKORN-1596:
------------------------------
    Description: 
We are seeing that when a scheduled pod requesting a PVC times out, it is 
marked as unschedulable. Such pods are not retried and remain in the 
'pending' state. While pods are pending, the autoscaler does not scale down 
nodes. This seems similar to the issue discussed here:

[https://github.com/kubernetes/autoscaler/issues/3409]

 
{quote}Error from YuniKorn logs:
ERROR cache/context.go:527 Failed to bind pod volumes \{"podName": "<PODNAME>", 
"nodeName": "<IP>", "dynamicProvisions": 1, "staticBindings": 0}
...
...
/workspace/pkg/cache/task.go:382
2023-02-20T00:02:22.368Z ERROR cache/task.go:265 task failed \{"appID": 
"<APPID>", "taskID": "45981d91-e543-459b-9657-bdc03b57e26f", "reason": "bind 
pod volumes failed, name: <NS/PODNAME>, binding volumes: timed out waiting for 
the condition"}
{quote}
 

{{From Autoscaler logs}}
{quote}I0220 20:47:01.775653 1 static_autoscaler.go:502] Scale down status: 
unneededOnly=true lastScaleUpTime=2023-02-20 19:20:56.429598603 +0000 UTC 
m=+249612.380355315 lastScaleDownDeleteTime=2023-02-20 06:36:50.929515212 +0000 
UTC m=+203766.880271921 lastScaleDownFailTime=2023-02-17 22:01:33.693397034 
+0000 UTC m=+49.644153730 scaleDownForbidden=true isDeleteInProgress=false 
scaleDownInCooldown=true
I0220 20:47:11.787999 1 static_autoscaler.go:228] Starting main loop
I0220 20:47:11.792789 1 filter_out_schedulable.go:65] Filtering out schedulables
I0220 20:47:11.792953 1 scheduler_binder.go:829] All bound volumes for Pod 
"<podname>" match with Node "<node>"
I0220 20:47:11.792981 1 filter_out_schedulable.go:118] Pod <podname> marked as 
unschedulable can be scheduled on node <node> (based on hinting). Ignoring in 
scale up.
 
{quote}

 # Can YuniKorn introduce retries for such scenarios?
 # Can pods be set to an error state after retries are exhausted?

{{Note: pod name, node name, and IP are masked above}}


> Pods marked unschedulable when dynamic PVC times out
> ----------------------------------------------------
>
>                 Key: YUNIKORN-1596
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1596
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Praveen
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
