[ 
https://issues.apache.org/jira/browse/YUNIKORN-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko resolved YUNIKORN-3123.
------------------------------------
    Fix Version/s: 1.8.0
       Resolution: Fixed

> Add retry logic to AssumePod to prevent PV races
> ------------------------------------------------
>
>                 Key: YUNIKORN-3123
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3123
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.8.0
>
>
> Internally, we ran into a strange problem that occurs on OpenShift. It seems 
> to be related to how ephemeral volumes are handled by LSO (Local Storage 
> Operator).
> {noformat}
> Events:
>   Type     Reason          Age   From      Message
>   ----     ------          ----  ----      -------
>   Normal   Scheduling      22m   yunikorn  impala-1755495449-zgbl/impala-executor-000-0 is queued and waiting for allocation
>   Warning  AssumePodError  22m   yunikorn  pod impala-executor-000-0 has conflicting volume claims: node(s) didn't find available persistent volumes to bind
>   Normal   TaskFailed      22m   yunikorn  Task impala-1755495449-zgbl/impala-executor-000-0 is failed
> {noformat}
> The underlying issue is very likely a race condition between two separate 
> volumeBinder instances. The one inside the {{VolumeBinding}} plugin already 
> sees the volume when the predicates are evaluated, so the node is considered 
> a fit for the given pod. After the core completes scheduling, 
> {{context.AssumePod()}} is called, which makes yet another call to 
> {{SchedulerVolumeBinder.FindPodVolumes()}}. However, this instance hasn't 
> received the update about the volumes being ready yet, so it returns an error. 
> This also means that the bug is very sensitive to network latency.
>  
> It's difficult to reproduce. Our suggestion is to add simple retry logic 
> around {{AssumePod()}}, as sketched below.
>  
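> A minimal sketch of what such retry logic might look like (Go; the names 
> {{retryAssumePod}}, {{maxAssumeRetries}} and the fixed backoff are 
> illustrative assumptions, not the actual patch):
> {code:go}
> package main
> 
> import (
> 	"errors"
> 	"fmt"
> 	"time"
> )
> 
> const (
> 	maxAssumeRetries = 3                      // illustrative retry budget
> 	assumeRetryDelay = 500 * time.Millisecond // illustrative backoff between attempts
> )
> 
> // retryAssumePod calls assume() up to maxAssumeRetries times, sleeping between
> // attempts so a late volumeBinder update about bound PVs has a chance to land.
> func retryAssumePod(assume func() error) error {
> 	var err error
> 	for attempt := 1; attempt <= maxAssumeRetries; attempt++ {
> 		if err = assume(); err == nil {
> 			return nil
> 		}
> 		if attempt < maxAssumeRetries {
> 			time.Sleep(assumeRetryDelay)
> 		}
> 	}
> 	return fmt.Errorf("AssumePod still failing after %d attempts: %w", maxAssumeRetries, err)
> }
> 
> func main() {
> 	// Usage example with a stand-in for context.AssumePod() that succeeds on the third try.
> 	calls := 0
> 	assume := func() error {
> 		calls++
> 		if calls < 3 {
> 			return errors.New("conflicting volume claims")
> 		}
> 		return nil
> 	}
> 	fmt.Println(retryAssumePod(assume)) // <nil>
> }
> {code}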



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
