[ 
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847123#comment-17847123
 ] 

Wilfred Spiegelenburg commented on YUNIKORN-2629:
-------------------------------------------------

I think we need to look at the context lock in the k8shim in general.

The context lock is held while we do non-context work. There is no need to 
hold the lock if all we do is wait for a response that might or might not 
trigger post-processing.

> Adding a node can result in a deadlock
> --------------------------------------
>
>                 Key: YUNIKORN-2629
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 1.5.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Blocker
>
> Adding a new node after Yunikorn state initialization can result in a 
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting 
> for the {{NodeAccepted}} event:
> {noformat}
>       dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>               nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>               if !ok {
>                       return
>               }
>               [...] removed for clarity
>               wg.Done()
>       })
>       defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
>       if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>               Nodes: nodesToRegister,
>               RmID:  schedulerconf.GetSchedulerConf().ClusterID,
>       }); err != nil {
>               log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>               return nil, err
>       }
>       // wait for all responses to accumulate
>       wg.Wait()  <--- shim gets stuck here
>  {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the 
> event handler, which is returned from Context:
> {noformat}
> go func() {
>               for {
>                       select {
>                       case event := <-getDispatcher().eventChan:
>                               switch v := event.(type) {
>                               case events.TaskEvent:
>                                       getEventHandler(EventTypeTask)(v)  <--- eventually calls Context.getTask()
>                               case events.ApplicationEvent:
>                                       getEventHandler(EventTypeApp)(v)
>                               case events.SchedulerNodeEvent:
>                                       getEventHandler(EventTypeNode)(v)  
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets 
> stuck, so {{registerNodes()}} will never progress.



