[
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg updated YUNIKORN-2629:
--------------------------------------------
Attachment: updateNode_deadlock_trace.txt
> Adding a node can result in a deadlock
> --------------------------------------
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Affects Versions: 1.5.0
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Blocker
> Attachments: updateNode_deadlock_trace.txt
>
>
> Adding a new node after YuniKorn state initialization can result in a
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while it waits
> for the {{NodeAccepted}} event:
> {noformat}
> dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>     nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>     if !ok {
>         return
>     }
>     [...] removed for clarity
>     wg.Done()
> })
> defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
> if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>     Nodes: nodesToRegister,
>     RmID:  schedulerconf.GetSchedulerConf().ClusterID,
> }); err != nil {
>     log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>     return nil, err
> }
> // wait for all responses to accumulate
> wg.Wait() <--- shim gets stuck here
> {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>     for {
>         select {
>         case event := <-getDispatcher().eventChan:
>             switch v := event.(type) {
>             case events.TaskEvent:
>                 getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask()
>             case events.ApplicationEvent:
>                 getEventHandler(EventTypeApp)(v)
>             case events.SchedulerNodeEvent:
>                 getEventHandler(EventTypeNode)(v)
> {noformat}
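> Since {{addNode()}} is holding a write lock, the event processing loop gets
> stuck on the task event. The {{NodeAccepted}} event queued behind it is
> never dispatched, {{wg.Done()}} is never called, and {{registerNodes()}}
> never progresses.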
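
The lock ordering described above can be reproduced in isolation. The following is a minimal sketch, not YuniKorn code and not necessarily the fix that was applied: shimContext, registerNode and the two demo functions are hypothetical names, and a timeout stands in for the real hang so that the program terminates. It assumes the task handler needs a read lock on the same mutex that the registering code holds for writing.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// shimContext is a hypothetical stand-in for the shim's Context.
type shimContext struct {
	lock  sync.RWMutex
	tasks map[string]string
}

// getTask mimics a task event handler that needs the read lock.
func (c *shimContext) getTask(id string) string {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return c.tasks[id]
}

// registerNode queues a task event followed by a "node accepted" event and
// waits for the latter. It reports whether the wait finished before timeout.
func (c *shimContext) registerNode(holdLockWhileWaiting bool, timeout time.Duration) bool {
	var wg sync.WaitGroup
	wg.Add(1)

	// A single dispatcher goroutine handles events strictly in order.
	events := make(chan func(), 2)
	go func() {
		for handle := range events {
			handle()
		}
	}()
	defer close(events)

	c.lock.Lock()
	// The task event blocks the dispatcher for as long as the write lock is held.
	events <- func() { c.getTask("task-1") }
	// The "node accepted" event is stuck behind the task event.
	events <- wg.Done
	if !holdLockWhileWaiting {
		// Variant without the deadlock: release the lock before waiting.
		c.lock.Unlock()
	}

	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	accepted := false
	select {
	case <-done:
		accepted = true
	case <-time.After(timeout): // stands in for the indefinite hang
	}
	if holdLockWhileWaiting {
		c.lock.Unlock()
	}
	return accepted
}

func demoHoldLockWhileWaiting() bool {
	c := &shimContext{tasks: map[string]string{"task-1": "pending"}}
	return c.registerNode(true, 200*time.Millisecond)
}

func demoReleaseBeforeWaiting() bool {
	c := &shimContext{tasks: map[string]string{"task-1": "pending"}}
	return c.registerNode(false, 2*time.Second)
}

func main() {
	fmt.Println("hold lock while waiting, accepted:", demoHoldLockWhileWaiting())
	fmt.Println("release before waiting, accepted:", demoReleaseBeforeWaiting())
}
```

In this sketch the first variant times out because the dispatcher never reaches the second event, while the second variant completes once the write lock is released before the wait. Whether releasing the lock early is safe in the real shim depends on what the lock protects during node registration, which this sketch does not model.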