Peter Bacsko created YUNIKORN-2629:
--------------------------------------
Summary: Adding a node can result in a deadlock
Key: YUNIKORN-2629
URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Peter Bacsko
Assignee: Peter Bacsko
Adding a new node after YuniKorn state initialization can result in a deadlock.
The problem is that {{Context.addNode()}} holds a lock while we're waiting for
the {{NodeAccepted}} event:
{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
    nodeEvent, ok := event.(CachedSchedulerNodeEvent)
    if !ok {
        return
    }
    [...] removed for clarity
    wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)

api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode(&si.NodeRequest{
    Nodes: nodesToRegister,
    RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
    log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
    return nil, err
}

// wait for all responses to accumulate
wg.Wait() <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
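Since {{addNode()}} is holding a write lock, the event processing loop gets
stuck in the task handler. The {{NodeAccepted}} event queued behind it is then
never dispatched, so {{wg.Wait()}} never returns and the lock is never released.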