[
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Bacsko updated YUNIKORN-2629:
-----------------------------------
Description:
Adding a new node after YuniKorn state initialization can result in a deadlock.
The problem is that {{Context.addNode()}} holds the Context write lock while it
waits for the {{NodeAccepted}} event:
{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
    nodeEvent, ok := event.(CachedSchedulerNodeEvent)
    if !ok {
        return
    }
    [...] removed for clarity
    wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)

api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode(&si.NodeRequest{
    Nodes: nodesToRegister,
    RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
    log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
    return nil, err
}

// wait for all responses to accumulate
wg.Wait() // <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v) // <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
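Since {{addNode()}} is holding a write lock, the task handler blocks as soon as
it needs the Context lock, and the event processing loop gets stuck. The
{{NodeAccepted}} event queued behind it is never dispatched, {{wg.Done()}} is
never called, so {{registerNodes()}} will never progress.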
was:
Adding a new node after Yunikorn state initialization can result in a deadlock.
The problem is that {{Context.addNode()}} holds a lock while we're waiting for
the {{NodeAccepted}} event:
{noformat}
dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
    nodeEvent, ok := event.(CachedSchedulerNodeEvent)
    if !ok {
        return
    }
    [...] removed for clarity
    wg.Done()
})
defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)

api := ctx.apiProvider.GetAPIs().SchedulerAPI
if err := api.UpdateNode(&si.NodeRequest{
    Nodes: nodesToRegister,
    RmID:  schedulerconf.GetSchedulerConf().ClusterID,
}); err != nil {
    log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
    return nil, err
}

// wait for all responses to accumulate
wg.Wait() // <--- shim gets stuck here
{noformat}
If tasks are being processed, then the dispatcher will try to retrieve the
event handler, which is returned from Context:
{noformat}
go func() {
    for {
        select {
        case event := <-getDispatcher().eventChan:
            switch v := event.(type) {
            case events.TaskEvent:
                getEventHandler(EventTypeTask)(v) // <--- eventually calls Context.getTask()
            case events.ApplicationEvent:
                getEventHandler(EventTypeApp)(v)
            case events.SchedulerNodeEvent:
                getEventHandler(EventTypeNode)(v)
{noformat}
Since {{addNode()}} is holding a write lock, the event processing loop gets
stuck.
> Adding a node can result in a deadlock
> --------------------------------------
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Affects Versions: 1.5.0
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Blocker