[
https://issues.apache.org/jira/browse/YUNIKORN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17862849#comment-17862849
]
Peter Bacsko commented on YUNIKORN-2629:
----------------------------------------
[~jshmchenxi] thanks, this is the same problem that [~dimm] talked about under
YUNIKORN-2646. It's a bit difficult to say anything right now, because it might
not be a false positive after all.
If it happens again, could you do what I asked in this comment?
https://issues.apache.org/jira/browse/YUNIKORN-2646?focusedCommentId=17856602&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17856602
Basically, get a goroutine dump to see where and how it got stuck. I'm
wondering if this is related to preemption. We need to isolate the part of the
code that causes it, find the root cause, and then come up with a solution.
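
For reference, a goroutine dump of any Go process can be produced from inside
the process with the standard runtime/pprof package. The sketch below is a
generic illustration only: the helper name goroutineDump is hypothetical, and
this is not necessarily the method described in the linked comment or the way
the scheduler itself exposes its stacks.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// goroutineDump (hypothetical helper) returns the stack traces of all
// goroutines in the current process. With debug=2 the profile is written
// in the same text format as the trace of an unrecovered panic.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	dump, err := goroutineDump()
	if err != nil {
		fmt.Println("dump failed:", err)
		return
	}
	// Each blocked goroutine appears with its wait reason, for example
	// a goroutine parked on a mutex or on a WaitGroup.
	fmt.Print(dump)
}
```

In a dump taken during the hang, the interesting entries are the goroutines
that are blocked on a lock or in a wait call, together with the call chain
that led there.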
> Adding a node can result in a deadlock
> --------------------------------------
>
> Key: YUNIKORN-2629
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2629
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Affects Versions: 1.5.0
> Reporter: Peter Bacsko
> Assignee: Peter Bacsko
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.5.2
>
> Attachments: updateNode_deadlock_trace.txt,
> yunikorn-scheduler-20240627.log
>
>
> Adding a new node after YuniKorn state initialization can result in a
> deadlock.
> The problem is that {{Context.addNode()}} holds a lock while we're waiting
> for the {{NodeAccepted}} event:
> {noformat}
> dispatcher.RegisterEventHandler(handlerID, dispatcher.EventTypeNode, func(event interface{}) {
>     nodeEvent, ok := event.(CachedSchedulerNodeEvent)
>     if !ok {
>         return
>     }
>     [...] removed for clarity
>     wg.Done()
> })
> defer dispatcher.UnregisterEventHandler(handlerID, dispatcher.EventTypeNode)
> if err := ctx.apiProvider.GetAPIs().SchedulerAPI.UpdateNode(&si.NodeRequest{
>     Nodes: nodesToRegister,
>     RmID:  schedulerconf.GetSchedulerConf().ClusterID,
> }); err != nil {
>     log.Log(log.ShimContext).Error("Failed to register nodes", zap.Error(err))
>     return nil, err
> }
> // wait for all responses to accumulate
> wg.Wait() <--- shim gets stuck here
> {noformat}
> If tasks are being processed, then the dispatcher will try to retrieve the
> event handler, which is returned from Context:
> {noformat}
> go func() {
>     for {
>         select {
>         case event := <-getDispatcher().eventChan:
>             switch v := event.(type) {
>             case events.TaskEvent:
>                 getEventHandler(EventTypeTask)(v) <--- eventually calls Context.getTask()
>             case events.ApplicationEvent:
>                 getEventHandler(EventTypeApp)(v)
>             case events.SchedulerNodeEvent:
>                 getEventHandler(EventTypeNode)(v)
> {noformat}
> Since {{addNode()}} is holding a write lock, the event processing loop gets
> stuck in the task handler and never delivers the {{NodeAccepted}} event, so
> {{registerNodes()}} will never progress.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)