[ https://issues.apache.org/jira/browse/YUNIKORN-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bacsko updated YUNIKORN-2550:
-----------------------------------
    Description: 
Possible deadlock was detected:

{noformat}
placement.(*AppPlacementManager).initialise { m.Lock() } <<<<<
placement.(*AppPlacementManager).initialise { } }
placement.(*AppPlacementManager).UpdateRules { log.Log(log.Config).Info("Building new rule list for placement manager") }
scheduler.(*PartitionContext).updatePartitionDetails { err := pc.placementManager.UpdateRules(conf.PlacementRules) }
scheduler.(*ClusterContext).updateSchedulerConfig { err = part.updatePartitionDetails(p) }
scheduler.(*ClusterContext).processRMConfigUpdateEvent { err = cc.updateSchedulerConfig(conf, rmID) }
scheduler.(*Scheduler).handleRMEvent { case *rmevent.RMConfigUpdateEvent: }

scheduler.(*PartitionContext).GetQueue { pc.RLock() } <<<<<
scheduler.(*PartitionContext).GetQueue { func (pc *PartitionContext) GetQueue(name string) *objects.Queue { }
placement.(*providedRule).placeApplication { // if we cannot create the queue must exist }
placement.(*AppPlacementManager).PlaceApplication { queueName, err = checkRule.placeApplication(app, m.queueFn) }
scheduler.(*PartitionContext).AddApplication { err := pc.getPlacementManager().PlaceApplication(app) }
scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
{noformat}

The two call paths take the {{PartitionContext}} and {{AppPlacementManager}} locks in a different order, which can deadlock when they run concurrently.

  was:
Possible deadlock was detected:

{noformat}
~/repos/yunikorn-core/pkg/scheduler/partition.go:448 scheduler.(*PartitionContext).GetQueue { pc.RLock() } <<<<<
~/repos/yunikorn-core/pkg/scheduler/partition.go:447 scheduler.(*PartitionContext).GetQueue { func (pc *PartitionContext) GetQueue(name string) *objects.Queue { }
~/repos/yunikorn-core/pkg/scheduler/placement/provided_rule.go:107 placement.(*providedRule).placeApplication { // if we cannot create the queue must exist }
~/repos/yunikorn-core/pkg/scheduler/placement/placement.go:125 placement.(*AppPlacementManager).PlaceApplication { queueName, err = checkRule.placeApplication(app, m.queueFn) }
~/repos/yunikorn-core/pkg/scheduler/partition.go:309 scheduler.(*PartitionContext).AddApplication { err := pc.getPlacementManager().PlaceApplication(app) }
~/repos/yunikorn-core/pkg/scheduler/context.go:523 scheduler.(*ClusterContext).handleRMUpdateApplicationEvent { schedApp := objects.NewApplication(app, ugi, cc.rmEventHandler, request.RmID) }
~/repos/yunikorn-core/pkg/scheduler/scheduler.go:130 scheduler.(*Scheduler).handleRMEvent { case ev := <-s.pendingEvents: }
{noformat}

Lock order is different between {{PartitionContext}} and {{AppPlacementManager}}.


> Fix locking in PartitionContext
> -------------------------------
>
>                 Key: YUNIKORN-2550
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2550
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: core - common
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
