[
https://issues.apache.org/jira/browse/YUNIKORN-936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448620#comment-17448620
]
Peter Bacsko commented on YUNIKORN-936:
---------------------------------------
[~wilfreds] [~yuchaoran2011]
I untangled what happens during recovery. Not saying it was easy... :) But it
was worth it!
{noformat}
KubernetesShim.recoverSchedulerState() [shim]
AppManagementService.waitForRecovery() [shim]
AppManagementService.recoverApps() [shim] <for each app>
Application.TriggerAppRecovery() [shim]
Application.handleRecoverApplicationEvent() [shim]
RMProxy.UpdateApplication() [core]
Scheduler.handleRMEvent() [core]
ClusterContext.handleRMUpdateApplicationEvent() [core]
RMProxy.handleRMEvents() [core]
RMProxy.processApplicationUpdateEvent() [core]
AsyncRMCallback.UpdateApplication() [shim]
...
{noformat}
A lot of event handling happens on separate goroutines, so in
{{AppManagementService.waitForRecovery()}} we return earlier from
{{recoverApps()}}. Then we call {{AppManagementService.waitForAppRecovery()}},
which waits until all applications transition into {{Accepted}} state.
Only at this point we go ahead and perform node recovery.
{noformat}
KubernetesShim.recoverSchedulerState() [shim]
Context.WaitForRecovery() [shim]
Context.recover() [shim]
utils.WaitForCondition() [shim]
<anon function> [shim]
<dispatcher call> [shim]
SchedulerNode.handle() [shim]
SchedulerNode.handleNodeRecovery() [shim]
RMProxy.UpdateNode() [core]
... <skip some call sites>
PartitionContext.AddNode() [core]
PartitionContext.AddAllocation() [core]
{noformat}
In the unit test, I could reproduce the issue when I removed the line
{{ms.mockRM.waitForAcceptedApplication(t, appID1, 1000)}} and the problem
occurred in {{AddAllocation()}} - at that point, the application wasn't stored
yet. But as far as I can see it, this cannot happen during the normal
production code path, because by the time we get there, the application has
been added by {{ClusterContext.handleRMUpdateApplicationEvent()}}.
> app and node recovery event ordering
> ------------------------------------
>
> Key: YUNIKORN-936
> URL: https://issues.apache.org/jira/browse/YUNIKORN-936
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - common
> Reporter: Wilfred Spiegelenburg
> Assignee: Peter Bacsko
> Priority: Major
>
> While working on YUNIKORN-905 a number of unit tests failed due to event
> ordering. Looking at the change we might have had an issue in the RMProxy for
> a long time.
> An update request could contain apps, asks and nodes. Processing was ordered
> like that too. During recovery the order was/is important. There was never an
> order requirement on the events send by a shim or a use of complex updates
> events to support this ordering by the shim.
> An event to recover a node could be a separate UpdateRequest from the
> applications that should be recovered. That means we relied on the go routine
> and event ordering to hopefully do things correctly: i.e. events send by the
> shim to create new apps would be processed before node recovery started. Even
> in the previous implementation there was no guarantee that all the
> application were added before a node was recovered. The unit tests in the
> core used the order processing dependency to make sure it worked.
> That is not the real world scenario. and thus a dangerous assumption.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]