[ https://issues.apache.org/jira/browse/YUNIKORN-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805845#comment-17805845 ]
Peter Bacsko commented on YUNIKORN-2322:
----------------------------------------
Look at the stack traces. In the high-latency case, you go from
{{Application.tryNode()}} to {{IncAllocatedResource()}} (not fully visible, but
it can't be anything else) and then to {{fmt.Errorf()}}. This can only happen if
{{Queue.IncAllocatedResource()}} returns an error. If it returns an error, you
should definitely see a warning in the logs, unless you guys set the logging
level to ERROR.
{noformat}
if node.AddAllocation(alloc) {
    if err := sa.queue.IncAllocatedResource(alloc.GetAllocatedResource(), false); err != nil {
        log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
            zap.Error(err))
        // revert the node update
        node.RemoveAllocation(alloc.GetAllocationID())
        return nil
    }
{noformat}
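To make the failure path concrete, here is a minimal, self-contained sketch of the same pattern. It is not YuniKorn code: the {{queue}}, {{node}} and {{tryNode}} names below are hypothetical stand-ins, resources are reduced to a single integer, and the error is returned to the caller instead of being logged as a warning. It only illustrates that the {{fmt.Errorf()}} call is reached solely when the queue increment is rejected, and that the node update is then reverted.

```go
package main

import (
	"errors"
	"fmt"
)

// queue is a hypothetical, heavily simplified stand-in for a scheduler queue.
type queue struct {
	allocated int
	max       int
}

// incAllocated fails (building the error with fmt.Errorf) when the
// increment would push the queue over its maximum.
func (q *queue) incAllocated(delta int) error {
	if q.allocated+delta > q.max {
		return fmt.Errorf("allocation (%d) puts queue over maximum (%d), current usage (%d)",
			delta, q.max, q.allocated)
	}
	q.allocated += delta
	return nil
}

// node is a hypothetical stand-in that tracks allocations by ID.
type node struct {
	allocs map[string]int
}

// add registers an allocation; it returns false if the ID is already present.
func (n *node) add(id string, size int) bool {
	if _, exists := n.allocs[id]; exists {
		return false
	}
	n.allocs[id] = size
	return true
}

// remove drops an allocation from the node.
func (n *node) remove(id string) {
	delete(n.allocs, id)
}

// tryNode mirrors the shape of the snippet above: update the node first,
// then the queue, and revert the node update if the queue update fails.
func tryNode(n *node, q *queue, id string, size int) error {
	if !n.add(id, size) {
		return errors.New("node rejected allocation")
	}
	if err := q.incAllocated(size); err != nil {
		// revert the node update so node and queue stay consistent
		n.remove(id)
		return fmt.Errorf("queue update failed unexpectedly: %w", err)
	}
	return nil
}

func main() {
	n := &node{allocs: map[string]int{}}
	q := &queue{max: 10}

	fmt.Println("first attempt: ", tryNode(n, q, "alloc-1", 8))
	fmt.Println("second attempt:", tryNode(n, q, "alloc-2", 8))
	fmt.Println("allocations on node:", len(n.allocs), "queue usage:", q.allocated)
}
```

In this sketch the second attempt fails because the queue limit would be exceeded, so the node ends up with only the first allocation. If a path like this is hit on every scheduling cycle, the error construction itself shows up in profiles, which is consistent with what the stack traces suggest.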
> Investigate YuniKorn stuck when scheduling latency is high
> ----------------------------------------------------------
>
> Key: YUNIKORN-2322
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2322
> Project: Apache YuniKorn
> Issue Type: Task
> Components: core - common
> Reporter: Rainie Li
> Assignee: Rainie Li
> Priority: Major
> Attachments: Screenshot 2024-01-10 at 4.31.52 PM.png, Screenshot
> 2024-01-10 at 4.33.40 PM.png, Screenshot 2024-01-11 at 3.40.48 PM-1.png,
> Screenshot 2024-01-11 at 3.40.48 PM.png
>
>
> We are seeing the service get stuck when latency increases: even when the
> cluster has resources, YuniKorn is not able to schedule apps. We have to
> manually restart YuniKorn.
> We did profiling and found that most of the time is spent in *tryReservedAllocate*.
> Attached are the profiling screenshot and service latency data.