[
https://issues.apache.org/jira/browse/YUNIKORN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767346#comment-17767346
]
Yongjun Zhang commented on YUNIKORN-1996:
-----------------------------------------
The repeated log we saw:
{code:java}
WARN objects/application.go:1504 queue update failed unexpectedly
{"error": "allocation (map[memory:37580963840 pods:1 vcore:2000]) puts queue
'root.test-queue' over maximum allocation (map[memory:3300011278336
vcore:390584]), current usage (map[memory:3291983380480 pods:91
vcore:186000])"}{code}
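To make the numbers in that log line concrete: the queue has 3300011278336 - 3291983380480 = 8027897856 bytes of memory headroom left, while the ask needs 37580963840 bytes, so memory is the only resource that goes over the limit. Below is a minimal sketch of that check. The `Resource` type and the `exceeded` helper are hypothetical simplifications for illustration, not YuniKorn's actual types, and treating a resource that is missing from the maximum as unlimited is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// Resource is a simplified stand-in for a scheduler resource map
// (hypothetical; not YuniKorn's actual resource type).
type Resource map[string]int64

// exceeded returns, in sorted order, the resource names for which
// usage + ask would go above max. Names that are absent from max are
// treated as unlimited (an assumption made for this sketch).
func exceeded(usage, ask, max Resource) []string {
	var over []string
	for name, limit := range max {
		if usage[name]+ask[name] > limit {
			over = append(over, name)
		}
	}
	sort.Strings(over)
	return over
}

func main() {
	// Values copied from the log line above.
	ask := Resource{"memory": 37580963840, "pods": 1, "vcore": 2000}
	max := Resource{"memory": 3300011278336, "vcore": 390584}
	usage := Resource{"memory": 3291983380480, "pods": 91, "vcore": 186000}

	fmt.Println(exceeded(usage, ask, max)) // prints [memory]
}
```

A check like this only needs to run once per ask, which is why retrying the same ask on every node cannot change the outcome while the queue usage stays the same.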
> Change a log about queue update failure due to max capacity reached from Warn
> to Debug
> --------------------------------------------------------------------------------------
>
> Key: YUNIKORN-1996
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1996
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Priority: Major
> Labels: pull-request-available
>
> We are seeing an issue similar to YUNIKORN-1985:
> a flood of log messages (62k of them in 3 seconds for the same request),
> emitted because a queue has reached its max capacity:
> {code:java}
> log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
> zap.Error(err))
> {code}
> in
> {code:java}
> func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation {
>     ...
>     // everything OK really allocate
>     alloc := NewAllocation(common.GetNewUUID(), node.NodeID, ask)
>     if node.AddAllocation(alloc) {
>         if err := sa.queue.IncAllocatedResource(alloc.GetAllocatedResource(),
>             false); err != nil {
>             log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
>                 zap.Error(err))
>             // revert the node update
>             node.RemoveAllocation(alloc.GetUUID())
>             return nil
>         }{code}
> I strongly suspect this is simply because YuniKorn keeps trying a lot of nodes
> again and again, without being aware that the queue's max capacity has been
> reached, thus doing unnecessary work (every attempt during that period is
> going to fail because the max capacity has been reached).
> This would certainly impact YuniKorn's performance.
> I guess we need to introduce categories of exceptions (MaxQueueCapReached,
> RequiredNodeUnavailable, etc.) that require a delay before retry, and let the
> upper stack catch the exception, put the allocation into a queue or
> something similar, and wait for a certain period of time before retrying.
> But as a first step, we can just change the log to Debug level. Since the UI
> provides a way to check how much resource a given queue is using, and whether
> it has reached its max capacity, we don't lose much diagnostic capability
> after changing the log to Debug.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)