[jira] [Commented] (YUNIKORN-1996) Change a log about queue update failure due to max capacity reached from Warn to Debug

Yongjun Zhang (Jira) Wed, 20 Sep 2023 18:00:17 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767346#comment-17767346
 ]


Yongjun Zhang commented on YUNIKORN-1996:
-----------------------------------------

The repeated log we saw:
{code:java}
 WARN    objects/application.go:1504     queue update failed unexpectedly       
 {"error": "allocation (map[memory:37580963840 pods:1 vcore:2000]) puts queue 
'root.test-queue' over maximum allocation (map[memory:3300011278336 
vcore:390584]), current usage (map[memory:3291983380480 pods:91 
vcore:186000])"}{code}
 

> Change a log about queue update failure due to max capacity reached from Warn 
> to Debug
> --------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1996
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1996
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> We are seeing an issue similar to YUNIKORN-1985:
> A flood of log entries (62k in 3 seconds for the same request) because the max 
> capacity of a queue has been reached:
> {code:java}
>        log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
>          zap.Error(err))
>  {code}
> in
> {code:java}
>  func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation {
> ...
> // everything OK really allocate
> alloc := NewAllocation(common.GetNewUUID(), node.NodeID, ask)
> if node.AddAllocation(alloc) {
>    if err := sa.queue.IncAllocatedResource(alloc.GetAllocatedResource(), 
> false); err != nil {
>       log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
>          zap.Error(err))
>       // revert the node update
>       node.RemoveAllocation(alloc.GetUUID())
>       return nil
>    }{code}
> I strongly suspect this is simply because YuniKorn keeps trying many nodes 
> again and again without being aware that the queue capacity has been 
> exceeded, thus doing unnecessary work (each try at that point is going to 
> fail because the max capacity has been reached).
> This would certainly impact YuniKorn's performance.
> I guess we need to introduce categories of exceptions (MaxQueueCapReached, 
> RequiredNodeUnavailable, etc.) that require a delay before retry, and let the 
> upper stack catch the exception, put the allocation into a queue or 
> something similar, and wait for a certain period of time before retrying.
> But as a first step, we can just change the log to Debug level. Since the UI 
> provides a way to check how much of a given queue's resources are in use and 
> whether the queue has reached its max capacity, we don't lose much diagnostic 
> capability after changing the log to Debug.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

