[
https://issues.apache.org/jira/browse/YUNIKORN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yongjun Zhang closed YUNIKORN-1996.
-----------------------------------
Resolution: Invalid
> Change a log about queue update failure due to max capacity reached from Warn
> to Debug
> --------------------------------------------------------------------------------------
>
> Key: YUNIKORN-1996
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1996
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: core - scheduler
> Reporter: Yongjun Zhang
> Assignee: Yongjun Zhang
> Priority: Major
> Labels: pull-request-available
>
> We are seeing an issue similar to YUNIKORN-1985:
> a flood of log entries (62k of them in 3 seconds for the same request) emitted
> because the max capacity of a queue has been reached,
> {code:go}
> log.Log(log.SchedApplication).Warn("queue update failed unexpectedly", zap.Error(err))
> {code}
> in
> {code:go}
> func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation {
>     ...
>     // everything OK really allocate
>     alloc := NewAllocation(common.GetNewUUID(), node.NodeID, ask)
>     if node.AddAllocation(alloc) {
>         if err := sa.queue.IncAllocatedResource(alloc.GetAllocatedResource(), false); err != nil {
>             log.Log(log.SchedApplication).Warn("queue update failed unexpectedly", zap.Error(err))
>             // revert the node update
>             node.RemoveAllocation(alloc.GetUUID())
>             return nil
>         }
> {code}
> I strongly suspect this is simply because YuniKorn tries many nodes
> again and again without being aware that the queue capacity has been exceeded,
> thus doing unnecessary work (each try at that point is going to fail because
> the max capacity has been reached).
> This would certainly impact YuniKorn’s performance.
> I guess we need to introduce categories of exceptions (MaxQueueCapReached,
> RequiredNodeUnavailable, etc.) that require a delay before retry, and let the
> upper stack catch the exception, put the allocation into a queue or
> something similar, and wait for a certain period of time before retrying.
> But as a first step, we can just change the log to Debug level. Since the UI
> provides a way to check how much resource a given queue is using, and whether
> it has reached its max capacity, we don't lose much diagnostic capability
> after changing the log to Debug.
>