Yongjun Zhang created YUNIKORN-1996:
---------------------------------------

             Summary: Change a log about queue update failure due to max 
capacity reached from Warn to Debug
                 Key: YUNIKORN-1996
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1996
             Project: Apache YuniKorn
          Issue Type: Improvement
          Components: core - scheduler
            Reporter: Yongjun Zhang
            Assignee: Yongjun Zhang


We are seeing an issue similar to YUNIKORN-1985:

Tons of log entries (62k of them in 3 seconds for the same request) are emitted 
because the max capacity of a queue has been reached:
{code:java}
       log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
         zap.Error(err))
 {code}
in
{code:java}
 func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation {
...
// everything OK really allocate
alloc := NewAllocation(common.GetNewUUID(), node.NodeID, ask)
if node.AddAllocation(alloc) {
   if err := sa.queue.IncAllocatedResource(alloc.GetAllocatedResource(), false); err != nil {
      log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
         zap.Error(err))
      // revert the node update
      node.RemoveAllocation(alloc.GetUUID())
      return nil
   }{code}
I strongly suspect it’s simply because YuniKorn keeps trying many nodes again 
and again without being aware that the queue capacity has been reached, thus 
doing unnecessary work (each try at that point is going to fail because the max 
capacity has been reached).

This would certainly impact YuniKorn’s performance.

I guess we need to introduce a category of exceptions (MaxQueueCapReached, 
RequiredNodeUnavailable, etc.) that require a delay before retry, and let the 
upper layers of the stack catch the exception, put the allocation into a queue 
or something similar, and wait for a certain period of time before retrying.

But as a first step, we can just change the log to Debug level. Since the UI 
provides a way to check how much resource a given queue is using, and whether 
it has reached its max capacity, we don't lose much diagnostic capability 
after changing the log to Debug.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

