[ https://issues.apache.org/jira/browse/YUNIKORN-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17765435#comment-17765435 ]

Yongjun Zhang commented on YUNIKORN-1985:
-----------------------------------------

We are seeing a similar issue:

Tons of log lines (62k of them in 3 seconds for the same request) because the max
capacity of a queue has been reached:
{code:java}
       log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
         zap.Error(err))
 {code}
in
{code:java}
func (sa *Application) tryNode(node *Node, ask *AllocationAsk) *Allocation {
    ...
    // everything OK really allocate
    alloc := NewAllocation(common.GetNewUUID(), node.NodeID, ask)
    if node.AddAllocation(alloc) {
        if err := sa.queue.IncAllocatedResource(alloc.GetAllocatedResource(), false); err != nil {
            log.Log(log.SchedApplication).Warn("queue update failed unexpectedly",
                zap.Error(err))
            // revert the node update
            node.RemoveAllocation(alloc.GetUUID())
            return nil
        }
        ...
{code}
I strongly suspect it's simply because YuniKorn is trying a lot of nodes again
and again without being aware that the queue capacity has been exceeded, thus doing
unnecessary work: every attempt at that point is bound to fail because the max
capacity has been reached.

This certainly impacts YuniKorn's performance.

I guess we need to introduce categories of errors (MaxQueueCapReached,
RequiredNodeUnavailable, etc.) that require a delay before retry, and let the upper
layers of the stack catch the error, put the allocation into a queue or something
similar, and wait for a certain period of time before retrying.

Thanks.

 

> possible log spew in application object in tryAllocate()
> --------------------------------------------------------
>
>                 Key: YUNIKORN-1985
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1985
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Kuan-Po Tseng
>            Priority: Minor
>              Labels: newbie
>
> If a pod has a required node and cannot be scheduled we keep on attempting 
> the pod placement and cause a log spew:
> {code:java}
> 2023-06-15T10:42:43.546Z    WARN    objects/application.go:993    required 
> node is not found (could be transient)    {"application ID": 
> "yunikorn-ag-ecp12s-1-autogen", "allocationKey": 
> "b51b0a36-214a-4a12-a285-d4319f4b5254", "required node": "worker-31.XXXX"} 
> {code}
> This can happen if a daemonset pod is created and the node is not in the 
> right state.
> We should move this to debug or have a rate limit on how often we log this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
