[ 
https://issues.apache.org/jira/browse/YUNIKORN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YUNIKORN-2766:
-----------------------------------
    Description: 
Right now, we send an event to the pod if a predicate failed:

{noformat}
if err := plugin.Predicates(&si.PredicatesArgs{
        AllocationKey: allocationKey,
        NodeID:        sn.NodeID,
        Allocate:      allocate,
}); err != nil {
        log.Log(log.SchedNode).Debug("running predicates failed",
                zap.String("allocationKey", allocationKey),
                zap.String("nodeID", sn.NodeID),
                zap.Bool("allocateFlag", allocate),
                zap.Error(err))
        // running predicates failed
        msg := err.Error()
        ask.LogAllocationFailure(msg, allocate)
        ask.SendPredicateFailedEvent(msg)
        return false
}
{noformat}

This is, however, not correct. We should only generate an event if the 
predicates have failed on *all* nodes, which means that the pod cannot be 
scheduled. A failing predicate for a given node can be perfectly normal in many 
cases.

Instead, we should aggregate the failed predicates and send an event like:

{noformat}
All predicates failed for request '345d70d7-243a-4077-a9f8-0bb76c3532d7': 
node(s) didn't match Pod's node affinity/selector (20x), node(s) had taints 
that the pod didn't tolerate (5x)
{noformat}

where 20x and 5x indicate how many times each predicate failed.

  was:
Right now, we send an event to the pod if a predicate failed:

{noformat}
if err := plugin.Predicates(&si.PredicatesArgs{
                        AllocationKey: allocationKey,
                        NodeID:        sn.NodeID,
                        Allocate:      allocate,
                }); err != nil {
                        log.Log(log.SchedNode).Debug("running predicates failed",
                                zap.String("allocationKey", allocationKey),
                                zap.String("nodeID", sn.NodeID),
                                zap.Bool("allocateFlag", allocate),
                                zap.Error(err))
                        // running predicates failed
                        msg := err.Error()
                        ask.LogAllocationFailure(msg, allocate)
                        ask.SendPredicateFailedEvent(msg)
                        return false
                }
{noformat}

This is, however, not correct. We should only generate an event if *all* 
predicates have failed, which means that the pod cannot be scheduled. A failing 
predicate for a given node can be perfectly normal in many cases.

Instead, we should aggregate the failed predicates and send an event like:

{noformat}
All predicates failed for request '345d70d7-243a-4077-a9f8-0bb76c3532d7': 
node(s) didn't match Pod's node affinity/selector (20x), node(s) had taints 
that the pod didn't tolerate (5x)
{noformat}

where 20x and 5x tell how many times a certain predicate failed.


> Only generate event if all predicates failed
> --------------------------------------------
>
>                 Key: YUNIKORN-2766
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2766
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
