Re: [PR] [YUNIKORN-2837] Log & Send Events, Improve logging [yunikorn-core]

via GitHub Wed, 04 Sep 2024 09:17:40 -0700


pbacsko commented on code in PR #957:
URL: https://github.com/apache/yunikorn-core/pull/957#discussion_r1744077986



##########
pkg/scheduler/objects/allocation.go:
##########
@@ -565,3 +567,39 @@ func (a *Allocation) setUserQuotaCheckPassed() {
                a.askEvents.SendRequestFitsInUserQuota(a.allocationKey, 
a.applicationID, a.allocatedResource)
        }
 }
+
+func (a *Allocation) setPreemptionPreConditionsCheckPassed() {
+       a.Lock()
+       defer a.Unlock()
+       if a.preemptionPreConditionsCheckFailed {
+               a.preemptionPreConditionsCheckFailed = false
+               
a.askEvents.SendPreemptionPreConditionsCheckPassed(a.allocationKey, 
a.applicationID, a.allocatedResource)
+       }
+}
+
+func (a *Allocation) setPreemptionPreConditionsCheckFailed() {
+       a.Lock()
+       defer a.Unlock()
+       if !a.preemptionPreConditionsCheckFailed {
+               a.preemptionPreConditionsCheckFailed = true
+               
a.askEvents.SendPreemptionPreConditionsCheckFailed(a.allocationKey, 
a.applicationID, a.allocatedResource)
+       }
+}
+
+func (a *Allocation) setPreemptionQueueGuaranteesCheckPassed() {
+       a.Lock()
+       defer a.Unlock()
+       if a.preemptionQueueGuaranteesCheckFailed {
+               a.preemptionQueueGuaranteesCheckFailed = false
+               
a.askEvents.SendPreemptionQueueGuaranteesCheckPassed(a.allocationKey, 
a.applicationID, a.allocatedResource)
+       }
+}
+
+func (a *Allocation) setPreemptionQueueGuaranteesCheckFailed() {
+       a.Lock()
+       defer a.Unlock()
+       if !a.preemptionQueueGuaranteesCheckFailed {
+               a.preemptionQueueGuaranteesCheckFailed = true
+               
a.askEvents.SendPreemptionQueueGuaranteesCheckFailed(a.allocationKey, 
a.applicationID, a.allocatedResource)
+       }
+}

Review Comment:
   I think an acceptable approach is recording stuff like the allocation log 
for each request. That looks reasonable to me. We can have a type like 
`AllocationLog` and we define fixed causes about why the preemption failed:
   * "Preconditions checks failed" (`preemptor.CheckPreconditions()` false)
   * "No guarantee to free up resources" (`p.checkPreemptionQueueGuarantees()` 
false)
   * "Preemption shortfall" 
(`p.ask.GetAllocatedResource().StrictlyGreaterThanOnlyExisting()` false)
   
   Every time something fails, you increase a counter for it. You also count 
the number of total attempts.
   
   So I'm thinking sth like:
   ```
   // Mutable fields which need protection
   allocated           bool
   allocLog            map[string]*AllocationLogEntry
   preemptionLog map[string]*AllocationLogEntry  // new field
   preemptionAttempts int64  // new field
   preemptionTriggered bool
   preemptCheckTime    time.Time
   ...
   
   var (
     ErrPreemptionPrecondFailed = errors.New("Preconditions checks failed")
     ErrPreemptionNoGuarantee = errors.New("No guarantee to free up resources")
     ErrPreemptionShortfall = errors.New("Preemption shortfall")
   )
   
   ```
   Then in `application.go`:
   ```
                                if fullIterator != nil {
                                           request.IncPreemptionAttempts()
                                        if result, err := 
sa.tryPreemption(headRoom, preemptionDelay, request, fullIterator, false); err 
!= nil {
                                                // preemption occurred, and 
possibly reservation
                                                return result
                                        }
                                        request.LogPreemptionFailure(err) 
                                }
                        }
   ```
   
   Obviously you need to make sure that problems propagate back so you can log 
it. Then this whole thing can be exposed on the REST API just like the 
allocation log.
   
   It's obviously not as detailed as a separate log entry printed to the 
stdout, but I see this as a good compromise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [YUNIKORN-2837] Log & Send Events, Improve logging [yunikorn-core]

Reply via email to