[jira] [Commented] (YUNIKORN-2926) The Pod using gang scheduling is stuck in the Pending state

wangzhihui (Jira) Wed, 16 Oct 2024 06:35:05 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890085#comment-17890085
 ]


wangzhihui commented on YUNIKORN-2926:
--------------------------------------

Hi [~wilfreds],
I would like to add a more detailed explanation to this issue.

Example case:
 * The task-group minResource is smaller than the resources requested by the containers (cpu 100m versus 200m):
{code:java}
yunikorn.apache.org/task-groups: |-
          [{
              "name": "task-group-example",
              "minMember": 2,
              "minResource": {
                "cpu": "100m",
                "memory": "50M"
              },
              ...
          }]
containers:
        - name: sleep30
          image: "alpine:latest"
          command: ["sleep", "99999999"]
          resources:
            requests:
              cpu: "200m"
              memory: "50M" {code}

*Analysis:*
 * Once all the placeholders have been allocated, the "tryPlaceholderAllocate" logic is triggered.
 * "tryPlaceholderAllocate" contains a resource size check. If the real allocation is larger than every placeholder in its task group, the application releases all of those placeholders and exits gang scheduling.
 * The terminationType used for this release is TerminationType_TIMEOUT.

{code:java}
func (sa *Application) tryPlaceholderAllocate(nodeIterator func() NodeIterator, getNodeFn func(string) *Node) *Allocation {
    ...
    // get all the requests from the app sorted in order
    for _, request := range sa.sortedRequests {
        ...
        phAllocs := sa.getPlaceholderAllocations()
        for _, ph := range phAllocs {
            if ph.IsReleased() || ph.IsPreempted() || request.GetTaskGroup() != ph.GetTaskGroup() {
                continue
            }
            delta := resources.Sub(ph.GetAllocatedResource(), request.GetAllocatedResource())
            if delta.HasNegativeValue() {
                log.Log(log.SchedApplication).Warn("releasing placeholder: real allocation is larger than placeholder",
                    zap.Stringer("requested resource", request.GetAllocatedResource()),
                    zap.String("placeholderID", ph.GetAllocationID()),
                    zap.Stringer("placeholder resource", ph.GetAllocatedResource()))
                // release the placeholder and tell the RM
                ph.SetReleased(true)
                sa.notifyRMAllocationReleased(sa.rmID, []*Allocation{ph}, si.TerminationType_TIMEOUT, "cancel placeholder: resource incompatible")
                sa.appEvents.sendPlaceholderLargerEvent(ph, request)
                continue
            }{code}
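
To make the size check concrete, here is a minimal standalone sketch of the subtraction applied to the example values above. The Resource type and the sub and hasNegativeValue helpers are simplified stand-ins written for this comment, not the scheduler's real resources package, and the units (millicores, bytes) are an assumption.

```go
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's resource type:
// a map from resource name to quantity (cpu in millicores, memory in bytes).
type Resource map[string]int64

// sub returns left - right per resource name, treating a missing entry as 0.
func sub(left, right Resource) Resource {
	out := Resource{}
	for name, quantity := range left {
		out[name] = quantity
	}
	for name, quantity := range right {
		out[name] -= quantity
	}
	return out
}

// hasNegativeValue reports whether any quantity in r is below zero.
func hasNegativeValue(r Resource) bool {
	for _, quantity := range r {
		if quantity < 0 {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder size comes from the task-group minResource in the example.
	placeholder := Resource{"cpu": 100, "memory": 50000000}
	// Real request size comes from the container resource requests.
	request := Resource{"cpu": 200, "memory": 50000000}

	delta := sub(placeholder, request)
	// cpu is 100 - 200 = -100, so the placeholder would be released.
	fmt.Println(delta["cpu"], delta["memory"], hasNegativeValue(delta))
}
```

With the values from the example, the cpu delta is negative, which is the condition that sends the code down the release path.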
 
 * In the "removeAllocationInternal" logic, placeholders released with a releaseType of TIMEOUT are not counted in the TimedOut tracking data.

 
{code:java}
func (sa *Application) removeAllocationInternal(allocationID string, releaseType si.TerminationType) *Allocation {
    alloc := sa.allocations[allocationID]
    ...
    // update correct allocation tracker
    if alloc.IsPlaceholder() {
        // make sure we account for the placeholders being removed in the tracking data
        if releaseType == si.TerminationType_STOPPED_BY_RM || releaseType == si.TerminationType_PREEMPTED_BY_SCHEDULER || releaseType == si.TerminationType_UNKNOWN_TERMINATION_TYPE {
            if _, ok := sa.placeholderData[alloc.taskGroupName]; ok {
                sa.placeholderData[alloc.taskGroupName].TimedOut++
            }
        }
        {code}
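
The effect of that condition on the counter can be sketched in isolation. The TerminationType constants, the placeholderData struct and the releasePlaceholders helper below are hypothetical simplifications that only mirror the names in the excerpt; the enum values are illustrative and not taken from the scheduler interface.

```go
package main

import "fmt"

// TerminationType mirrors the names used in the excerpt; values are illustrative.
type TerminationType int

const (
	UnknownTerminationType TerminationType = iota
	StoppedByRM
	Timeout
	PreemptedByScheduler
)

// placeholderData is a simplified version of the per-task-group tracking data.
type placeholderData struct {
	Count    int
	Replaced int
	TimedOut int
}

// countedAsTimedOut reproduces the condition from the excerpt:
// only these three release types increment TimedOut.
func countedAsTimedOut(t TerminationType) bool {
	return t == StoppedByRM || t == PreemptedByScheduler || t == UnknownTerminationType
}

// releasePlaceholders removes n placeholders with the given release type
// and updates the tracking data the same way the excerpt does.
func releasePlaceholders(d *placeholderData, n int, t TerminationType) {
	for i := 0; i < n; i++ {
		if countedAsTimedOut(t) {
			d.TimedOut++
		}
	}
}

func main() {
	d := &placeholderData{Count: 2}
	// Both placeholders are released with TIMEOUT, as in tryPlaceholderAllocate.
	releasePlaceholders(d, 2, Timeout)
	fmt.Println(d.Count, d.Replaced, d.TimedOut)
}
```

After both placeholders are released with TIMEOUT, TimedOut is still 0 while Count is 2.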
 

 
 * The Application "tryAllocate" logic has a "canReplace" check: a request that can still replace a placeholder is skipped by normal allocation.

 
{code:java}
func (sa *Application) tryAllocate(headRoom *resources.Resource, allowPreemption bool, preemptionDelay time.Duration, preemptAttemptsRemaining *int, nodeIterator func() NodeIterator, fullNodeIterator func() NodeIterator, getNodeFn func(string) *Node) *Allocation {
    ...
    for _, request := range sa.sortedRequests {
        if request.GetPendingAskRepeat() == 0 {
            continue
        }
        // check if there is a replacement possible
        if sa.canReplace(request) {
            continue
        }
    ... {code}
 
 * Because placeholders released with TerminationType_TIMEOUT are not counted, "canReplace" keeps returning true, "tryAllocate" skips the request on every cycle, and the application is never scheduled.

{code:java}
func (sa *Application) canReplace(request *AllocationAsk) bool {
    // a placeholder or a request without task group can never replace a placeholder
    if request == nil || request.IsPlaceholder() || request.GetTaskGroup() == "" {
        return false
    }
    // get the tracked placeholder data and check if there are still placeholder that can be replaced
    if phData, ok := sa.placeholderData[request.GetTaskGroup()]; ok {
        return phData.Count > (phData.Replaced + phData.TimedOut)
    }
    return false
} {code}
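
Putting the two checks together, the following sketch shows why the real requests are never considered. The request and placeholderData types and the schedulable helper are hypothetical simplifications; only the comparison inside canReplace is taken from the excerpt above.

```go
package main

import "fmt"

// placeholderData is a simplified version of the per-task-group tracking data.
type placeholderData struct {
	Count    int64
	Replaced int64
	TimedOut int64
}

// request is a simplified stand-in for an allocation ask.
type request struct {
	taskGroup   string
	placeholder bool
}

// canReplace applies the same comparison as the excerpt.
func canReplace(data map[string]*placeholderData, r *request) bool {
	if r == nil || r.placeholder || r.taskGroup == "" {
		return false
	}
	if ph, ok := data[r.taskGroup]; ok {
		return ph.Count > ph.Replaced+ph.TimedOut
	}
	return false
}

// schedulable returns the requests that a tryAllocate-style loop would not skip.
func schedulable(data map[string]*placeholderData, requests []*request) []*request {
	var out []*request
	for _, r := range requests {
		if canReplace(data, r) {
			continue // left for placeholder replacement
		}
		out = append(out, r)
	}
	return out
}

func main() {
	requests := []*request{
		{taskGroup: "task-group-example"},
		{taskGroup: "task-group-example"},
	}
	// State after both placeholders were released with TIMEOUT: nothing counted.
	stuck := map[string]*placeholderData{"task-group-example": {Count: 2}}
	// State if the same releases had been counted in TimedOut.
	counted := map[string]*placeholderData{"task-group-example": {Count: 2, TimedOut: 2}}

	fmt.Println(len(schedulable(stuck, requests)), len(schedulable(counted, requests)))
}
```

In the first state both requests are skipped on every pass even though no placeholder is left to replace; in the second state both fall through to normal allocation.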

> The Pod using gang scheduling is stuck in the Pending state
> -----------------------------------------------------------
>
>                 Key: YUNIKORN-2926
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2926
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: wangzhihui
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: image-2024-10-15-11-54-33-458.png, image.png
>
>
> Description:
>  The real allocation is larger than all placeholders, so all placeholder allocations are released. As a result, all Pods stay in the Pending state.
> !image-2024-10-15-11-54-33-458.png!
> !image.png!
> {code:java}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: simple-gang-job
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "simple-gang-job"
>         queue: root.default
>       annotations:
>         yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard"
>         yunikorn.apache.org/task-group-name: task-group-example
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "task-group-example",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "50M"
>               },
>               "nodeSelector": {},
>               "tolerations": [],
>               "affinity": {},
>               "topologySpreadConstraints": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep30
>           image: "alpine:latest"
>           command: ["sleep", "99999999"]
>           resources:
>             requests:
>               cpu: "200m"
>               memory: "50M" {code}
> Solution:
> If the app is in Hard mode, it will transition to a Failing state. If it is in Soft mode, it will transition to a Resuming state.



