[
https://issues.apache.org/jira/browse/YUNIKORN-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890085#comment-17890085
]
wangzhihui commented on YUNIKORN-2926:
--------------------------------------
hi [~wilfreds],
I need to add a more detailed explanation to this issue.
Case example:
* The task-group minResource is smaller than the resource requests defined on the containers
{code:java}
yunikorn.apache.org/task-groups: |-
  [{
    "name": "task-group-example",
    "minMember": 2,
    "minResource": {
      "cpu": "100m",
      "memory": "50M"
    },
    ...
  }]
containers:
  - name: sleep30
    image: "alpine:latest"
    command: ["sleep", "99999999"]
    resources:
      requests:
        cpu: "200m"
        memory: "50M" {code}
*Analysis:*
* After all the placeholders have been allocated, the "tryPlaceholderAllocate" logic is triggered.
* "tryPlaceholderAllocate" contains a resource size matching check: if the real allocation is larger than all the placeholders, the application releases all placeholder resources and exits gang scheduling (a worked sketch of this check follows the snippet below).
* The terminationType used here is TerminationType_TIMEOUT.
{code:java}
func (sa *Application) tryPlaceholderAllocate(nodeIterator func() NodeIterator, getNodeFn func(string) *Node) *Allocation {
    ...
    // get all the requests from the app sorted in order
    for _, request := range sa.sortedRequests {
        ...
        phAllocs := sa.getPlaceholderAllocations()
        for _, ph := range phAllocs {
            if ph.IsReleased() || ph.IsPreempted() || request.GetTaskGroup() != ph.GetTaskGroup() {
                continue
            }
            delta := resources.Sub(ph.GetAllocatedResource(), request.GetAllocatedResource())
            if delta.HasNegativeValue() {
                log.Log(log.SchedApplication).Warn("releasing placeholder: real allocation is larger than placeholder",
                    zap.Stringer("requested resource", request.GetAllocatedResource()),
                    zap.String("placeholderID", ph.GetAllocationID()),
                    zap.Stringer("placeholder resource", ph.GetAllocatedResource()))
                // release the placeholder and tell the RM
                ph.SetReleased(true)
                sa.notifyRMAllocationReleased(sa.rmID, []*Allocation{ph}, si.TerminationType_TIMEOUT, "cancel placeholder: resource incompatible")
                sa.appEvents.sendPlaceholderLargerEvent(ph, request)
                continue
            }{code}
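For the case above, the delta check fails like this. This is a minimal, self-contained sketch, not the core code: the Resource type below is a simplified stand-in for the core resources package, with CPU in millicores and memory in bytes.
{code:java}
package main

import "fmt"

// Resource is a simplified stand-in for the core resources.Resource type.
type Resource map[string]int64

// sub mirrors resources.Sub: per-key subtraction, missing keys treated as zero.
func sub(left, right Resource) Resource {
    out := Resource{}
    for k, v := range left {
        out[k] = v - right[k]
    }
    for k, v := range right {
        if _, ok := left[k]; !ok {
            out[k] = -v
        }
    }
    return out
}

// hasNegativeValue mirrors resources.Resource.HasNegativeValue.
func (r Resource) hasNegativeValue() bool {
    for _, v := range r {
        if v < 0 {
            return true
        }
    }
    return false
}

func main() {
    placeholder := Resource{"vcore": 100, "memory": 50000000} // minResource: cpu 100m, memory 50M
    request := Resource{"vcore": 200, "memory": 50000000}     // container request: cpu 200m, memory 50M

    delta := sub(placeholder, request)    // vcore: 100 - 200 = -100
    fmt.Println(delta.hasNegativeValue()) // true: the placeholder is released with TIMEOUT
}
{code}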
* In the "removeAllocationInternal" logic, placeholders released with a releaseType of TIMEOUT are not counted as TimedOut (see the condensed predicate after the snippet below).
{code:java}
func (sa *Application) removeAllocationInternal(allocationID string, releaseType si.TerminationType) *Allocation {
    alloc := sa.allocations[allocationID]
    ...
    // update correct allocation tracker
    if alloc.IsPlaceholder() {
        // make sure we account for the placeholders being removed in the tracking data
        if releaseType == si.TerminationType_STOPPED_BY_RM || releaseType == si.TerminationType_PREEMPTED_BY_SCHEDULER || releaseType == si.TerminationType_UNKNOWN_TERMINATION_TYPE {
            if _, ok := sa.placeholderData[alloc.taskGroupName]; ok {
                sa.placeholderData[alloc.taskGroupName].TimedOut++
            }
        }
{code}
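Condensed into a predicate, the branch above only matches three termination types; the TIMEOUT type used by "tryPlaceholderAllocate" never matches. A minimal sketch, reusing the si enum names from the snippets above:
{code:java}
package main

import (
    "fmt"

    "github.com/apache/yunikorn-scheduler-interface/lib/go/si"
)

// countsAsTimedOut condenses the releaseType branch above into a predicate.
func countsAsTimedOut(releaseType si.TerminationType) bool {
    switch releaseType {
    case si.TerminationType_STOPPED_BY_RM,
        si.TerminationType_PREEMPTED_BY_SCHEDULER,
        si.TerminationType_UNKNOWN_TERMINATION_TYPE:
        return true
    default:
        return false
    }
}

func main() {
    // tryPlaceholderAllocate releases the incompatible placeholder with
    // TerminationType_TIMEOUT, so TimedOut is never incremented.
    fmt.Println(countsAsTimedOut(si.TerminationType_TIMEOUT)) // false
}
{code}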
* The Application "tryAllocate" logic has a "canReplace" check.
{code:java}
func (sa *Application) tryAllocate(headRoom *resources.Resource, allowPreemption bool, preemptionDelay time.Duration, preemptAttemptsRemaining *int, nodeIterator func() NodeIterator, fullNodeIterator func() NodeIterator, getNodeFn func(string) *Node) *Allocation {
    ...
    for _, request := range sa.sortedRequests {
        if request.GetPendingAskRepeat() == 0 {
            continue
        }
        // check if there is a replacement possible
        if sa.canReplace(request) {
            continue
        }
    ... {code}
* Because TerminationType_TIMEOUT is not counted, "canReplace" keeps returning true and "tryAllocate" keeps skipping the real requests, so the application ultimately fails to schedule any resources (a worked sketch of the bookkeeping follows the snippet below).
{code:java}
func (sa *Application) canReplace(request *AllocationAsk) bool {
    // a placeholder or a request without task group can never replace a placeholder
    if request == nil || request.IsPlaceholder() || request.GetTaskGroup() == "" {
        return false
    }
    // get the tracked placeholder data and check if there are still placeholders that can be replaced
    if phData, ok := sa.placeholderData[request.GetTaskGroup()]; ok {
        return phData.Count > (phData.Replaced + phData.TimedOut)
    }
    return false
} {code}
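To make the dead end concrete, here is a minimal, self-contained sketch of the bookkeeping for the case above (PlaceholderData is simplified; only the field names match the core struct):
{code:java}
package main

import "fmt"

// PlaceholderData is a simplified stand-in for the core tracking struct.
type PlaceholderData struct {
    Count    int64 // placeholders created for the task group
    Replaced int64 // placeholders swapped for real allocations
    TimedOut int64 // placeholders counted as timed out
}

// canReplace mirrors the check in Application.canReplace above.
func canReplace(phData PlaceholderData) bool {
    return phData.Count > (phData.Replaced + phData.TimedOut)
}

func main() {
    // The case above: 2 placeholders created, both released with
    // TerminationType_TIMEOUT, which removeAllocationInternal never counts.
    phData := PlaceholderData{Count: 2, Replaced: 0, TimedOut: 0}

    // canReplace stays true forever, so tryAllocate keeps skipping the
    // real requests and the pods stay Pending.
    fmt.Println(canReplace(phData)) // true
}
{code}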
> The Pod using gang scheduling is stuck in the Pending state
> -----------------------------------------------------------
>
> Key: YUNIKORN-2926
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2926
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: wangzhihui
> Priority: Minor
> Fix For: 1.5.0
>
> Attachments: image-2024-10-15-11-54-33-458.png, image.png
>
>
> desc:
> Because the real allocation is larger than all the placeholders, all
> placeholder allocations are released, causing all Pods to stay in the
> Pending state.
> !image-2024-10-15-11-54-33-458.png!
> !image.png!
> {code:java}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: simple-gang-job
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "simple-gang-job"
>         queue: root.default
>       annotations:
>         yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard"
>         yunikorn.apache.org/task-group-name: task-group-example
>         yunikorn.apache.org/task-groups: |-
>           [{
>             "name": "task-group-example",
>             "minMember": 2,
>             "minResource": {
>               "cpu": "100m",
>               "memory": "50M"
>             },
>             "nodeSelector": {},
>             "tolerations": [],
>             "affinity": {},
>             "topologySpreadConstraints": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep30
>           image: "alpine:latest"
>           command: ["sleep", "99999999"]
>           resources:
>             requests:
>               cpu: "200m"
>               memory: "50M" {code}
> solution:
> If the app is in Hard mode, it will transition to a Failing state. If it is
> in Soft mode, it will transition to a Resuming state.