Peter Bacsko created YUNIKORN-2520:
--------------------------------------
Summary: PVC errors in AssumePod() is not handled properly
Key: YUNIKORN-2520
URL: https://issues.apache.org/jira/browse/YUNIKORN-2520
Project: Apache YuniKorn
Issue Type: Bug
Components: shim - kubernetes
Reporter: Peter Bacsko
When a volume operation in {{AssumePod()}} fails, the allocation on the core
side is not removed.
Although we check the result of UpdateAllocation, the error handling only logs
the error:
{noformat}
if err := callback.UpdateAllocation(response); err != nil {
    rmp.handleUpdateResponseError(rmID, err)
}
...
func (rmp *RMProxy) handleUpdateResponseError(rmID string, err error) {
    log.Log(log.RMProxy).Error("failed to handle response",
        zap.String("rmID", rmID),
        zap.Error(err))
}{noformat}
I suggest moving the volume-related code to {{Task.postTaskAllocated}}. In that
case, the task will transition to the "Failed" state and the allocationID will
be available, so we can release both the ask and the allocation:
{noformat}
func (task *Task) releaseAllocation() {
    ...
    var releaseRequest *si.AllocationRequest
    s := TaskStates()
    switch task.GetTaskState() {
    case s.New, s.Pending, s.Scheduling, s.Rejected:
        releaseRequest = common.CreateReleaseAskRequestForTask(
            task.applicationID, task.taskID,
            task.application.partition) // <-- release ask + allocation if possible
    default:
        if task.allocationID == "" {
            ... log error ...
            return
        }
        releaseRequest = common.CreateReleaseAllocationRequestForTask(
            task.applicationID, task.taskID,
            task.allocationID, task.application.partition, task.terminationType)
    }
    ...{noformat}