Wilfred Spiegelenburg created YUNIKORN-229:
----------------------------------------------
Summary: shim sends the same remove request twice for a remove re
Key: YUNIKORN-229
URL: https://issues.apache.org/jira/browse/YUNIKORN-229
Project: Apache YuniKorn
Issue Type: Improvement
Components: shim - kubernetes
Reporter: Wilfred Spiegelenburg
In the logs it looks like the shim asks to remove the same allocation twice,
using the same UUID both times:
First release request from shim:
{code}
2020-06-10T05:54:24.564Z DEBUG cache/cluster_info.go:136
enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event":
{"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
2020-06-10T05:54:24.565Z DEBUG scheduler/scheduler.go:191
enqueued event {"eventType":
"*schedulerevent.SchedulerAllocationUpdatesEvent", "event":
{"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
2020-06-10T05:54:24.565Z DEBUG cache/cluster_info.go:136
enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event":
{"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task
completed","ReleaseType":0}]}, "currentQueueSize": 0}
2020-06-10T05:54:24.565Z DEBUG cache/partition_info.go:429
removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
"allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
2020-06-10T05:54:24.566Z INFO cache/partition_info.go:477
allocation removed {"numOfAllocationReleased": 1, "partitionName":
"[mycluster]default"}
2020-06-10T05:54:24.566Z DEBUG rmproxy/rmproxy.go:65 enqueue event
{"event":
{"RmID":"mycluster","ReleasedAllocations":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
completed"}]}, "currentQueueSize": 0}
2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:44
callback received {"updateResponse":
"releasedAllocations:<UUID:\"3bf0a159-89ee-4bdc-ada1-c577ac2097d1\"
message:\"task completed\" > "}
2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:119
callback: response to released allocations {"UUID":
"3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
{code}
Second release request from the shim, about 16 seconds after the first request:
{code}
2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136
enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event":
{"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
2020-06-10T05:54:40.423Z DEBUG scheduler/scheduler.go:191
enqueued event {"eventType":
"*schedulerevent.SchedulerAllocationUpdatesEvent", "event":
{"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136
enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event":
{"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task
completed","ReleaseType":0}]}, "currentQueueSize": 0}
2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:429
removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
"allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:442 no
active allocations found to release {"appID":
"spark-3a34f5a12bc54c24b7d5f02957cff30c"}
{code}
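To check whether other allocations are affected, the core log could be scanned for UUIDs that appear in more than one release request. The following is only a rough sketch: the function name countReleaseRequests is hypothetical, and it assumes each log entry sits on a single line (unlike the wrapped excerpts above) and that release requests can be recognised by the RMUpdateRequestEvent and allocationsToRelease strings seen in these logs.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidPattern matches the "UUID":"..." field as it appears in the
// RMUpdateRequestEvent entries quoted in this issue.
var uuidPattern = regexp.MustCompile(`"UUID":"([0-9a-fA-F-]+)"`)

// countReleaseRequests is a hypothetical helper: it counts how often each
// allocation UUID shows up in a release request coming from the shim.
// Assumption: one log entry per line.
func countReleaseRequests(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		if !strings.Contains(line, "RMUpdateRequestEvent") ||
			!strings.Contains(line, "allocationsToRelease") {
			continue
		}
		for _, m := range uuidPattern.FindAllStringSubmatch(line, -1) {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Shortened copy of the request entry from the logs above.
	entry := `enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event": {"Request":{"releases":{"allocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task completed"}]},"rmID":"mycluster"}}}`
	for uuid, n := range countReleaseRequests([]string{entry, entry}) {
		if n > 1 {
			fmt.Printf("%s requested for release %d times\n", uuid, n)
		}
	}
}
```

Any UUID reported with a count above one would be another instance of the duplicate request described here.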
The core scheduler handles the duplicate correctly and simply ignores the
second request. However, as the number of tasks in the shim grows, the extra
requests could have a big performance impact, so we need to find out why the
shim sends the release twice.
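One possible mitigation on the shim side, independent of the root cause, would be to remember which allocation UUIDs have already had a release sent and skip repeats. The sketch below only illustrates that idea: releaseTracker and markSent are hypothetical names and not part of the existing shim code, and a real version would also need to drop entries once the release is confirmed so the map does not grow without bound.

```go
package main

import (
	"fmt"
	"sync"
)

// releaseTracker is a hypothetical helper that records the allocation UUIDs
// for which a release request has already been sent to the core.
type releaseTracker struct {
	mu   sync.Mutex
	sent map[string]struct{}
}

func newReleaseTracker() *releaseTracker {
	return &releaseTracker{sent: make(map[string]struct{})}
}

// markSent returns true the first time it sees a UUID and false afterwards,
// so the caller can skip sending a duplicate release request.
func (t *releaseTracker) markSent(uuid string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, seen := t.sent[uuid]; seen {
		return false
	}
	t.sent[uuid] = struct{}{}
	return true
}

func main() {
	tracker := newReleaseTracker()
	uuid := "3bf0a159-89ee-4bdc-ada1-c577ac2097d1" // UUID from the logs above
	for i := 0; i < 2; i++ {
		if tracker.markSent(uuid) {
			fmt.Println("sending release for", uuid)
		} else {
			fmt.Println("skipping duplicate release for", uuid)
		}
	}
}
```

This would only hide the symptom; finding out why the task completion is processed twice is still the actual fix.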