[
https://issues.apache.org/jira/browse/YUNIKORN-229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg updated YUNIKORN-229:
-------------------------------------------
Priority: Minor (was: Major)
> shim sends the same remove request twice for a remove allocation
> ----------------------------------------------------------------
>
> Key: YUNIKORN-229
> URL: https://issues.apache.org/jira/browse/YUNIKORN-229
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Ting Yao,Huang
> Priority: Minor
> Labels: backlog
>
> In the logs it looks like the shim asks to remove the same allocation twice,
> using the same UUID:
> First release request from shim:
> {code}
> 2020-06-10T05:54:24.564Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event":
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG scheduler/scheduler.go:191
> enqueued event {"eventType":
> "*schedulerevent.SchedulerAllocationUpdatesEvent", "event":
> {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event":
> {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task
> completed","ReleaseType":0}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG cache/partition_info.go:429
> removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
> "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> 2020-06-10T05:54:24.566Z INFO cache/partition_info.go:477
> allocation removed {"numOfAllocationReleased": 1, "partitionName":
> "[mycluster]default"}
> 2020-06-10T05:54:24.566Z DEBUG rmproxy/rmproxy.go:65 enqueue event
> {"event":
> {"RmID":"mycluster","ReleasedAllocations":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:44
> callback received {"updateResponse":
> "releasedAllocations:<UUID:\"3bf0a159-89ee-4bdc-ada1-c577ac2097d1\"
> message:\"task completed\" > "}
> 2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:119
> callback: response to released allocations {"UUID":
> "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> {code}
> Second release request from shim, about 16 seconds after the first request:
> {code}
> 2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event":
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG scheduler/scheduler.go:191
> enqueued event {"eventType":
> "*schedulerevent.SchedulerAllocationUpdatesEvent", "event":
> {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event":
> {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task
> completed","ReleaseType":0}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:429
> removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
> "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> 2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:442 no
> active allocations found to release {"appID":
> "spark-3a34f5a12bc54c24b7d5f02957cff30c"}
> {code}
> The core scheduler handles this correctly and simply ignores the second
> request. However, as the number of tasks in the shim grows, the duplicate
> requests could have a significant performance impact, so we need to find out
> why the shim sends the removal twice.
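
The root cause is still unknown, so the following is not a fix. It is only a minimal sketch of how the shim side could detect or suppress a repeated release for the same allocation UUID while the cause is investigated. The `releaseTracker` type and its `shouldSend` method are hypothetical names invented for illustration and are not part of the YuniKorn shim or core.

```go
package main

import (
	"fmt"
	"sync"
)

// releaseTracker is a hypothetical helper, not part of the YuniKorn code base.
// It remembers which allocation UUIDs already had a release request sent.
type releaseTracker struct {
	mu   sync.Mutex
	sent map[string]struct{}
}

func newReleaseTracker() *releaseTracker {
	return &releaseTracker{sent: make(map[string]struct{})}
}

// shouldSend reports true only the first time a UUID is seen.
// Later calls with the same UUID return false, marking a duplicate.
func (t *releaseTracker) shouldSend(uuid string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, dup := t.sent[uuid]; dup {
		return false
	}
	t.sent[uuid] = struct{}{}
	return true
}

func main() {
	tracker := newReleaseTracker()
	// UUID taken from the log excerpts in this issue.
	uuid := "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"
	for i := 1; i <= 2; i++ {
		if tracker.shouldSend(uuid) {
			fmt.Printf("request %d: send release for %s\n", i, uuid)
		} else {
			fmt.Printf("request %d: duplicate release for %s suppressed\n", i, uuid)
		}
	}
}
```

A guard like this would also give a convenient place to log where the second request originates. In a long-running process the set would need to be pruned, for example when the owning application is removed.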