[
https://issues.apache.org/jira/browse/YUNIKORN-229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154193#comment-17154193
]
Weiwei Yang commented on YUNIKORN-229:
--------------------------------------
I just had a conversation with [~Huang Ting Yao]. This looks like an
intermittent issue: Tingyao tried and confirmed that it does not happen every
time. My guess is that K8s sometimes sends the same deletePod event twice,
which triggers the duplicate release. The good news is that the core already
handles this correctly:
{quote}
2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:442 no
active allocations found to release {"appID":
"spark-3a34f5a12bc54c24b7d5f02957cff30c"}
{quote}
I don't think anything else needs to be done at the moment. Let's close this
for now and reopen it if we see it again or find a way to reproduce it.
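For reference, the behaviour behind the log line quoted above can be sketched as an idempotent release. This is only a minimal illustration: the partition type and its add and release methods are hypothetical names made up for this example and are not the actual code in cache/partition_info.go. It assumes allocations are tracked per application and keyed by UUID, so a second release for the same UUID finds nothing and is ignored.

```go
package main

import "fmt"

// partition is a hypothetical, simplified stand-in for the scheduler's
// partition cache. It is NOT the real YuniKorn type.
type partition struct {
	// allocations maps application ID -> set of allocation UUIDs.
	allocations map[string]map[string]struct{}
}

func newPartition() *partition {
	return &partition{allocations: make(map[string]map[string]struct{})}
}

// add records an allocation UUID for the given application.
func (p *partition) add(appID, uuid string) {
	if p.allocations[appID] == nil {
		p.allocations[appID] = make(map[string]struct{})
	}
	p.allocations[appID][uuid] = struct{}{}
}

// release removes the allocation if it is present and reports whether
// anything was removed. A repeated request for the same UUID is a no-op.
func (p *partition) release(appID, uuid string) bool {
	allocs, ok := p.allocations[appID]
	if !ok {
		return false // unknown application: nothing to release
	}
	if _, found := allocs[uuid]; !found {
		return false // already released or never allocated
	}
	delete(allocs, uuid)
	return true
}

func main() {
	p := newPartition()
	p.add("spark-app", "3bf0a159")
	fmt.Println(p.release("spark-app", "3bf0a159")) // first request removes it
	fmt.Println(p.release("spark-app", "3bf0a159")) // duplicate request is ignored
}
```

With this shape a duplicate request costs one map lookup on the core side, which is why the second release is harmless for correctness.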
> shim sends the same remove request twice for a remove allocation
> ----------------------------------------------------------------
>
> Key: YUNIKORN-229
> URL: https://issues.apache.org/jira/browse/YUNIKORN-229
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Ting Yao,Huang
> Priority: Major
>
> In the logs it looks like the shim asks to remove the same allocation using
> the same UUID:
> First release request from shim:
> {code}
> 2020-06-10T05:54:24.564Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event":
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG scheduler/scheduler.go:191
> enqueued event {"eventType":
> "*schedulerevent.SchedulerAllocationUpdatesEvent", "event":
> {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event":
> {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task
> completed","ReleaseType":0}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG cache/partition_info.go:429
> removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
> "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> 2020-06-10T05:54:24.566Z INFO cache/partition_info.go:477
> allocation removed {"numOfAllocationReleased": 1, "partitionName":
> "[mycluster]default"}
> 2020-06-10T05:54:24.566Z DEBUG rmproxy/rmproxy.go:65 enqueue event
> {"event":
> {"RmID":"mycluster","ReleasedAllocations":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:44
> callback received {"updateResponse":
> "releasedAllocations:<UUID:\"3bf0a159-89ee-4bdc-ada1-c577ac2097d1\"
> message:\"task completed\" > "}
> 2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:119
> callback: response to released allocations {"UUID":
> "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> {code}
> Second release request from shim, about 16 seconds after the first request:
> {code}
> 2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event":
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG scheduler/scheduler.go:191
> enqueued event {"eventType":
> "*schedulerevent.SchedulerAllocationUpdatesEvent", "event":
> {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event":
> {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task
> completed","ReleaseType":0}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:429
> removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
> "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> 2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:442 no
> active allocations found to release {"appID":
> "spark-3a34f5a12bc54c24b7d5f02957cff30c"}
> {code}
> The core scheduler handles it correctly and just ignores the request.
> However, as the number of tasks in the shim grows this could have a big
> performance impact, so we need to find out why the shim sends the remove
> request twice.