[
https://issues.apache.org/jira/browse/YUNIKORN-229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154220#comment-17154220
]
Wilfred Spiegelenburg commented on YUNIKORN-229:
------------------------------------------------
If this is a duplicate, the shim should handle it and not pass it on to the
core. The removal had been communicated back to the shim *before* the second
request was received by the shim. This shows that the shim is not correctly
tracking what it has already done.
I also see that I made a typo in the timing: it is not 0.16 seconds but *16
seconds* between the requests.
The shim should have had more than enough time to process the response and
update its caches. The second removal should not have been generated by the
shim, whatever the reason is that k8s is asking for the removal again.
> shim sends the same remove request twice for a remove allocation
> ----------------------------------------------------------------
>
> Key: YUNIKORN-229
> URL: https://issues.apache.org/jira/browse/YUNIKORN-229
> Project: Apache YuniKorn
> Issue Type: Improvement
> Components: shim - kubernetes
> Reporter: Wilfred Spiegelenburg
> Assignee: Ting Yao,Huang
> Priority: Major
> Fix For: 0.9
>
>
> In the logs it looks like the shim asks to remove the same allocation using
> the same UUID:
> First release request from shim:
> {code}
> 2020-06-10T05:54:24.564Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event":
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG scheduler/scheduler.go:191
> enqueued event {"eventType":
> "*schedulerevent.SchedulerAllocationUpdatesEvent", "event":
> {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event":
> {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task
> completed","ReleaseType":0}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.565Z DEBUG cache/partition_info.go:429
> removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
> "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> 2020-06-10T05:54:24.566Z INFO cache/partition_info.go:477
> allocation removed {"numOfAllocationReleased": 1, "partitionName":
> "[mycluster]default"}
> 2020-06-10T05:54:24.566Z DEBUG rmproxy/rmproxy.go:65 enqueue event
> {"event":
> {"RmID":"mycluster","ReleasedAllocations":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:44
> callback received {"updateResponse":
> "releasedAllocations:<UUID:\"3bf0a159-89ee-4bdc-ada1-c577ac2097d1\"
> message:\"task completed\" > "}
> 2020-06-10T05:54:24.566Z DEBUG callback/scheduler_callback.go:119
> callback: response to released allocations {"UUID":
> "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> {code}
> Second release request from shim, about 16 seconds after the first request:
> {code}
> 2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.RMUpdateRequestEvent", "event":
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"rmID":"mycluster"}}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG scheduler/scheduler.go:191
> enqueued event {"eventType":
> "*schedulerevent.SchedulerAllocationUpdatesEvent", "event":
> {"RejectedAllocations":null,"AcceptedAllocations":null,"NewAsks":null,"ToReleases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","message":"task
> completed"}]},"ExistingAllocations":null,"RMId":""}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG cache/cluster_info.go:136
> enqueued event {"eventType": "*cacheevent.ReleaseAllocationsEvent", "event":
> {"AllocationsToRelease":[{"UUID":"3bf0a159-89ee-4bdc-ada1-c577ac2097d1","ApplicationID":"spark-3a34f5a12bc54c24b7d5f02957cff30c","PartitionName":"[mycluster]default","Message":"task
> completed","ReleaseType":0}]}, "currentQueueSize": 0}
> 2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:429
> removing allocations {"appID": "spark-3a34f5a12bc54c24b7d5f02957cff30c",
> "allocationId": "3bf0a159-89ee-4bdc-ada1-c577ac2097d1"}
> 2020-06-10T05:54:40.423Z DEBUG cache/partition_info.go:442 no
> active allocations found to release {"appID":
> "spark-3a34f5a12bc54c24b7d5f02957cff30c"}
> {code}
> The core scheduler handles this correctly and simply ignores the duplicate
> request, but as the number of tasks in the shim grows this could have a
> significant performance impact, so we need to find out why the shim removes
> the allocation twice.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]