GitHub user shuai-xu commented on a diff in the pull request:
https://github.com/apache/flink/pull/4887#discussion_r148715651
--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManager.java ---
@@ -302,7 +302,12 @@ public boolean unregisterSlotRequest(AllocationID allocationId) {
 		PendingSlotRequest pendingSlotRequest = pendingSlotRequests.remove(allocationId);
 		if (null != pendingSlotRequest) {
-			cancelPendingSlotRequest(pendingSlotRequest);
+			if (pendingSlotRequest.isAssigned()) {
+				cancelPendingSlotRequest(pendingSlotRequest);
+			}
+			else {
+				resourceActions.cancelResourceAllocation(pendingSlotRequest.getResourceProfile());
--- End diff ---
Yes, the SlotManager may decide to release more resources than it needs.
But consider a worst case:
1. The Mesos or YARN cluster does not have enough resources.
2. A job asks for 100 workers of size A.
3. Since there are not enough resources, the job fails over; the previous 100 requests are never cancelled, and it asks for another 100.
4. After this repeats several times, the pending requests for workers of size A reach 10000.
5. A worker of size B crashes, so the job now needs only 100 workers of size A and 1 worker of size B. But YARN or Mesos still believes the job needs 10000 A and 1 B, because the requests were never cancelled.
6. Mesos/YARN now has enough resources for 110 A, more than the 100 A and 1 B actually needed, and begins assigning resources to the job. But it first tries to allocate 10000 containers of size A, so the job still cannot start because it lacks the container of size B.
7. This can leave the job unable to start for a long time, even after the cluster has enough resources.
8. And this did actually happen in our cluster, since our cluster is busy. So I think it's better to keep this protocol, and different resource managers can handle it according to their needs (a minimal sketch of the bookkeeping it enables is below).
---