Github user revans2 commented on a diff in the pull request:
https://github.com/apache/storm/pull/2918#discussion_r239519351
--- Diff:
storm-server/src/main/java/org/apache/storm/scheduler/blacklist/strategies/RasBlacklistStrategy.java
---
@@ -79,25 +81,46 @@
if (shortage.areAnyOverZero() || shortageSlots > 0) {
LOG.info("Need {} and {} slots more. Releasing some
blacklisted nodes to cover it.", shortage, shortageSlots);
- //release earliest blacklist
- for (String supervisor : blacklistedNodeIds) {
- SupervisorDetails sd =
availableSupervisors.get(supervisor);
- if (sd != null) {
- NormalizedResourcesWithMemory sdAvailable =
cluster.getAvailableResources(sd);
- int sdAvailableSlots =
cluster.getAvailablePorts(sd).size();
- readyToRemove.add(supervisor);
- shortage.remove(sdAvailable,
cluster.getResourceMetrics());
- shortageSlots -= sdAvailableSlots;
- LOG.debug("Releasing {} with {} and {} slots
leaving {} and {} slots to go", supervisor,
- sdAvailable, sdAvailableSlots, shortage,
shortageSlots);
- if (!shortage.areAnyOverZero() && shortageSlots <=
0) {
- // we have enough resources now...
- break;
+
+ //release earliest blacklist - but release all supervisors
on a given blacklisted host.
+ Map<String, Set<String>> hostToSupervisorIds =
createHostToSupervisorMap(blacklistedNodeIds, cluster);
+ for (Set<String> supervisorIds :
hostToSupervisorIds.values()) {
+ for (String supervisorId : supervisorIds) {
+ SupervisorDetails sd =
availableSupervisors.get(supervisorId);
+ if (sd != null) {
+ NormalizedResourcesWithMemory sdAvailable =
cluster.getAvailableResources(sd);
+ int sdAvailableSlots =
cluster.getAvailablePorts(sd).size();
+ readyToRemove.add(supervisorId);
+ shortage.remove(sdAvailable,
cluster.getResourceMetrics());
+ shortageSlots -= sdAvailableSlots;
+ LOG.info("Releasing {} with {} and {} slots
leaving {} and {} slots to go", supervisorId,
+ sdAvailable, sdAvailableSlots,
shortage, shortageSlots);
}
}
+ // make sure we've handled all supervisors on the host
before we break
+ if (!shortage.areAnyOverZero() && shortageSlots <= 0) {
+ // we have enough resources now...
+ break;
--- End diff --
I think we need to break after we have released all of the supervisors on
the node, otherwise we can have released a single supervisor on a node and got
enough resources, but but because there is still at least one supervisor on
that node still blacklisted the whole node is still blacklisted.
---