Github user revans2 commented on a diff in the pull request:
https://github.com/apache/storm/pull/2918#discussion_r239517824
--- Diff:
storm-server/src/main/java/org/apache/storm/scheduler/blacklist/strategies/RasBlacklistStrategy.java
---
@@ -79,25 +81,46 @@
         if (shortage.areAnyOverZero() || shortageSlots > 0) {
             LOG.info("Need {} and {} slots more. Releasing some blacklisted nodes to cover it.", shortage, shortageSlots);
-            //release earliest blacklist
-            for (String supervisor : blacklistedNodeIds) {
-                SupervisorDetails sd = availableSupervisors.get(supervisor);
-                if (sd != null) {
-                    NormalizedResourcesWithMemory sdAvailable = cluster.getAvailableResources(sd);
-                    int sdAvailableSlots = cluster.getAvailablePorts(sd).size();
-                    readyToRemove.add(supervisor);
-                    shortage.remove(sdAvailable, cluster.getResourceMetrics());
-                    shortageSlots -= sdAvailableSlots;
-                    LOG.debug("Releasing {} with {} and {} slots leaving {} and {} slots to go", supervisor,
-                        sdAvailable, sdAvailableSlots, shortage, shortageSlots);
-                    if (!shortage.areAnyOverZero() && shortageSlots <= 0) {
-                        // we have enough resources now...
-                        break;
+
+            //release earliest blacklist - but release all supervisors on a given blacklisted host.
+            Map<String, Set<String>> hostToSupervisorIds = createHostToSupervisorMap(blacklistedNodeIds, cluster);
+            for (Set<String> supervisorIds : hostToSupervisorIds.values()) {
+                for (String supervisorId : supervisorIds) {
+                    SupervisorDetails sd = availableSupervisors.get(supervisorId);
+                    if (sd != null) {
+                        NormalizedResourcesWithMemory sdAvailable = cluster.getAvailableResources(sd);
+                        int sdAvailableSlots = cluster.getAvailablePorts(sd).size();
+                        readyToRemove.add(supervisorId);
+                        shortage.remove(sdAvailable, cluster.getResourceMetrics());
+                        shortageSlots -= sdAvailableSlots;
+                        LOG.info("Releasing {} with {} and {} slots leaving {} and {} slots to go", supervisorId,
+                            sdAvailable, sdAvailableSlots, shortage, shortageSlots);
--- End diff --
@danny0405 This code only runs when the cluster is full, meaning a topology is not scheduled and there are not enough free resources on the cluster to run it. The idea is that if a cluster is full and there are blacklisted supervisors, it is better to try to run things on possibly bad hosts than to run nothing at all. The issue here is that the existing code is broken when there are multiple supervisors on a single node, and this change fixes the existing code to operate properly in that case.
If you have problems with releasing blacklisted supervisors/nodes at all, that is a separate JIRA.
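For context, the core of the fix is grouping blacklisted supervisor IDs by host before releasing them. A minimal sketch of that grouping step (the method name `groupByHost` and the `supervisorToHost` input are assumptions for illustration, not the actual Storm `createHostToSupervisorMap` signature, which takes the `Cluster` instead):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HostGrouping {
    // Hypothetical stand-in for createHostToSupervisorMap: given each
    // supervisor's host, group blacklisted supervisor IDs by host so the
    // caller can release every supervisor on a bad host at once, instead of
    // releasing supervisors one at a time and possibly leaving some
    // supervisors on the same host blacklisted.
    static Map<String, Set<String>> groupByHost(List<String> blacklistedIds,
                                                Map<String, String> supervisorToHost) {
        Map<String, Set<String>> hostToIds = new HashMap<>();
        for (String id : blacklistedIds) {
            String host = supervisorToHost.get(id);
            if (host != null) {
                // Create the per-host set on first sight, then add the ID.
                hostToIds.computeIfAbsent(host, h -> new HashSet<>()).add(id);
            }
        }
        return hostToIds;
    }
}
```

Iterating over the values of that map (as the new diff does) guarantees the inner loop visits all supervisors of one host together, which is exactly the property the old single-loop code lacked.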
---