Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2918#discussion_r239517824
  
    --- Diff: storm-server/src/main/java/org/apache/storm/scheduler/blacklist/strategies/RasBlacklistStrategy.java ---
    @@ -79,25 +81,46 @@
     
                 if (shortage.areAnyOverZero() || shortageSlots > 0) {
                     LOG.info("Need {} and {} slots more. Releasing some blacklisted nodes to cover it.", shortage, shortageSlots);
    -                //release earliest blacklist
    -                for (String supervisor : blacklistedNodeIds) {
    -                    SupervisorDetails sd = availableSupervisors.get(supervisor);
    -                    if (sd != null) {
    -                        NormalizedResourcesWithMemory sdAvailable = cluster.getAvailableResources(sd);
    -                        int sdAvailableSlots = cluster.getAvailablePorts(sd).size();
    -                        readyToRemove.add(supervisor);
    -                        shortage.remove(sdAvailable, cluster.getResourceMetrics());
    -                        shortageSlots -= sdAvailableSlots;
    -                        LOG.debug("Releasing {} with {} and {} slots leaving {} and {} slots to go", supervisor,
    -                            sdAvailable, sdAvailableSlots, shortage, shortageSlots);
    -                        if (!shortage.areAnyOverZero() && shortageSlots <= 0) {
    -                            // we have enough resources now...
    -                            break;
    +
    +                //release earliest blacklist - but release all supervisors on a given blacklisted host.
    +                Map<String, Set<String>> hostToSupervisorIds = createHostToSupervisorMap(blacklistedNodeIds, cluster);
    +                for (Set<String> supervisorIds : hostToSupervisorIds.values()) {
    +                    for (String supervisorId : supervisorIds) {
    +                        SupervisorDetails sd = availableSupervisors.get(supervisorId);
    +                        if (sd != null) {
    +                            NormalizedResourcesWithMemory sdAvailable = cluster.getAvailableResources(sd);
    +                            int sdAvailableSlots = cluster.getAvailablePorts(sd).size();
    +                            readyToRemove.add(supervisorId);
    +                            shortage.remove(sdAvailable, cluster.getResourceMetrics());
    +                            shortageSlots -= sdAvailableSlots;
    +                            LOG.info("Releasing {} with {} and {} slots leaving {} and {} slots to go", supervisorId,
    +                                    sdAvailable, sdAvailableSlots, shortage, shortageSlots);
    --- End diff --
    
    @danny0405 This code only runs when the cluster is full, meaning a topology
    is not scheduled and there are not enough free resources on the cluster to
    run it. The idea is that if the cluster is full and there are blacklisted
    supervisors, it is better to try to run things on possibly bad hosts than to
    not run anything at all. The issue here is that the existing code is broken
    when there are multiple supervisors on a single node, and this change fixes
    the existing code to handle that case properly.
    
    If you have problems with releasing blacklisted supervisors/nodes at all,
    then that is a separate JIRA.
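
    For reference, here is a minimal sketch of the host-grouping idea the new
    loop relies on. It is illustrative only and not necessarily the exact helper
    in this PR; it assumes the blacklist is an ordered collection of supervisor
    IDs and that Cluster#getHost(String supervisorId) resolves a supervisor ID
    to the host it runs on.

        // Illustrative sketch only: group blacklisted supervisor IDs by the host they
        // run on, so all supervisors on a blacklisted host can be released together.
        // Assumes cluster.getHost(supervisorId) returns the supervisor's host; a
        // LinkedHashMap keeps the blacklist order so the earliest-blacklisted hosts
        // are considered first. (Uses java.util Map, Set, HashSet, LinkedHashMap, List.)
        private Map<String, Set<String>> createHostToSupervisorMap(List<String> blacklistedNodeIds, Cluster cluster) {
            Map<String, Set<String>> hostToSupervisorIds = new LinkedHashMap<>();
            for (String supervisorId : blacklistedNodeIds) {
                String host = cluster.getHost(supervisorId);
                if (host != null) {
                    hostToSupervisorIds.computeIfAbsent(host, h -> new HashSet<>()).add(supervisorId);
                }
            }
            return hostToSupervisorIds;
        }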

