Github user danny0405 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2918#discussion_r239306840
  
    --- Diff: 
storm-server/src/main/java/org/apache/storm/scheduler/blacklist/strategies/RasBlacklistStrategy.java
 ---
    @@ -79,25 +81,46 @@
     
                 if (shortage.areAnyOverZero() || shortageSlots > 0) {
                     LOG.info("Need {} and {} slots more. Releasing some 
blacklisted nodes to cover it.", shortage, shortageSlots);
    -                //release earliest blacklist
    -                for (String supervisor : blacklistedNodeIds) {
    -                    SupervisorDetails sd = 
availableSupervisors.get(supervisor);
    -                    if (sd != null) {
    -                        NormalizedResourcesWithMemory sdAvailable = 
cluster.getAvailableResources(sd);
    -                        int sdAvailableSlots = 
cluster.getAvailablePorts(sd).size();
    -                        readyToRemove.add(supervisor);
    -                        shortage.remove(sdAvailable, 
cluster.getResourceMetrics());
    -                        shortageSlots -= sdAvailableSlots;
    -                        LOG.debug("Releasing {} with {} and {} slots 
leaving {} and {} slots to go", supervisor,
    -                            sdAvailable, sdAvailableSlots, shortage, 
shortageSlots);
    -                        if (!shortage.areAnyOverZero() && shortageSlots <= 
0) {
    -                            // we have enough resources now...
    -                            break;
    +
    +                //release earliest blacklist - but release all supervisors 
on a given blacklisted host.
    +                Map<String, Set<String>> hostToSupervisorIds = 
createHostToSupervisorMap(blacklistedNodeIds, cluster);
    +                for (Set<String> supervisorIds : 
hostToSupervisorIds.values()) {
    +                    for (String supervisorId : supervisorIds) {
    +                        SupervisorDetails sd = 
availableSupervisors.get(supervisorId);
    +                        if (sd != null) {
    +                            NormalizedResourcesWithMemory sdAvailable = 
cluster.getAvailableResources(sd);
    +                            int sdAvailableSlots = 
cluster.getAvailablePorts(sd).size();
    +                            readyToRemove.add(supervisorId);
    +                            shortage.remove(sdAvailable, 
cluster.getResourceMetrics());
    +                            shortageSlots -= sdAvailableSlots;
    +                            LOG.info("Releasing {} with {} and {} slots 
leaving {} and {} slots to go", supervisorId,
    +                                    sdAvailable, sdAvailableSlots, 
shortage, shortageSlots);
    --- End diff --
    
    @agresch Got your idea.
    In my personal opinion, for production, if a node has a blacklist 
supervisor(which means the supervisor does not send any HBs for some time), 
most of the cases it because the node machine itself has some problems(for now 
there are a few causes like: disk is full or network is in disconnection), so a 
safer way is we never schedule workers to the node if there are some blacklist 
supervisors on it.
    
    If you want to make use of the healthy supervisor on the node(has some 
blacklist supervisors also), at least there is a decision to make sure the 
supervisor is healthy, we can do this through checking the heartbeats of it.


---

Reply via email to