Derek Dagit created STORM-1072:
----------------------------------

             Summary: Nimbus gives incomplete cluster data to scheduler (hides dead worker slots)
                 Key: STORM-1072
                 URL: https://issues.apache.org/jira/browse/STORM-1072
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 0.11.0
            Reporter: Derek Dagit


1. Describe observed behavior.

Slots that have been assigned, but whose workers have not yet sent a 
heartbeat, are treated as "dead" slots and are left out of the cluster 
summary data that is passed to the scheduler.

Link to the Nimbus code:
https://github.com/apache/storm/blob/8dd9e6e213210009968f39483cb69f271b2e8415/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L527

For topologies whose payload is very large, this can produce scheduling 
results that never quite converge, because some of the slots are missing 
from each call to schedule().


2. What is the expected behavior?

Nimbus may be too smart here: it seems better to give the full cluster 
information to the scheduler and let the scheduler make the appropriate 
decision about how to handle workers that are not yet up.
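
The difference between the two views can be sketched with a toy model. 
This is only an illustration under stated assumptions: Slot, 
visibleToScheduler and fullView are hypothetical names invented here, and 
none of this is the actual Nimbus code or Storm's scheduler API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only. Slot, visibleToScheduler and fullView are hypothetical
// names; they are not part of Storm's Nimbus or scheduler APIs.
public class SlotViewSketch {

    // A worker slot on a supervisor node.
    static final class Slot {
        final String node;
        final int port;
        final boolean assigned;      // a worker has been assigned to this slot
        final boolean heartbeating;  // that worker has sent a heartbeat

        Slot(String node, int port, boolean assigned, boolean heartbeating) {
            this.node = node;
            this.port = port;
            this.assigned = assigned;
            this.heartbeating = heartbeating;
        }

        // Assigned, but the worker has not yet sent a heartbeat.
        boolean pendingStart() {
            return assigned && !heartbeating;
        }
    }

    // Observed behavior as described in this report: slots whose workers
    // have not yet heartbeated are dropped before the scheduler sees them.
    static List<Slot> visibleToScheduler(List<Slot> all) {
        List<Slot> out = new ArrayList<>();
        for (Slot s : all) {
            if (!s.pendingStart()) {
                out.add(s);
            }
        }
        return out;
    }

    // Proposed behavior: pass every slot along and let the scheduler decide
    // what to do with workers that are not yet up.
    static List<Slot> fullView(List<Slot> all) {
        return new ArrayList<>(all);
    }

    public static void main(String[] args) {
        List<Slot> cluster = new ArrayList<>();
        cluster.add(new Slot("node-a", 6700, true, true));
        cluster.add(new Slot("node-a", 6701, true, false)); // jar still downloading
        cluster.add(new Slot("node-b", 6700, false, false)); // free slot

        System.out.println("filtered view: " + visibleToScheduler(cluster).size() + " slots");
        System.out.println("full view: " + fullView(cluster).size() + " slots");
    }
}
```

In this model the slot on port 6701 disappears from the filtered view 
until its worker heartbeats, so a scheduler working from that view cannot 
tell that the slot exists and is already assigned.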

3. Outline the steps to reproduce the problem.

Either launch a topology with a very large jar file that takes minutes to 
download, or simulate this by adding a sleep to the supervisor code just 
after the jar is downloaded.  This causes a significant delay before the 
worker is up and heartbeating in.  On each scheduling run, such slots will 
not even be present for the scheduler logic.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
