Derek Dagit created STORM-1072:
----------------------------------
Summary: Nimbus gives incomplete cluster data to scheduler (hides
dead worker slots)
Key: STORM-1072
URL: https://issues.apache.org/jira/browse/STORM-1072
Project: Apache Storm
Issue Type: Bug
Affects Versions: 0.11.0
Reporter: Derek Dagit
1. Describe observed behavior.
Slots that have been assigned, but whose workers have not yet sent a
heartbeat, are treated as "dead" slots and are omitted from the cluster
summary data that is passed to the scheduler.
[link to nimbus code|https://github.com/apache/storm/blob/8dd9e6e213210009968f39483cb69f271b2e8415/storm-core/src/clj/backtype/storm/daemon/nimbus.clj#L527]
For topologies whose payload is very large, scheduling results may never
quite converge, because some of the slots are missing on each call to
schedule().
2. What is the expected behavior?
Nimbus may be too smart here: it seems better to give the full cluster
information to the scheduler and let the scheduler make the appropriate
decision about how to handle workers that are not yet up.
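
The difference between the two behaviors can be illustrated with a small,
self-contained Java model. This is only a sketch: SlotViewSketch, Slot,
SlotState, filteredView, fullView and sampleCluster are hypothetical names
invented for illustration and are not Storm's actual classes or API. The
supervisor id and port numbers are made-up sample values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; not Storm's real data structures.
class SlotViewSketch {

    // Assumed states of a worker slot as seen by the master.
    enum SlotState { FREE, ASSIGNED_HEARTBEATING, ASSIGNED_NO_HEARTBEAT }

    static final class Slot {
        final String supervisorId;
        final int port;
        final SlotState state;

        Slot(String supervisorId, int port, SlotState state) {
            this.supervisorId = supervisorId;
            this.port = port;
            this.state = state;
        }
    }

    // Models the observed behavior: slots whose workers have not yet
    // sent a heartbeat are dropped before the scheduler sees them.
    static List<Slot> filteredView(List<Slot> all) {
        List<Slot> visible = new ArrayList<>();
        for (Slot s : all) {
            if (s.state != SlotState.ASSIGNED_NO_HEARTBEAT) {
                visible.add(s);
            }
        }
        return visible;
    }

    // Models the expected behavior: every slot is passed through with its
    // state, so the scheduler can decide what to do with slow starters.
    static List<Slot> fullView(List<Slot> all) {
        return new ArrayList<>(all);
    }

    // Sample cluster: one free slot, one healthy worker, one worker still starting.
    static List<Slot> sampleCluster() {
        List<Slot> all = new ArrayList<>();
        all.add(new Slot("sup-1", 6700, SlotState.FREE));
        all.add(new Slot("sup-1", 6701, SlotState.ASSIGNED_HEARTBEATING));
        all.add(new Slot("sup-1", 6702, SlotState.ASSIGNED_NO_HEARTBEAT));
        return all;
    }

    public static void main(String[] args) {
        List<Slot> all = sampleCluster();
        System.out.println("filtered view: " + filteredView(all).size() + " slots");
        System.out.println("full view:     " + fullView(all).size() + " slots");
    }
}
```

In this model the scheduler working from the filtered view cannot tell that
port 6702 exists at all, whereas the full view lets it see the slot and its
state and choose, for example, to wait for the worker.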
3. Outline the steps to reproduce the problem.
Either launch a topology with a very large jar file that takes minutes to
download, or simulate by adding a sleep to the supervisor code just after the
jar is downloaded. This will cause a significant delay before the worker is up
and heartbeating in. On each scheduling run, such slots will not even be
present for the scheduler logic.
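
To get a feel for how many scheduling runs are affected by a given startup
delay, the arithmetic can be sketched as below. This is a rough model under
stated assumptions, not a measurement: StartupDelaySketch and
roundsWithHiddenSlot are hypothetical names, scheduling is assumed to run at
a fixed interval starting one interval after the assignment, and the delay
and interval values are illustrative rather than Storm defaults.

```java
// Hypothetical back-of-the-envelope model; not part of Storm.
class StartupDelaySketch {

    // Counts the scheduling rounds that run strictly before the worker's
    // first heartbeat, i.e. the rounds in which its slot would be hidden.
    // Assumes the assignment happens at t = 0 and rounds run at
    // t = interval, 2 * interval, and so on.
    static int roundsWithHiddenSlot(int startupDelaySecs, int scheduleIntervalSecs) {
        if (scheduleIntervalSecs <= 0) {
            throw new IllegalArgumentException("scheduleIntervalSecs must be positive");
        }
        int hidden = 0;
        for (int t = scheduleIntervalSecs; t < startupDelaySecs; t += scheduleIntervalSecs) {
            hidden++;
        }
        return hidden;
    }

    public static void main(String[] args) {
        // Illustrative values: a 5-minute jar download, scheduling every 10 seconds.
        System.out.println(roundsWithHiddenSlot(300, 10) + " rounds with the slot hidden");
    }
}
```

Under these assumptions a worker that needs 300 seconds to start is missing
from 29 consecutive scheduling rounds, which is the window in which the
scheduler keeps working from incomplete data.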