[
https://issues.apache.org/jira/browse/STORM-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426632#comment-15426632
]
ASF GitHub Bot commented on STORM-1707:
---------------------------------------
Github user abellina commented on a diff in the pull request:
https://github.com/apache/storm/pull/1370#discussion_r75329245
--- Diff:
storm-core/src/jvm/org/apache/storm/daemon/supervisor/SyncProcessEvent.java ---
@@ -110,34 +117,47 @@ public void run() {
keeperWorkerIds.add(entry.getKey());
keepPorts.add(stateHeartbeat.getHeartbeat().get_port());
}
+ if (stateHeartbeat.getState() == State.NOT_STARTED) {
+ keeperWorkerIds.add(entry.getKey());
+
keepPorts.add(supervisorData.getWorkerIdsToPorts().get(entry.getKey()));
+ }
}
Map<Integer, LocalAssignment> reassignExecutors =
getReassignExecutors(assignedExecutors, keepPorts);
Map<Integer, String> newWorkerIds = new HashMap<>();
for (Integer port : reassignExecutors.keySet()) {
newWorkerIds.put(port, Utils.uuid());
}
+
LOG.debug("Assigned executors: {}", assignedExecutors);
LOG.debug("Allocated: {}", localWorkerStats);
+ LOG.debug("Keeper worker ids: {}", keeperWorkerIds);
+ LOG.debug("Keep ports: {}", keepPorts);
+ LOG.debug("LaunchTimes: {}",
supervisorData.getWorkerIdsToLaunchTimes());
+ LOG.debug("Ids Ports: {}",
supervisorData.getWorkerIdsToPorts());
for (Map.Entry<String, StateHeartbeat> entry :
localWorkerStats.entrySet()) {
StateHeartbeat stateHeartbeat = entry.getValue();
- if (stateHeartbeat.getState() != State.VALID) {
+ if ((stateHeartbeat.getState() != State.VALID &&
stateHeartbeat.getState() != State.NOT_STARTED) ||
+ (stateHeartbeat.getState() == State.NOT_STARTED &&
isWorkerStartTimeoutExpired(entry.getKey()))) {
LOG.info("Shutting down and clearing state for id {},
Current supervisor time: {}, State: {}, Heartbeat: {}", entry.getKey(), now,
--- End diff --
The clojure equivalent for this logs also when the worker could not be
started. So, in addition to the "Shutting down" message, there is a "Worker X
failed to start" message in the case where the timeout expired (e.g. state is
NOT_STARTED). That's a nice thing to have I think (should probably be a
LOG.error).
> Improve supervisor latency by removing 2-min wait
> -------------------------------------------------
>
> Key: STORM-1707
> URL: https://issues.apache.org/jira/browse/STORM-1707
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: Paul Poulosky
> Assignee: Paul Poulosky
>
> After launching workers, the supervisor waits up to 2 minutes synchronously
> for the workers to be "launched".
> We should remove this, and instead keep track of launch time, making the
> "killer" function smart enough to determine the difference between a worker
> that's still launching, one that's timed out, etc.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)