[
https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182628#comment-14182628
]
ASF GitHub Bot commented on STORM-513:
--------------------------------------
Github user itaifrenkel commented on a diff in the pull request:
https://github.com/apache/storm/pull/286#discussion_r19330106
--- Diff: storm-core/src/jvm/backtype/storm/task/ShellBolt.java ---
@@ -305,4 +283,95 @@ private void die(Throwable exception) {
System.exit(11);
}
}
+
+ private class BoltHeartbeatTimerTask extends TimerTask {
+ private ShellBolt bolt;
+
+ public BoltHeartbeatTimerTask(ShellBolt bolt) {
+ this.bolt = bolt;
+ }
+
+ @Override
+ public void run() {
+ long currentTimeMillis = System.currentTimeMillis();
+ long lastHeartbeat = getLastHeartbeat();
+
+ LOG.debug("BOLT - current time : {}, last heartbeat : {},
worker timeout (ms) : {}",
+ currentTimeMillis, lastHeartbeat, workerTimeoutMills);
+
+ if (currentTimeMillis - lastHeartbeat > workerTimeoutMills) {
+ bolt.die(new RuntimeException("subprocess heartbeat
timeout"));
+ }
+
+ String genId = Long.toString(_rand.nextLong());
+ try {
+ _pendingWrites.put(createHeartbeatBoltMessage(genId));
--- End diff --
That sounds very good :+1:
> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
> Key: STORM-513
> URL: https://issues.apache.org/jira/browse/STORM-513
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.9.2-incubating
> Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
> Reporter: Dan Blanchard
> Priority: Blocker
> Fix For: 0.9.3-rc2
>
>
> If I'm understanding everything correctly with how ShellBolts work, the Java
> ShellBolt executor is the part of the topology that sends heartbeats back to
> Nimbus to let it know that a particular multilang bolt is still alive. The
> problem with this is that if the multilang subprocess/bolt severely hangs
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java
> ShellBolt does not seem to notice or care. Simply having the tuple get
> replayed when it times out will not suffice either, because the subprocess
> will still be stuck.
> The most obvious way to handle this seem to be to add heartbeating to the
> multilang protocol itself, so that the ShellBolt expects a message of some
> kind every {{timeout}} seconds.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)