[ https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168132#comment-14168132 ]
ASF GitHub Bot commented on STORM-513:
--------------------------------------

Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/storm/pull/286#discussion_r18741401

    --- Diff: storm-core/src/jvm/backtype/storm/spout/ShellSpout.java ---
    @@ -189,9 +205,52 @@ private void handleLog(ShellMsg shellMsg) {

         @Override
         public void activate() {
    +        LOG.info("Start checking heartbeat...");
    +        // prevent timer to check heartbeat based on last thing before activate
    +        setHeartbeat();
    +        heartBeatTimer.scheduleAtFixedRate(new SpoutHeartbeatTimerTask(this), 1000, 1 * 1000);
         }

         @Override
         public void deactivate() {
    +        heartBeatTimer.cancel();
    +    }
    +
    +    private void setHeartbeat() {
    +        lastHeartbeatTimestamp.set(System.currentTimeMillis());
    +    }
    +
    +    private long getLastHeartbeat() {
    +        return lastHeartbeatTimestamp.get();
    +    }
    +
    +    private void die(Throwable exception) {
    +        heartBeatTimer.cancel();
    +
    +        LOG.error("Halting process: ShellSpout died.", exception);
    +        _collector.reportError(exception);
    +        System.exit(11);
    --- End diff --

    @itaifrenkel I agree that we should call process.destroy() before the
    worker terminates itself. (It is maintained by the JDK and implemented
    with JNI, so the implementation is OS-specific.)
    I also think that, since the Storm project tries to support Windows, a
    signal handler for SIGTERM may not be a solution.
    I'll change it to call process.destroy() first.


> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
>                 Key: STORM-513
>                 URL: https://issues.apache.org/jira/browse/STORM-513
>             Project: Apache Storm
>          Issue Type: Bug
>         Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
>            Reporter: Dan Blanchard
>            Priority: Blocker
>
> If I'm understanding everything correctly with how ShellBolts work, the Java
> ShellBolt executor is the part of the topology that sends heartbeats back to
> Nimbus to let it know that a particular multilang bolt is still alive. The
> problem with this is that if the multilang subprocess/bolt severely hangs
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java
> ShellBolt does not seem to notice or care. Simply having the tuple get
> replayed when it times out will not suffice either, because the subprocess
> will still be stuck.
> The most obvious way to handle this seems to be to add heartbeating to the
> multilang protocol itself, so that the ShellBolt expects a message of some
> kind every {{timeout}} seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
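
For readers following the thread, the heartbeat check discussed above can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code from the pull request: the class `HeartbeatWatchdog`, its method names, and the `timeoutMillis` parameter are all hypothetical, and it uses a `ScheduledExecutorService` where the diff uses a timer. It assumes the parent records a timestamp whenever the child sends any message, and that on timeout the child is destroyed with `Process.destroy()` first, as proposed in the comment.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical watchdog for a child process; not the ShellSpout code from the PR. */
class HeartbeatWatchdog {
    private final Process child;        // supervised subprocess (may be null in tests)
    private final long timeoutMillis;   // longest tolerated silence from the child
    private final AtomicLong lastHeartbeat = new AtomicLong();
    private final AtomicBoolean expired = new AtomicBoolean(false);
    // Daemon thread so the watchdog never keeps the JVM alive by itself.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "heartbeat-watchdog");
                t.setDaemon(true);
                return t;
            });

    HeartbeatWatchdog(Process child, long timeoutMillis) {
        this.child = child;
        this.timeoutMillis = timeoutMillis;
    }

    /** Record a heartbeat; call this whenever a message arrives from the child. */
    void beat() {
        lastHeartbeat.set(System.currentTimeMillis());
    }

    /** Reset the timestamp first so a stale value cannot cause a false timeout. */
    void start(long periodMillis) {
        beat();
        timer.scheduleAtFixedRate(this::check, periodMillis, periodMillis,
                TimeUnit.MILLISECONDS);
    }

    void stop() {
        timer.shutdownNow();
    }

    boolean hasExpired() {
        return expired.get();
    }

    /** Pure timeout test, kept separate so it can be checked without waiting. */
    static boolean isExpired(long lastBeat, long now, long timeoutMillis) {
        return now - lastBeat > timeoutMillis;
    }

    private void check() {
        if (isExpired(lastHeartbeat.get(), System.currentTimeMillis(), timeoutMillis)
                && expired.compareAndSet(false, true)) {
            if (child != null) {
                child.destroy();   // kill the hung child before the parent reacts further
            }
            timer.shutdown();      // cancels the periodic check
        }
    }
}
```

In this sketch, the thread that reads the child's output would call `beat()` for every message received, and the owner would poll `hasExpired()` (or be extended with a callback) to decide whether to report the error and exit.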