[ https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163526#comment-14163526 ]
ASF GitHub Bot commented on STORM-513:
--------------------------------------

GitHub user HeartSaVioR opened a pull request:

    https://github.com/apache/storm/pull/286

    STORM-513 check heartbeat from multilang subprocess

    Related issue link: https://issues.apache.org/jira/browse/STORM-513

    ShellSpout and ShellBolt don't seem to check on their subprocess; they set the heartbeat based only on their own state. The subprocess can hang without affecting ShellSpout / ShellBolt: it simply stops processing tuples. It's better to check for a heartbeat from the subprocess and have the worker kill itself ("suicide") if the subprocess stops working.

    * Spout
      * ShellSpout sends "next" to the subprocess continuously
      * the subprocess sends "sync" to ShellSpout when "next" is received
      * so we can treat "sync", or any other message, as a heartbeat
    * Bolt
      * ShellBolt sends tuples to the subprocess only when tuples are available
      * so we need to send a "heartbeat" tuple
      * the subprocess sends "sync" to ShellBolt when the "heartbeat" tuple is received

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HeartSaVioR/storm STORM-513

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/286.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #286

----
commit ca5874cdf11af8d835335d228b643f28aeb3f9c3
Author: Jungtaek Lim <kabh...@gmail.com>
Date:   2014-10-08T13:48:01Z

    STORM-513 check heartbeat from multilang subprocess

    * Spout
    ** ShellSpout sends "next" to subprocess continuously
    ** subprocess sends "sync" to ShellSpout when "next" is received
    ** so we can treat "sync", or any messages to heartbeat
    * Bolt
    ** ShellBolt sends tuples to subprocess if it's available
    ** so we need to send "heartbeat" tuple
    ** subprocess sends "sync" to ShellBolt when "heartbeat" tuple is received

commit 1a0d4bdd735ba0ade42f6777a4c47affec931557
Author: Jungtaek Lim <kabh...@gmail.com>
Date:   2014-10-08T14:06:24Z

    Fix mixed tab / space,
    remove FIXME

----

> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
>                 Key: STORM-513
>                 URL: https://issues.apache.org/jira/browse/STORM-513
>             Project: Apache Storm
>          Issue Type: Bug
>         Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
>            Reporter: Dan Blanchard
>            Priority: Blocker
>
> If I'm understanding everything correctly with how ShellBolts work, the Java
> ShellBolt executor is the part of the topology that sends heartbeats back to
> Nimbus to let it know that a particular multilang bolt is still alive. The
> problem with this is that if the multilang subprocess/bolt severely hangs
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java
> ShellBolt does not seem to notice or care. Simply having the tuple get
> replayed when it times out will not suffice either, because the subprocess
> will still be stuck.
> The most obvious way to handle this seems to be to add heartbeating to the
> multilang protocol itself, so that the ShellBolt expects a message of some
> kind every {{timeout}} seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
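The heartbeat scheme described in the pull request and in the issue can be sketched roughly as follows. This is a minimal, hypothetical sketch and not the code from pull request #286: the class `HeartbeatMonitor`, its methods, the injectable clock and the 30-second timeout used in `main` are all illustrative assumptions. The shell component records when it last received any message from the subprocess (for a spout, each "sync" reply to "next"; for a bolt, the "sync" reply to a periodic "heartbeat" tuple), and a periodic task compares the elapsed time against the timeout.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not a Storm API): tracks the last time the multilang
 * subprocess sent any message and reports whether it has gone silent.
 */
public class HeartbeatMonitor {
    private final AtomicLong lastHeartbeatMs;
    private final long timeoutMs;
    private final LongSupplier clock; // injectable clock so the logic is testable

    public HeartbeatMonitor(long timeoutMs, LongSupplier clock) {
        this.timeoutMs = timeoutMs;
        this.clock = clock;
        this.lastHeartbeatMs = new AtomicLong(clock.getAsLong());
    }

    /** Call whenever any message (e.g. "sync") arrives from the subprocess. */
    public void onMessageFromSubprocess() {
        lastHeartbeatMs.set(clock.getAsLong());
    }

    /** True if no message has arrived for longer than the timeout. */
    public boolean isTimedOut() {
        return clock.getAsLong() - lastHeartbeatMs.get() > timeoutMs;
    }

    /**
     * Starts a periodic task. sendHeartbeat is where a bolt would emit a
     * "heartbeat" tuple (a spout can pass a no-op, since it sends "next"
     * continuously); onTimeout is where the real component would kill the
     * subprocess and the worker.
     */
    public ScheduledExecutorService startChecker(long periodMs,
                                                 Runnable sendHeartbeat,
                                                 Runnable onTimeout) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            if (isTimedOut()) {
                onTimeout.run();
            } else {
                sendHeartbeat.run();
            }
        }, periodMs, periodMs, TimeUnit.MILLISECONDS);
        return ses;
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock in milliseconds
        HeartbeatMonitor m = new HeartbeatMonitor(30_000, () -> now[0]);
        now[0] = 10_000;
        m.onMessageFromSubprocess();         // "sync" received at t=10s
        now[0] = 35_000;
        System.out.println(m.isTimedOut()); // false: 25s since last message
        now[0] = 41_000;
        System.out.println(m.isTimedOut()); // true: 31s since last message
    }
}
```

Passing the clock in as a `LongSupplier` keeps the timeout logic testable without sleeping; a real component could supply `System::currentTimeMillis` instead and use `startChecker` to drive the periodic check.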