[ https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168215#comment-14168215 ]
ASF GitHub Bot commented on STORM-513:
--------------------------------------

Github user itaifrenkel commented on a diff in the pull request:

    https://github.com/apache/storm/pull/286#discussion_r18742240

    --- Diff: storm-core/src/jvm/backtype/storm/spout/ShellSpout.java ---
    @@ -56,13 +66,18 @@ public void open(Map stormConf, TopologyContext context,
             _collector = collector;
             _context = context;
    +        workerTimeoutMills = 1000 * RT.intCast(stormConf.get(Config.SUPERVISOR_WORKER_TIMEOUT_SECS));
    +
             _process = new ShellProcess(_command);
             Number subpid = _process.launch(stormConf, context);
             LOG.info("Launched subprocess with pid " + subpid);
    +
    +        heartBeatExecutor = new ScheduledThreadPoolExecutor(5);
    --- End diff --

    1. You should ask the committers whether they are happy with another
       thread or would rather have you use tick tuples. One argument for a
       separate thread is that a multilang bolt might want to use tick
       tuples too. But still...
    2. One thread should be enough.
    3. This thread must not keep the process from exiting when main exits
       (as in tests), so it should be a daemon thread. The way to do it,
       AFAIK, is this:

       heartBeatExecutor = MoreExecutors.getExitingScheduledExecutorService(Executors.newScheduledThreadPool(1))


> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
>                 Key: STORM-513
>                 URL: https://issues.apache.org/jira/browse/STORM-513
>             Project: Apache Storm
>          Issue Type: Bug
>         Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
>            Reporter: Dan Blanchard
>            Priority: Blocker
>
> If I'm understanding everything correctly with how ShellBolts work, the Java
> ShellBolt executor is the part of the topology that sends heartbeats back to
> Nimbus to let it know that a particular multilang bolt is still alive. The
> problem with this is that if the multilang subprocess/bolt severely hangs
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java
> ShellBolt does not seem to notice or care. Simply having the tuple get
> replayed when it times out will not suffice either, because the subprocess
> will still be stuck.
> The most obvious way to handle this seems to be to add heartbeating to the
> multilang protocol itself, so that the ShellBolt expects a message of some
> kind every {{timeout}} seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
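
For illustration only, here is a minimal sketch of the idea discussed above: a single daemon thread that checks whether the subprocess has sent a heartbeat within the timeout. It is not the code from the pull request. The class name HeartbeatWatchdog and its methods are hypothetical, and it uses a plain JDK ThreadFactory to mark the thread as a daemon rather than Guava's MoreExecutors, so that it compiles without extra dependencies.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical watchdog for a multilang subprocess; not part of Storm. */
final class HeartbeatWatchdog {
    private final AtomicLong lastHeartbeatNanos = new AtomicLong(System.nanoTime());
    private final long timeoutNanos;
    private final Runnable onTimeout;
    private final ScheduledExecutorService executor;

    HeartbeatWatchdog(long timeoutMillis, Runnable onTimeout) {
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        this.onTimeout = onTimeout;
        // A daemon thread does not keep the JVM alive after main exits.
        ThreadFactory daemonFactory = runnable -> {
            Thread t = new Thread(runnable, "heartbeat-watchdog");
            t.setDaemon(true);
            return t;
        };
        // One thread is enough for a periodic check.
        this.executor = Executors.newSingleThreadScheduledExecutor(daemonFactory);
    }

    /** Starts the periodic check; the timeout clock starts now. */
    public void start(long checkIntervalMillis) {
        recordHeartbeat();
        executor.scheduleWithFixedDelay(
                this::check, checkIntervalMillis, checkIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Call whenever the subprocess sends any sign of life. */
    public void recordHeartbeat() {
        lastHeartbeatNanos.set(System.nanoTime());
    }

    /** True if more than the timeout has elapsed since the last heartbeat. */
    public boolean isExpired(long nowNanos) {
        return nowNanos - lastHeartbeatNanos.get() > timeoutNanos;
    }

    // Runs the callback once on expiry, then stops further checks.
    private void check() {
        if (isExpired(System.nanoTime())) {
            try {
                onTimeout.run();
            } finally {
                executor.shutdown();
            }
        }
    }

    public void stop() {
        executor.shutdownNow();
    }
}
```

In a spout or bolt, recordHeartbeat() would presumably be called each time a message arrives from the subprocess, and the onTimeout callback would decide how to fail (for example, by killing the worker). Both of those choices are assumptions here, not something this thread has settled.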