[ https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196872#comment-14196872 ]
ASF GitHub Bot commented on STORM-513:
--------------------------------------

Github user HeartSaVioR commented on the pull request:

    https://github.com/apache/storm/pull/286#issuecomment-61719213

I've had a chance to discuss this PR with @clockfly , and he also pointed out that if the subprocess is too busy, it cannot send its heartbeat in time, which I noted at the start of this PR.

Actually, it's better to let the subprocess have a heartbeat thread and send heartbeats periodically. But there are two things to consider.

1. ShellSpout runs with PING-PONG communication, and ShellSpout must wait for "sync" in nextTuple(). So if we change ShellSpout to have a reader thread, we have to implement nextTuple() so that it waits for the reader thread to read "sync", which is a little more complex than the current implementation.
2. We should ensure that the main thread and the heartbeat thread don't write to stdout (maybe a Pipe) at the same time. The GIL might let us ignore this in Python, but there will be other languages that support real (?) threads. Write operations should be done under a lock.

Since I'm not a Javascript (nodejs) guy, and I'm a beginner at Ruby, I can't cover these two points for .js. So I'd like to implement it in a separate PR, once we decide we can't live with this limitation or when I have some more time.

Btw, it can take Nimbus / Supervisor up to SUPERVISOR_WORKER_TIMEOUT_SECS * 2 + a (maybe) to detect a process that died because of a subprocess hang, because there are two heartbeat checks: ShellProcess checks the subprocess (and kills itself if the subprocess cannot respond), and Nimbus / Supervisor checks ShellProcess.
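To illustrate point 2 above, here is a minimal Python sketch; it is not code from the PR. It assumes a subprocess that frames each message as JSON followed by an `end` line, as the multilang protocol does. The `heartbeat` command and the names `LockedWriter`, `start_heartbeat`, and `demo` are hypothetical and only serve the illustration.

```python
import io
import json
import threading
import time


class LockedWriter:
    """Writes whole messages to a shared stream, one writer at a time."""

    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.Lock()

    def send(self, msg):
        # Build the full frame first so the lock is held only for the write.
        payload = json.dumps(msg) + "\nend\n"
        with self._lock:
            self._stream.write(payload)
            self._stream.flush()


def start_heartbeat(writer, interval_secs, stop_event):
    """Starts a daemon thread that sends a heartbeat every interval_secs."""

    def run():
        # Event.wait returns False on timeout and True once the event is set.
        while not stop_event.wait(interval_secs):
            # Hypothetical message; not part of the existing protocol.
            writer.send({"command": "heartbeat"})

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def demo():
    """Runs a 'main thread' emitting tuples next to the heartbeat thread."""
    out = io.StringIO()  # stands in for sys.stdout in a real subprocess
    writer = LockedWriter(out)
    stop = threading.Event()
    heartbeat = start_heartbeat(writer, 0.01, stop)
    for i in range(50):
        writer.send({"command": "emit", "tuple": [i]})
    time.sleep(0.2)  # stay "busy" long enough for heartbeats to go out
    stop.set()
    heartbeat.join()
    return out.getvalue()
```

Because both threads go through `LockedWriter.send`, a heartbeat can never land in the middle of another message. In a real subprocess the stream would presumably be `sys.stdout`, and a language with truly parallel threads would need the same kind of lock.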
(Just for @clockfly )

> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
>                 Key: STORM-513
>                 URL: https://issues.apache.org/jira/browse/STORM-513
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>         Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
>            Reporter: Dan Blanchard
>            Priority: Blocker
>             Fix For: 0.9.3-rc2
>
>
> If I'm understanding everything correctly with how ShellBolts work, the Java
> ShellBolt executor is the part of the topology that sends heartbeats back to
> Nimbus to let it know that a particular multilang bolt is still alive. The
> problem with this is that if the multilang subprocess/bolt severely hangs
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java
> ShellBolt does not seem to notice or care. Simply having the tuple get
> replayed when it times out will not suffice either, because the subprocess
> will still be stuck.
> The most obvious way to handle this seem to be to add heartbeating to the
> multilang protocol itself, so that the ShellBolt expects a message of some
> kind every {{timeout}} seconds.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)