[ 
https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168106#comment-14168106
 ] 

ASF GitHub Bot commented on STORM-513:
--------------------------------------

Github user itaifrenkel commented on a diff in the pull request:

    https://github.com/apache/storm/pull/286#discussion_r18741180
  
    --- Diff: storm-core/src/jvm/backtype/storm/spout/ShellSpout.java ---
    @@ -189,9 +205,52 @@ private void handleLog(ShellMsg shellMsg) {
     
         @Override
         public void activate() {
    +        LOG.info("Start checking heartbeat...");
    +        // prevent timer to check heartbeat based on last thing before activate
    +        setHeartbeat();
    +        heartBeatTimer.scheduleAtFixedRate(new SpoutHeartbeatTimerTask(this), 1000, 1 * 1000);
         }
     
         @Override
         public void deactivate() {
    +        heartBeatTimer.cancel();
    +    }
    +
    +    private void setHeartbeat() {
    +        lastHeartbeatTimestamp.set(System.currentTimeMillis());
    +    }
    +
    +    private long getLastHeartbeat() {
    +        return lastHeartbeatTimestamp.get();
    +    }
    +
    +    private void die(Throwable exception) {
    +        heartBeatTimer.cancel();
    +
    +        LOG.error("Halting process: ShellSpout died.", exception);
    +        _collector.reportError(exception);
    +        System.exit(11);
    --- End diff --
    
    All of our Python and multilang bolts have special code that intercepts the
    SIGTERM signal and kills the process when the parent process dies. This has
    not been contributed back since it is very Linux-specific and
    logger-specific. Without it you might end up with zombie worker processes.
    This does not relate to your commit, since you didn't invent the
    System.exit(11) call; however, it would make things worse when a process is
    not responding. Ideally you would at least want to call process.destroy()
    first. Since process.destroy() is implemented without kill -9, it is not
    guaranteed to work (Sigar implements this per OS quite nicely).
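
To make the "destroy first" suggestion concrete, here is a minimal sketch of stopping the subprocess before the worker exits. It is not code from the pull request: `SubprocessTerminator`, `terminate`, and the `graceMillis` parameter are hypothetical names, and it assumes Java 8 or later for `Process.waitFor(long, TimeUnit)` and `Process.destroyForcibly()`. How forcible each call really is depends on the JVM and the OS.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Storm): stop a child process politely, then forcibly. */
final class SubprocessTerminator {
    private SubprocessTerminator() {}

    /**
     * Asks the process to stop, waits up to graceMillis, and escalates to a
     * forcible kill if it is still running.
     *
     * @return true if the process had exited by the time this method returned
     */
    static boolean terminate(Process process, long graceMillis) throws InterruptedException {
        // Normal termination request; a hung child may ignore it.
        process.destroy();
        if (process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        // Escalate; the exact mechanism is platform-dependent.
        process.destroyForcibly();
        return process.waitFor(graceMillis, TimeUnit.MILLISECONDS);
    }
}
```

A `die(...)` method like the one in the diff could call such a helper on the subprocess handle before `System.exit(11)`, so that a non-responding child is less likely to outlive the worker.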



> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
>                 Key: STORM-513
>                 URL: https://issues.apache.org/jira/browse/STORM-513
>             Project: Apache Storm
>          Issue Type: Bug
>         Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
>            Reporter: Dan Blanchard
>            Priority: Blocker
>
> If I'm understanding everything correctly with how ShellBolts work, the Java 
> ShellBolt executor is the part of the topology that sends heartbeats back to 
> Nimbus to let it know that a particular multilang bolt is still alive.  The 
> problem with this is that if the multilang subprocess/bolt severely hangs 
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java 
> ShellBolt does not seem to notice or care. Simply having the tuple get 
> replayed when it times out will not suffice either, because the subprocess 
> will still be stuck.
> The most obvious way to handle this seems to be to add heartbeating to the 
> multilang protocol itself, so that the ShellBolt expects a message of some 
> kind every {{timeout}} seconds.
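
The "expects a message every timeout seconds" idea from the issue description can be sketched as a small timestamp check. This is an illustration only, not the implementation in the pull request: `HeartbeatMonitor` and its methods are hypothetical names, and the current time is passed in explicitly so the logic can be exercised without a real clock.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: tracks the last heartbeat seen from a multilang subprocess. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final AtomicLong lastHeartbeatMillis;

    HeartbeatMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        // Start the clock at creation so a stale timestamp cannot trigger a false alarm.
        this.lastHeartbeatMillis = new AtomicLong(nowMillis);
    }

    /** Called by the reader thread whenever the subprocess sends any message. */
    void recordHeartbeat(long nowMillis) {
        lastHeartbeatMillis.set(nowMillis);
    }

    /** Called by a periodic timer task; true means the subprocess looks hung. */
    boolean isExpired(long nowMillis) {
        return nowMillis - lastHeartbeatMillis.get() > timeoutMillis;
    }
}
```

In a real executor the callers would pass `System.currentTimeMillis()`, and a timer task would treat `isExpired(...)` returning true as the signal to stop the subprocess and fail the worker.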



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
