[jira] [Commented] (STORM-513) ShellBolt keeps sending heartbeats even when child process is hung

ASF GitHub Bot (JIRA) Sat, 11 Oct 2014 05:54:59 -0700

    [ 
https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168132#comment-14168132
 ]


ASF GitHub Bot commented on STORM-513:
--------------------------------------

Github user HeartSaVioR commented on a diff in the pull request:

    https://github.com/apache/storm/pull/286#discussion_r18741401
  
    --- Diff: storm-core/src/jvm/backtype/storm/spout/ShellSpout.java ---
    @@ -189,9 +205,52 @@ private void handleLog(ShellMsg shellMsg) {
     
         @Override
         public void activate() {
    +        LOG.info("Start checking heartbeat...");
    +        // prevent timer to check heartbeat based on last thing before activate
    +        setHeartbeat();
    +        heartBeatTimer.scheduleAtFixedRate(new SpoutHeartbeatTimerTask(this), 1000, 1 * 1000);
         }
     
         @Override
         public void deactivate() {
    +        heartBeatTimer.cancel();
    +    }
    +
    +    private void setHeartbeat() {
    +        lastHeartbeatTimestamp.set(System.currentTimeMillis());
    +    }
    +
    +    private long getLastHeartbeat() {
    +        return lastHeartbeatTimestamp.get();
    +    }
    +
    +    private void die(Throwable exception) {
    +        heartBeatTimer.cancel();
    +
    +        LOG.error("Halting process: ShellSpout died.", exception);
    +        _collector.reportError(exception);
    +        System.exit(11);
    --- End diff --
    
    @itaifrenkel I agree that we should call process.destroy() before the
    process terminates itself.
    (destroy() is maintained by the JDK and implemented with JNI, so it
    takes care of the OS-specific details.)
    Also, since the Storm project tries to support Windows, a signal handler
    for SIGTERM may not be a solution.
    I'll change it to call process.destroy() first.
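
    A minimal sketch of that ordering, assuming a hypothetical helper class
    ChildTerminator that is not part of the pull request: destroy the
    subprocess through Process.destroy() first, wait briefly for it to exit,
    and only then run the exit step. The exit step is passed in as a Runnable
    purely so the sketch can be exercised without halting the JVM; in the
    diff above it would correspond to System.exit(11). Note that
    Process.waitFor(long, TimeUnit) requires Java 8 or later.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not code from the pull request.
final class ChildTerminator {
    private ChildTerminator() {}

    /**
     * Destroys the child process (if any), waits up to waitMillis for it to
     * exit, then always runs exitAction, e.g. () -> System.exit(11).
     * Returns true if the child was observed to have exited (or was null).
     */
    static boolean destroyThenExit(Process child, long waitMillis, Runnable exitAction) {
        boolean exited = true;
        try {
            if (child != null) {
                child.destroy(); // the JDK handles the platform-specific kill
                exited = child.waitFor(waitMillis, TimeUnit.MILLISECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            exited = false;
        } finally {
            exitAction.run(); // runs even if waiting was interrupted
        }
        return exited;
    }
}
```

    Because the exit step sits in a finally block, the JVM still halts if
    waiting on the child is interrupted, so a stuck child cannot keep the
    worker alive.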


> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
>                 Key: STORM-513
>                 URL: https://issues.apache.org/jira/browse/STORM-513
>             Project: Apache Storm
>          Issue Type: Bug
>         Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
>            Reporter: Dan Blanchard
>            Priority: Blocker
>
> If I'm understanding everything correctly with how ShellBolts work, the Java 
> ShellBolt executor is the part of the topology that sends heartbeats back to 
> Nimbus to let it know that a particular multilang bolt is still alive.  The 
> problem with this is that if the multilang subprocess/bolt severely hangs 
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java 
> ShellBolt does not seem to notice or care. Simply having the tuple get 
> replayed when it times out will not suffice either, because the subprocess 
> will still be stuck.
> The most obvious way to handle this seems to be to add heartbeating to the 
> multilang protocol itself, so that the ShellBolt expects a message of some 
> kind every {{timeout}} seconds.
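
The timeout check proposed in the description can be sketched as follows. This is only an illustration under stated assumptions: the class HeartbeatMonitor and its method names are hypothetical and are not Storm's actual implementation. Timestamps are passed in explicitly (rather than read from System.currentTimeMillis()) so the logic can be checked deterministically.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the proposed idea; not Storm's actual implementation.
final class HeartbeatMonitor {
    private final AtomicLong lastHeartbeatMillis;
    private final long timeoutMillis;

    HeartbeatMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        // Start the clock at creation so a fresh monitor is not already expired.
        this.lastHeartbeatMillis = new AtomicLong(nowMillis);
    }

    /** Call whenever a message arrives from the multilang subprocess. */
    void recordHeartbeat(long nowMillis) {
        lastHeartbeatMillis.set(nowMillis);
    }

    /** True when no message has arrived within the timeout window. */
    boolean isTimedOut(long nowMillis) {
        return nowMillis - lastHeartbeatMillis.get() > timeoutMillis;
    }
}
```

A periodic task on the Java side could call isTimedOut(System.currentTimeMillis()) and treat a true result as a hung subprocess.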



