[ 
https://issues.apache.org/jira/browse/STORM-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182569#comment-14182569
 ] 

ASF GitHub Bot commented on STORM-513:
--------------------------------------

Github user itaifrenkel commented on a diff in the pull request:

    https://github.com/apache/storm/pull/286#discussion_r19326582
  
    --- Diff: storm-core/src/jvm/backtype/storm/task/ShellBolt.java ---
    @@ -305,4 +283,95 @@ private void die(Throwable exception) {
                 System.exit(11);
             }
         }
    +
    +    private class BoltHeartbeatTimerTask extends TimerTask {
    +        private ShellBolt bolt;
    +
    +        public BoltHeartbeatTimerTask(ShellBolt bolt) {
    +            this.bolt = bolt;
    +        }
    +
    +        @Override
    +        public void run() {
    +            long currentTimeMillis = System.currentTimeMillis();
    +            long lastHeartbeat = getLastHeartbeat();
    +
    +            LOG.debug("BOLT - current time : {}, last heartbeat : {}, 
worker timeout (ms) : {}",
    +                    currentTimeMillis, lastHeartbeat, workerTimeoutMills);
    +
    +            if (currentTimeMillis - lastHeartbeat > workerTimeoutMills) {
    +                bolt.die(new RuntimeException("subprocess heartbeat 
timeout"));
    +            }
    +
    +            String genId = Long.toString(_rand.nextLong());
    +            try {
    +                _pendingWrites.put(createHeartbeatBoltMessage(genId));
    --- End diff --
    
    I reread the code and think that we need here just to flip an atomicboolean 
(a priority queue for heartbeats of size 1). The reason is that the size of the 
_pendingWrites queue is Config.TOPOLOGY_SHELLBOLT_MAX_PENDING which by its name 
is the number of real tuples to retrieve from the disruptor queue. We set it to 
1 to optimize for shortest latency... which would cause this thread to 
block.... which means you cannot share this thread between bolts event if you 
wanted too... which we need to think if this is an issue or not. A stronger 
argument in favor of a priority queue for heartbeats is that the rate of 
heartbeat messages will not be skewed by the length of the queue. 


> ShellBolt keeps sending heartbeats even when child process is hung
> ------------------------------------------------------------------
>
>                 Key: STORM-513
>                 URL: https://issues.apache.org/jira/browse/STORM-513
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>         Environment: Linux: 2.6.32-431.11.2.el6.x86_64 (RHEL 6.5)
>            Reporter: Dan Blanchard
>            Priority: Blocker
>             Fix For: 0.9.3-rc2
>
>
> If I'm understanding everything correctly with how ShellBolts work, the Java 
> ShellBolt executor is the part of the topology that sends heartbeats back to 
> Nimbus to let it know that a particular multilang bolt is still alive.  The 
> problem with this is that if the multilang subprocess/bolt severely hangs 
> (i.e., it will not even respond to {{SIGALRM}} and the like), the Java 
> ShellBolt does not seem to notice or care. Simply having the tuple get 
> replayed when it times out will not suffice either, because the subprocess 
> will still be stuck.
> The most obvious way to handle this seem to be to add heartbeating to the 
> multilang protocol itself, so that the ShellBolt expects a message of some 
> kind every {{timeout}} seconds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to