Thanks, this is very helpful advice.
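For anyone hitting the same issue later, here is a minimal sketch of how the ":disallowed" / timed-out check Bobby describes could be automated against a supervisor log. The `classify_restart` helper and the sample log lines are my own illustration, not real Storm log output; adapt the patterns to what your supervisor actually prints.

```python
import re

def classify_restart(log_lines):
    """Hypothetical helper: guess why a Storm worker was restarted by
    scanning supervisor log lines for the two markers Bobby mentions."""
    reasons = []
    for line in log_lines:
        if ":disallowed" in line:
            # nimbus rescheduled the worker somewhere else
            reasons.append("rescheduled")
        elif re.search(r"timed[- ]?out", line, re.IGNORECASE):
            # worker stopped heartbeating (dead, or stalled by GC)
            reasons.append("heartbeat-timeout")
    return reasons

# Made-up sample lines for illustration only:
sample = [
    "2015-10-02 14:01:03 Shutting down and clearing state for id abc. State: :disallowed",
    "2015-10-02 14:05:11 Worker def timed out, shutting down",
]
print(classify_restart(sample))  # -> ['rescheduled', 'heartbeat-timeout']
```

If every restart classifies as a heartbeat timeout, that points at the GC/profiling path below rather than rescheduling by nimbus.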

> On Oct 5, 2015, at 10:29 AM, Bobby Evans <[email protected]> wrote:
> 
> Please check the supervisor log on that node, and also check the worker log 
> for that worker.  If the supervisor prints a message about ":disallowed", 
> then nimbus rescheduled the worker somewhere else.  If it prints a message 
> about a timeout, then the worker was not responding and the supervisor 
> relaunched it, thinking it was dead.  There are usually two causes for this: 
> 1) the worker really was dead, and you will probably see a log message in 
> the worker log with the stack trace of the exception that killed it; or 
> 2) GC was going crazy on that worker and it didn't get enough time to 
> actually heartbeat.  If it is the latter, you really need to do some 
> profiling.  You can test this by increasing the heap size and seeing if 
> that fixes it, or preferably by shutting off your supervisor and attaching 
> a debugger or taking a heap dump to see where the memory is being used.  If 
> you have a memory leak, increasing the heap size will not fix it.
>  - Bobby 
> 
> 
>     On Friday, October 2, 2015 2:14 PM, abe oppenheim 
> <[email protected]> wrote:
> 
> 
> Hi,
> 
> I'm seeing weird behavior in my topologies and was hoping for some advice
> on how to troubleshoot the issue.
> 
> This behavior occurs throughout my topology, but it is easiest to explain
> in terms of one bolt. This bolt has 20 executors. When I submit the
> topology, the executors are evenly split between 2 hosts. The executors
> on one host seem stable, but the Uptime for the executors on the other
> host never grows above roughly 10 minutes; they are constantly being
> re-prepared.
> 
> I don't know what this is symptomatic of or how to diagnose it. All the
> Executors have the same Uptime, so I assume this indicates that their
> Worker is dying.
> 
> Any advice on how to troubleshoot this? Possibly a way to tap into the
> Worker lifecycle so I can confirm it is dying every few minutes? Possibly
> an explanation of why a Worker would die so consistently, and suggestions
> about how to approach this?
> 
> Also, any input on how "bad" this is? My topology still processes stuff,
> but I assume this constant recreation of Executors has a significant
> performance impact?
> 
> thanks,
> Abe
> 
> 
