Please check the supervisor log on that node, and also check the log for that
worker. If the supervisor prints a message about ":disallowed", then Nimbus
rescheduled the worker somewhere else. If it prints a message about a
timeout, then the worker was not responding, and the supervisor relaunched it
thinking it was dead. There are usually two causes for this: 1) the worker
really was dead, and you will probably see a log message in the worker log
with the stack trace of the exception that killed it; or 2) GC was thrashing
on that worker and it didn't get enough time to actually heartbeat. If it is
the latter, you really are going to need to do some profiling. You can test
for this by increasing the heap size and seeing if that fixes it, or
preferably by shutting off your supervisor and attaching a debugger or taking
a heap dump to see where the memory is being used. Note that if you have a
memory leak, increasing the heap size will not fix it.
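If you try the heap-size and heap-dump route, one place to set the worker JVM flags is worker.childopts in storm.yaml on the supervisor node. This is only a sketch; the sizes and paths below are examples, not recommendations, and the GC flags shown are the pre-Java-9 HotSpot ones:

```yaml
# storm.yaml on the supervisor node (restart the supervisor after editing).
# %ID% is replaced by Storm with the worker's port, so each worker gets
# its own GC log. Heap size and paths here are placeholder values.
worker.childopts: "-Xmx2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/storm/worker-gc-%ID%.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/storm/heapdumps"
```

With GC logging on, long pauses around the time the supervisor reports the timeout are a strong hint that cause 2) is what you are seeing.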
- Bobby
On Friday, October 2, 2015 2:14 PM, abe oppenheim
<[email protected]> wrote:
Hi,
I'm seeing weird behavior in my topologies and was hoping for some advice
on how to troubleshoot the issue.
This behavior occurs throughout my topology, but it is easiest to explain
it as the behavior of one bolt. This bolt has 20 executors. When I submit
the topology, the executors are evenly split between 2 hosts. The executors
on one host seem stable, but the Uptime for the executors on the other host
never grows beyond roughly ten minutes; they are constantly being re-prepared.
I don't know what this is symptomatic of or how to diagnose it. All the
Executors have the same Uptime, so I assume this indicates that their
Worker is dying.
Any advice on how to troubleshoot this? Possibly a way to tap into the
Worker lifecycle so I can confirm it is dying every few minutes? Possibly
an explanation of why a Worker would die so consistently, and suggestions
about how to approach this?
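(For anyone else hitting this: one low-tech way to confirm the worker process is being restarted, without hooking into Storm internals, is to log the JVM's own start time from the bolt's prepare() method. Each restart then shows up in the worker log as a new JVM name and start time. The class below is a hypothetical, self-contained sketch of that idea; the name LifecycleProbe and the log format are made up here.)

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper (not a Storm API): describes the current JVM so that
// a line logged from IRichBolt.prepare() reveals whether the worker process
// was restarted (new pid and new start time) or only the executor moved.
public class LifecycleProbe {

    // Returns a one-line description of this JVM: its name (usually
    // "pid@hostname"), its start time in epoch millis, and its uptime.
    public static String describeJvm() {
        long startMillis = ManagementFactory.getRuntimeMXBean().getStartTime();
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return "jvm=" + ManagementFactory.getRuntimeMXBean().getName()
                + " started=" + startMillis
                + " uptimeMs=" + uptimeMillis;
    }

    public static void main(String[] args) {
        // In a real bolt you would log this from prepare() instead.
        System.out.println(describeJvm());
    }
}
```

If the "started=" value changes every few minutes in the worker log, the whole JVM is dying, not just the executor being re-prepared.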
Also, any input on how "bad" this is? My topology still processes stuff,
but I assume this constant recreation of Executors has a significant
performance impact?
thanks,
Abe