Thanks, this is very helpful advice.
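For anyone else who hits this, the supervisor-log check Bobby describes below can be scripted roughly as follows. This is a sketch, not an official procedure: the log path varies by install (often $STORM_HOME/logs/supervisor.log), and the sample log lines here are stand-ins I made up to demonstrate the grep, not verbatim Storm output.

```shell
# Point this at your real supervisor log; the path is an assumption.
SUPERVISOR_LOG=$(mktemp)

# Fake sample lines standing in for real supervisor log output:
printf 'State: :disallowed for worker abc\n'  > "$SUPERVISOR_LOG"
printf 'State: :timed-out for worker abc\n'  >> "$SUPERVISOR_LOG"

# ":disallowed" => nimbus rescheduled the worker somewhere else.
# ":timed-out"  => the supervisor stopped seeing heartbeats (worker crashed,
#                  or GC stalled it) and relaunched it.
grep -cE ':disallowed|:timed-out' "$SUPERVISOR_LOG"   # prints 2

# If it is the GC case, a heap dump of the worker JVM is the next step
# (replace <worker-pid> with the pid from `jps` or `ps`):
# jmap -dump:live,format=b,file=worker.hprof <worker-pid>
```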
> On Oct 5, 2015, at 10:29 AM, Bobby Evans <[email protected]> wrote:
>
> Please check the supervisor log on that node, and also check the worker log
> for the worker. If the supervisor prints out a message about ":disallowed",
> then nimbus rescheduled it some place else. If it prints out a message about
> timed-out, then the worker was not responding and the supervisor relaunched
> it thinking it was dead. There are usually two causes for this. 1) It was
> dead, and you will probably see a log message in the worker log with the
> stack trace for the exception that killed the worker. 2) GC was going crazy
> on that worker and it didn't get enough time to actually heartbeat. If it is
> the latter, you really are going to need to do some profiling. You can test
> this by increasing the heap size and seeing if it fixes it, or preferably
> shutting off your supervisor and attaching a debugger/taking a heap dump to
> see where the memory is being used. If you have a memory leak, increasing
> the heap size will not fix it.
>
> - Bobby
>
>
> On Friday, October 2, 2015 2:14 PM, abe oppenheim
> <[email protected]> wrote:
>
> > Hi,
> >
> > I'm seeing weird behavior in my topologies and was hoping for some advice
> > on how to troubleshoot the issue.
> >
> > This behavior occurs throughout my topology, but it is easiest to explain
> > it as the behavior of one bolt. This bolt has 20 executors. When I submit
> > the topology, the executors are evenly split between 2 hosts. The executors
> > on one host seem stable, but the Uptime for the executors on the other host
> > never grows above 10 minutes or so; they are constantly being re-prepared.
> >
> > I don't know what this is symptomatic of or how to diagnose it. All the
> > Executors have the same Uptime, so I assume this indicates that their
> > Worker is dying.
> >
> > Any advice on how to troubleshoot this? Possibly a way to tap into the
> > Worker lifecycle so I can confirm it is dying every few minutes? Possibly
> > an explanation of why a Worker would die so consistently, and suggestions
> > about how to approach this?
> >
> > Also, any input on how "bad" this is? My topology still processes stuff,
> > but I assume this constant recreation of Executors has a significant
> > performance impact?
> >
> > thanks,
> > Abe
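For later readers: the "increase the heap and see if it fixes it" test Bobby suggests is done through the worker childopts settings. A minimal sketch (the 2048m value is just an example, and as Bobby notes, a bigger heap will not fix a genuine leak):

```yaml
# storm.yaml on the supervisor nodes: JVM options for every worker launched there
worker.childopts: "-Xmx2048m"

# or per topology, without touching cluster-wide config:
topology.worker.childopts: "-Xmx2048m"
```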
