What may be helpful to dive into this a bit more is "pyrasite". You need gdb
installed on the machine, but afterwards you can attach to a running process
and use Python "payloads" to investigate what's going on, for example to dump
the stack trace per thread.
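To illustrate, a payload along these lines should do it (a rough sketch of my
own, not one of pyrasite's bundled payloads; you would save it as, say,
dump_stacks.py and run "pyrasite <scheduler-pid> dump_stacks.py", with the
output showing up on the target process's stdout):

import sys
import threading
import traceback

# Map thread ids to thread names so the dump is easier to read.
names = {t.ident: t.name for t in threading.enumerate()}

# sys._current_frames() returns the current stack frame of every thread
# in the process this payload is injected into.
for thread_id, frame in sys._current_frames().items():
    print("--- Thread %s (%s) ---" % (thread_id, names.get(thread_id, "?")))
    print("".join(traceback.format_stack(frame)))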
The relevant docs:
http://pyrasite.readthedocs.io/en/latest/Shell.html
http://pyrasite.readthedocs.io/en/latest/Payloads.html

Obviously this is only useful if the process is still around, not in cases of
thread death. There are some potential issues depending on your platform
(http://pyrasite.readthedocs.io/en/latest/Installing.html), and additionally
check for database deadlocks in your setup. If we can confirm that a thread
doesn't simply disappear or become a zombie, then we can develop a payload
together that attempts to identify the cause.

By the way, I remember that the scheduler used to spawn only one or three
processes, but I may be wrong. Right now when I start it, it spawns 7 separate
processes for the scheduler (8 in total), with some additional ones spawned
when the DAG file processor starts. I'm going for a walk now in beautiful
weather, but I intend to research this further. Can anyone confirm that it
should indeed be that many, and explain why that happens?

Rgds,
Gerard

On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> In case you *think* you have encountered a scheduler *hang*, please provide
> a strace on the parent process, provide process list output that shows
> defunct scheduler processes, and provide *all* logging (main logs,
> scheduler processing log, task logs), preferably in debug mode
> (settings.py). Also show memory limits, cpu count and airflow.cfg.
>
> Thanks
> Bolke
>
>
> > On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > Please specify what “stop doing its job” means. It doesn’t log anything
> > anymore? If it does, the scheduler hasn’t died and hasn’t stopped.
> >
> > B.
> >
> >
> >> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmag...@gmail.com> wrote:
> >>
> >> We encountered the same kind of problem with the scheduler that stopped
> >> doing its job even after rebooting. I thought changing the start date
> >> or the state of a task instance might be to blame, but I've never been
> >> able to pinpoint the problem either.
> >>
> >> We are using celery and docker if it helps.
> >>
> >> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>
> >>> We are running *without* num runs for over a year (and never have).
> >>> It is a very elusive issue which has not been reproducible.
> >>>
> >>> I'd like more info on this, but it needs to be very elaborate, even
> >>> to the point of access to the system exposing the behavior.
> >>>
> >>> Bolke
> >>>
> >>> Sent from my iPhone
> >>>
> >>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
> >>>>
> >>>> We literally have a cron job that restarts the scheduler every 30 min.
> >>>> Num runs didn't work consistently in rc4: sometimes it would restart
> >>>> itself and sometimes we'd end up with a few zombie scheduler processes
> >>>> and things would get stuck. Also running locally, without celery.
> >>>>
> >>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
> >>>>>
> >>>>> We have max runs set and still hit this. Our solution is dumber:
> >>>>> monitoring log output, and killing the scheduler if it stops
> >>>>> emitting. Works like a charm.
> >>>>>
> >>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Some solutions to this problem are restarting the scheduler
> >>>>>> frequently or some sort of monitoring on the scheduler.
> >>>>>> We have set up a dag that pings cronitor <https://cronitor.io/>
> >>>>>> (a dead man's snitch type of service) every 10 minutes, and the
> >>>>>> snitch pages you when the scheduler dies and does not send a ping
> >>>>>> to it.
> >>>>>>
> >>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> We use celery and run into it from time to time.
> >>>>>>>
> >>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
> >>>>>>> cause...
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> ap
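P.S. The cronitor-pinging dag mentioned in the quoted thread above could be as
small as something along these lines (a rough sketch only; the monitor URL,
dag id and task id are placeholders, not taken from this thread):

# canary dag: if the scheduler stops scheduling, the pings stop and the
# external monitor alerts on the missed check-ins.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2017, 3, 1),
}

dag = DAG(
    dag_id="scheduler_canary",
    default_args=default_args,
    schedule_interval=timedelta(minutes=10),
    catchup=False,
)

ping = BashOperator(
    task_id="ping_monitor",
    # Placeholder URL: point this at your own cronitor (or similar) monitor.
    bash_command="curl -fsS https://cronitor.link/YOUR-MONITOR-ID/complete",
    dag=dag,
)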