Is this:
1. On 1.8.0? 1.7.1 is not supported anymore.
2. The LocalExecutor?
You are running with N=10; can you try running without it?

B.

Sent from my iPhone

> On 27 Mar 2017, at 10:28, Nicholas Hodgkinson
> <nik.hodgkin...@collectivehealth.com> wrote:
>
> Ok, I'm not sure how helpful this is and I'm working on getting some more
> information, but here's some preliminary data:
>
> Process tree (`ps axjf`):
>
>        1  2391  2391  2391 ?  -1 Ssl  999  0:13 /usr/bin/python /usr/local/bin/airflow scheduler -n 10
>     2391  2435  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2436  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2437  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2438  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2439  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2440  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2441  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2442  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2443  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2444  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2454  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2456  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2457  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2458  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2459  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2460  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2461  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2462  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2463  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2464  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2465  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>     2391  2466  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
>
> # gdb python 2391
> Reading symbols from python...Reading symbols from
> /usr/lib/debug//usr/bin/python2.7...done.
> done.
> Attaching to program: /usr/bin/python, process 2391
> Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from
> /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x00007f0c1bbb9670 in ?? ()
> (gdb) bt
> #0  0x00007f0c1bbb9670 in ?? ()
> #1  0x00007f0c1bf1a000 in ?? ()
> #2  0x00007f0c12099b45 in ?? ()
> #3  0x00000000032dbe00 in ?? ()
> #4  0x0000000000000000 in ?? ()
> (gdb) py-bt
> (gdb) py-list
> Unable to locate python frame
>
> I know that's not super helpful, but it's information; I've also tried
> pyrasite and got nothing of any use from it. This problem occurs for me
> very often, and I'm happy to provide a modified environment in which to
> capture info if anyone has a suggestion. For now I need to restart my
> process and get my jobs running again.
>
> -N
> nik.hodgkin...@collectivehealth.com
>
> On Sun, Mar 26, 2017 at 7:48 AM, Gerard Toonstra <gtoons...@gmail.com> wrote:
>
>>> By the way, I remember that the scheduler would only spawn one or three
>>> processes, but I may be wrong. Right now when I start, it spawns 7
>>> separate processes for the scheduler (8 total), with some additional
>>> ones spawned when the DAG file processor starts.
>>
>> These other processes were executor processes. Hopefully, with the tips
>> below, someone who's getting this error regularly can attach, dump the
>> thread stack, and we'll see what's going on.
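A short aside on the `<defunct>` entries in the process tree above: a zombie is a child process that has already exited but whose exit status the parent has never collected with a wait call. Twenty-plus zombies under the scheduler's PID suggest its main loop is stuck before whatever step normally reaps its workers. A minimal sketch (illustrative, not Airflow code) of how a zombie appears in the process table and how `os.waitpid()` clears it:

```python
import os
import time

# Fork a child that exits immediately. Until the parent calls waitpid(),
# the child lingers in the process table in state "Z" (<defunct>),
# exactly like the scheduler children in the ps output above.
pid = os.fork()
if pid == 0:
    os._exit(0)          # child: exit without doing anything

time.sleep(0.5)          # give the child time to exit

# On Linux, the third field of /proc/<pid>/stat is the process state.
with open("/proc/%d/stat" % pid) as f:
    state = f.read().split()[2]
print("state before reaping:", state)   # expected: Z

# Reaping the child removes the zombie entry from the process table.
reaped, status = os.waitpid(pid, 0)
print("reaped child:", reaped == pid)
```

A parent that never reaches its reaping step (because it is blocked or deadlocked) accumulates zombies indefinitely, which is consistent with the process tree shown above.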
>>
>> Rgds,
>>
>> Gerard
>>
>>> On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>
>>>> In case you *think* you have encountered a scheduler *hang*, please
>>>> provide a strace on the parent process, provide process list output
>>>> that shows defunct scheduler processes, and provide *all* logging
>>>> (main logs, scheduler processing log, task logs), preferably in debug
>>>> mode (settings.py). Also show memory limits, CPU count, and airflow.cfg.
>>>>
>>>> Thanks,
>>>> Bolke
>>>>
>>>>> On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbr...@gmail.com> wrote:
>>>>>
>>>>> Please specify what "stop doing its job" means. It doesn't log
>>>>> anything anymore? If it does, the scheduler hasn't died and hasn't
>>>>> stopped.
>>>>>
>>>>> B.
>>>>>
>>>>>> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmag...@gmail.com> wrote:
>>>>>>
>>>>>> We encountered the same kind of problem with the scheduler: it
>>>>>> stopped doing its job even after rebooting. I thought changing the
>>>>>> start date or the state of a task instance might be to blame, but
>>>>>> I've never been able to pinpoint the problem either.
>>>>>>
>>>>>> We are using Celery and Docker, if that helps.
>>>>>>
>>>>>> Le sam. 25 mars 2017 à 01:53, Bolke de Bruin <bdbr...@gmail.com> a écrit :
>>>>>>
>>>>>>> We are running *without* num runs for over a year (and never have).
>>>>>>> It is a very elusive issue which has not been reproducible.
>>>>>>>
>>>>>>> I'd like more info on this, but it needs to be very elaborate, even
>>>>>>> to the point of access to the system exposing the behavior.
>>>>>>>
>>>>>>> Bolke
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
>>>>>>>>
>>>>>>>> We literally have a cron job that restarts the scheduler every 30
>>>>>>>> min. Num runs didn't work consistently in rc4; sometimes it would
>>>>>>>> restart itself, and sometimes we'd end up with a few zombie
>>>>>>>> scheduler processes and things would get stuck. Also running
>>>>>>>> locally, without Celery.
>>>>>>>>
>>>>>>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
>>>>>>>>>
>>>>>>>>> We have max runs set and still hit this. Our solution is dumber:
>>>>>>>>> monitor the log output, and kill the scheduler if it stops
>>>>>>>>> emitting. Works like a charm.
>>>>>>>>>
>>>>>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu
>>>>>>>>>> <fhakan.ko...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Some solutions to this problem are restarting the scheduler
>>>>>>>>>> frequently or some sort of monitoring on the scheduler. We have
>>>>>>>>>> set up a DAG that pings cronitor <https://cronitor.io/> (a dead
>>>>>>>>>> man's snitch type of service) every 10 minutes, and the snitch
>>>>>>>>>> pages you when the scheduler dies and does not send a ping to it.
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips
>>>>>>>>>> <aphill...@qrmedia.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>> We use celery and run into it from time to time.
>>>>>>>>>>>
>>>>>>>>>>> Bang goes my theory ;-) At least, assuming it's the same
>>>>>>>>>>> underlying cause...
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> ap
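Several of the workarounds in the thread above boil down to "watch the scheduler and kill it when it goes quiet." A hypothetical sketch of the log-monitoring variant; the paths, threshold, and pidfile location are illustrative assumptions, not details from this thread:

```python
import os
import signal
import time

LOG_PATH = "/var/log/airflow/scheduler.log"    # assumed log location
PID_FILE = "/var/run/airflow-scheduler.pid"    # assumed pidfile location
STALE_AFTER = 300                              # seconds of silence tolerated

def log_age(path):
    """Seconds since the log file was last written to."""
    return time.time() - os.path.getmtime(path)

def check_and_kill(log_path=LOG_PATH, pid_file=PID_FILE,
                   stale_after=STALE_AFTER):
    """Send SIGTERM to the scheduler if its log has gone quiet.

    Returns True if a signal was sent. A supervisor (systemd, runit, or
    a cron-based restart like the one mentioned above) is expected to
    bring the scheduler back up afterwards.
    """
    if log_age(log_path) < stale_after:
        return False                  # log is still moving; do nothing
    with open(pid_file) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGTERM)
    return True
```

Run from cron every few minutes, this is the same idea as the unconditional 30-minute restart described above, but it only intervenes when the log actually stops moving.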