1.7.1.3; however, according to other posters this still seems to be an issue in 1.8. I'll upgrade today. Yes, LocalExecutor. Will remove -n 10.
-N
nik.hodgkin...@collectivehealth.com

On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> Is this:
>
> 1. On 1.8.0? 1.7.1 is not supported anymore.
> 2. localexecutor?
>
> You are running with N=10, can you try running without it?
>
> B.
>
> Sent from my iPhone
>
> > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson
> > <nik.hodgkinson@collectivehealth.com> wrote:
> >
> > Ok, I'm not sure how helpful this is and I'm working on getting some more
> > information, but here's some preliminary data:
> >
> > Process tree (`ps axjf`):
> >     1  2391  2391  2391 ?  -1 Ssl  999  0:13 /usr/bin/python usr/local/bin/airflow scheduler -n 10
> >  2391  2435  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2436  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2437  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2438  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2439  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2440  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2441  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2442  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2443  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2444  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2454  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2456  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2457  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2458  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2459  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2460  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2461  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2462  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2463  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2464  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2465  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >  2391  2466  2391  2391 ?  -1 Z    999  0:00  \_ [/usr/bin/python] <defunct>
> >
> > # gdb python 2391
> > Reading symbols from python...Reading symbols from
> > /usr/lib/debug//usr/bin/python2.7...done.
> > done.
> > Attaching to program: /usr/bin/python, process 2391
> > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from
> > /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > done.
> > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > 0x00007f0c1bbb9670 in ?? ()
> > (gdb) bt
> > #0 0x00007f0c1bbb9670 in ?? ()
> > #1 0x00007f0c1bf1a000 in ?? ()
> > #2 0x00007f0c12099b45 in ?? ()
> > #3 0x00000000032dbe00 in ?? ()
> > #4 0x0000000000000000 in ?? ()
> > (gdb) py-bt
> > (gdb) py-list
> > Unable to locate python frame
> >
> > I know that's not super helpful, but it's information; I've also tried
> > pyrasite, but got nothing from it of any use. This problem occurs for me
> > very often and I'm happy to provide a modified environment in which to
> > capture info if anyone has a suggestion. For now I need to restart my
> > process and get my jobs running again.
> >
> > -N
> > nik.hodgkin...@collectivehealth.com
> >
> >
> > On Sun, Mar 26, 2017 at 7:48 AM, Gerard Toonstra <gtoons...@gmail.com>
> > wrote:
> >
> >>>
> >>>
> >>> By the way, I remember that the scheduler would only spawn one or three
> >>> processes, but I may be wrong.
> >>> Right now when I start, it spawns 7 separate processes for the scheduler
> >>> (8 total) with some additional
> >>> ones spawned when the dag file processor starts.
> >>>
> >>>
> >> These other processes were executor processes.
> >> Hopefully with the tips below someone who's getting this error
> >> regularly can attach and dump the thread stack and we see what's going on.
> >>
> >> Rgds,
> >>
> >> Gerard
> >>
> >>
> >>>
> >>> On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>
> >>>> In case you *think* you have encountered a schedule *hang*, please provide
> >>>> a strace on the parent process, provide process list output that shows
> >>>> defunct scheduler processes, and provide *all* logging (main logs,
> >>>> scheduler processing log, task logs), preferably in debug mode
> >>>> (settings.py). Also show memory limits, cpu count and airflow.cfg.
> >>>>
> >>>> Thanks
> >>>> Bolke
> >>>>
> >>>>
> >>>>> On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>
> >>>>> Please specify what “stop doing its job” means. It doesn’t log anything
> >>>>> anymore? If it does, the scheduler hasn’t died and hasn’t stopped.
> >>>>>
> >>>>> B.
> >>>>>
> >>>>>
> >>>>>> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmag...@gmail.com> wrote:
> >>>>>>
> >>>>>> We encountered the same kind of problem with the scheduler that stopped
> >>>>>> doing its job even after rebooting. I thought changing the start date or
> >>>>>> the state of a task instance might be to blame but I've never been able to
> >>>>>> pinpoint the problem either.
> >>>>>>
> >>>>>> We are using celery and docker if it helps.
> >>>>>>
> >>>>>> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >>>>>>
> >>>>>>> We are running *without* num runs for over a year (and never have).
> >>>>>>> It is a very elusive issue which has not been reproducible.
> >>>>>>>
> >>>>>>> I'd like more info on this but it needs to be very elaborate even to the
> >>>>>>> point of access to the system exposing the behavior.
> >>>>>>>
> >>>>>>> Bolke
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
> >>>>>>>>
> >>>>>>>> We literally have a cron job that restarts the scheduler every 30 min. Num
> >>>>>>>> runs didn't work consistently in rc4, sometimes it would restart itself and
> >>>>>>>> sometimes we'd end up with a few zombie scheduler processes and things
> >>>>>>>> would get stuck. Also running locally, without celery.
> >>>>>>>>
> >>>>>>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
> >>>>>>>>>
> >>>>>>>>> We have max runs set and still hit this. Our solution is dumber:
> >>>>>>>>> monitor log output, and kill the scheduler if it stops emitting. Works
> >>>>>>>>> like a charm.
> >>>>>>>>>
> >>>>>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Some solutions to this problem are restarting the scheduler frequently or
> >>>>>>>>>> some sort of monitoring on the scheduler. We have set up a dag that pings
> >>>>>>>>>> cronitor <https://cronitor.io/> (a dead man's snitch type of service) every
> >>>>>>>>>> 10 minutes and the snitch pages you when the scheduler dies and does not
> >>>>>>>>>> send a ping to it.
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> We use celery and run into it from time to time.
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
> >>>>>>>>>>> cause...
> >>>>>>>>>>>
> >>>>>>>>>>> Regards
> >>>>>>>>>>>
> >>>>>>>>>>> ap
> >>>>>>>>>>>
> >
> > --
> >
> > Read our founder's story.
> > <https://collectivehealth.com/blog/started-collective-health/>
> >
> > *This message may contain confidential, proprietary, or protected
> > information. If you are not the intended recipient, you may not review,
> > copy, or distribute this message. If you received this message in error,
> > please notify the sender by reply email and delete this message.*
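
For anyone wanting to try the log-watching workaround mentioned in this thread ("monitoring log output, and kill the scheduler if it stops emitting"), below is a minimal sketch of one way it could look. It is not the poster's actual setup. It assumes the scheduler writes to a single log file whose modification time advances while the scheduler is healthy, and that something else (cron, systemd, supervisor) restarts the scheduler after it is killed. The function names, the 300-second threshold, and the log path in the usage note are all hypothetical.

```python
import os
import signal
import time


def seconds_since_last_write(log_path, now=None):
    """Return how many seconds ago log_path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def scheduler_looks_hung(log_path, max_silence_seconds, now=None):
    """True if the log has been silent for longer than the allowed window."""
    return seconds_since_last_write(log_path, now) > max_silence_seconds


def kill_if_hung(pid, log_path, max_silence_seconds=300, kill=os.kill):
    """Send SIGTERM to pid if its log has gone quiet.

    The threshold is an assumption; tune it to how often your scheduler
    normally logs. `kill` is injectable so the check can be dry-run.
    Returns True if a signal was sent, False otherwise.
    """
    if scheduler_looks_hung(log_path, max_silence_seconds):
        kill(pid, signal.SIGTERM)
        return True
    return False
```

A call such as `kill_if_hung(2391, "/path/to/scheduler.log")` could be run every few minutes from cron; the PID and path here are placeholders. Note this only detects a silent log, so it would not catch a scheduler that keeps logging but has stopped scheduling tasks.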