1.8: increasing DAGBAG_IMPORT_TIMEOUT helps. I don't see the issue (although I'm not sure why task progress has become slow; that's not the issue we are discussing here, so I'm ignoring it for now).
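For anyone looking for that knob, a minimal sketch of where it lives, assuming a stock airflow.cfg; the value shown is only an illustrative starting point, not a recommendation from this thread:

    [core]
    # Maximum number of seconds the DagBag may spend importing a single
    # DAG file before the import is abandoned.
    dagbag_import_timeout = 60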
1.7: our prod is running 1.7 and we haven't seen the "defunct process" issue for more than a week now. But we saw something very close to what Nicholas provided (localexecutor; we do not use --num-runs). Not sure if a cpu/memory limit may lead to this issue. Often when we hit this issue (which stalled the pipeline), we either increased the memory and/or moved airflow to a bulkier (cpu) instance. Sorry for the late reply; I was out of town over the weekend.

On Mon, Mar 27, 2017 at 10:47 AM, Nicholas Hodgkinson <nik.hodgkin...@collectivehealth.com> wrote:

> 1.7.1.3, however it seems this is still an issue in 1.8 according to other posters. I'll upgrade today.
> Yes, localexecutor.
> Will remove -n 10
>
> -N
> nik.hodgkin...@collectivehealth.com
>
> On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> > Is this:
> >
> > 1. On 1.8.0? 1.7.1 is not supported anymore.
> > 2. localexecutor?
> >
> > You are running with N=10; can you try running without it?
> >
> > B.
> >
> > Sent from my iPhone
> >
> > > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson <nik.hodgkinson@collectivehealth.com> wrote:
> > >
> > > Ok, I'm not sure how helpful this is and I'm working on getting some more information, but here's some preliminary data:
> > >
> > > Process tree (`ps axjf`):
> > > 1 2391 2391 2391 ? -1 Ssl 999 0:13 /usr/bin/python /usr/local/bin/airflow scheduler -n 10
> > > 2391 2435 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2436 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2437 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2438 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2439 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2440 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2441 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2442 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2443 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2444 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2454 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2456 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2457 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2458 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2459 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2460 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2461 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2462 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2463 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2464 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2465 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2466 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > >
> > > # gdb python 2391
> > > Reading symbols from python...Reading symbols from /usr/lib/debug//usr/bin/python2.7...done.
> > > done.
> > > Attaching to program: /usr/bin/python, process 2391
> > > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > > done.
> > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > 0x00007f0c1bbb9670 in ?? ()
> > > (gdb) bt
> > > #0 0x00007f0c1bbb9670 in ?? ()
> > > #1 0x00007f0c1bf1a000 in ?? ()
> > > #2 0x00007f0c12099b45 in ?? ()
> > > #3 0x00000000032dbe00 in ?? ()
> > > #4 0x0000000000000000 in ?? ()
> > > (gdb) py-bt
> > > (gdb) py-list
> > > Unable to locate python frame
> > >
> > > I know that's not super helpful, but it's information; I've also tried pyrasite, but got nothing from it of any use. This problem occurs for me very often, and I'm happy to provide a modified environment in which to capture info if anyone has a suggestion. For now I need to restart my process and get my jobs running again.
> > >
> > > -N
> > > nik.hodgkin...@collectivehealth.com
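One option for capturing more useful state from a hang like the one above, not mentioned in the thread itself, is to have the scheduler process register a stack-dumping signal handler ahead of time, so a Python-level traceback can be taken even when gdb's py-bt finds no frame. A minimal sketch, assuming the faulthandler backport is installed on Python 2.7 (it is part of the standard library from Python 3.3) and that the snippet is imported by a module the scheduler loads at startup, for example airflow_local_settings.py if your version picks that up; the module name and the choice of SIGUSR1 are illustrative assumptions:

# faulthandler_setup.py -- illustrative sketch; import this from a module the
# scheduler process loads at startup (e.g. airflow_local_settings.py).
import signal
import sys

import faulthandler  # pip install faulthandler on Python 2.7; stdlib on 3.3+

# On SIGUSR1, write the traceback of every Python thread to stderr without
# killing the process:  kill -USR1 <scheduler pid>
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

With that in place, the next time the scheduler goes quiet a single kill -USR1 against the parent PID yields Python stacks to attach alongside the strace output and logs requested further down the thread.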
> > > On Sun, Mar 26, 2017 at 7:48 AM, Gerard Toonstra <gtoons...@gmail.com> wrote:
> > >
> > >>> By the way, I remember that the scheduler would only spawn one or three processes, but I may be wrong. Right now when I start, it spawns 7 separate processes for the scheduler (8 total), with some additional ones spawned when the dag file processor starts.
> > >>>
> > >> These other processes were executor processes. Hopefully, with the tips below, someone who's getting this error regularly can attach and dump the thread stack, and we'll see what's going on.
> > >>
> > >> Rgds,
> > >>
> > >> Gerard
> > >>
> > >>> On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> > >>>
> > >>>> In case you *think* you have encountered a scheduler *hang*, please provide an strace on the parent process, provide process list output that shows defunct scheduler processes, and provide *all* logging (main logs, scheduler processing log, task logs), preferably in debug mode (settings.py). Also show memory limits, cpu count and airflow.cfg.
> > >>>>
> > >>>> Thanks
> > >>>> Bolke
> > >>>>
> > >>>>> On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbr...@gmail.com> wrote:
> > >>>>>
> > >>>>> Please specify what "stop doing its job" means. It doesn't log anything anymore? If it does, the scheduler hasn't died and hasn't stopped.
> > >>>>>
> > >>>>> B.
> > >>>>>
> > >>>>>> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmag...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> We encountered the same kind of problem with the scheduler, which stopped doing its job even after rebooting. I thought changing the start date or the state of a task instance might be to blame, but I've never been able to pinpoint the problem either.
> > >>>>>>
> > >>>>>> We are using celery and docker, if it helps.
> > >>>>>>
> > >>>>>> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin <bdbr...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> We are running *without* num runs for over a year (and never have). It is a very elusive issue which has not been reproducible.
> > >>>>>>>
> > >>>>>>> I'd like more info on this, but it needs to be very elaborate, even to the point of access to the system exposing the behavior.
> > >>>>>>>
> > >>>>>>> Bolke
> > >>>>>>>
> > >>>>>>> Sent from my iPhone
> > >>>>>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
> > >>>>>>>>
> > >>>>>>>> We literally have a cron job that restarts the scheduler every 30 min. Num runs didn't work consistently in rc4; sometimes it would restart itself, and sometimes we'd end up with a few zombie scheduler processes and things would get stuck. Also running locally, without celery.
> > >>>>>>>>
> > >>>>>>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> We have max runs set and still hit this. Our solution is dumber: monitoring log output, and killing the scheduler if it stops emitting. Works like a charm.
> > >>>>>>>>>
> > >>>>>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Some solutions to this problem are restarting the scheduler frequently or some sort of monitoring on the scheduler. We have set up a dag that pings cronitor <https://cronitor.io/> (a dead man's snitch type of service) every 10 minutes, and the snitch pages you when the scheduler dies and does not send a ping to it.
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> We use celery and run into it from time to time.
> > >>>>>>>>>>>>
> > >>>>>>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying cause...
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards
> > >>>>>>>>>>>
> > >>>>>>>>>>> ap
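The canary approach described above (a dag that pings a dead man's snitch) can be sketched roughly as below, written against the 1.8-era API. The dag id, the 10-minute interval, and the placeholder ping URL are illustrative assumptions, not taken from the thread; cronitor and similar services issue a unique URL per monitor.

# scheduler_canary.py -- rough sketch of a dead man's snitch canary DAG.
# The dag id, interval, and ping URL below are placeholders.
from datetime import datetime, timedelta

import requests

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def ping_snitch():
    # Replace with the unique ping URL issued by your monitoring service.
    requests.get("https://cronitor.link/<your-monitor-id>/complete", timeout=10)


dag = DAG(
    dag_id="scheduler_canary",
    start_date=datetime(2017, 3, 1),
    schedule_interval=timedelta(minutes=10),
    catchup=False,  # available from 1.8; drop on 1.7, which has no catchup argument
)

PythonOperator(task_id="ping", python_callable=ping_snitch, dag=dag)

If the scheduler stalls, the task stops being scheduled, the pings stop, and the snitch service pages you; this checks the scheduler end to end rather than merely confirming that the process is still alive.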