1.8: increasing DAGBAG_IMPORT_TIMEOUT helps. I don't see the issue (although I'm not sure why task progress has become slow; that's not the issue we are discussing here, so I'm ignoring it for now).
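For anyone looking for that knob, a minimal sketch of where it lives, assuming a stock airflow.cfg; the value shown is only an illustrative starting point, not a recommendation from this thread:

    [core]
    # Maximum number of seconds the DagBag may spend importing a single
    # DAG file before the import is abandoned.
    dagbag_import_timeout = 60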
1.7: our prod is running 1.7 and we haven't seen the "defunct process" issue for more than a week now. But we saw something very close to what Nicholas provided (localexecutor; we do not use --num-runs). Not sure if a cpu/memory limit may lead to this issue. Often when we hit this issue (which stalled the pipeline), we either increased the memory and/or moved airflow to a bulkier (cpu) instance. Sorry for the late reply; I was out of town over the weekend.

On Mon, Mar 27, 2017 at 10:47 AM, Nicholas Hodgkinson <nik.hodgkin...@collectivehealth.com> wrote:

> 1.7.1.3, however it seems this is still an issue in 1.8 according to other posters. I'll upgrade today.
> Yes, localexecutor.
> Will remove -n 10
>
> -N
> nik.hodgkin...@collectivehealth.com
>
> On Mon, Mar 27, 2017 at 10:40 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
> > Is this:
> >
> > 1. On 1.8.0? 1.7.1 is not supported anymore.
> > 2. localexecutor?
> >
> > You are running with N=10; can you try running without it?
> >
> > B.
> >
> > Sent from my iPhone
> >
> > > On 27 Mar 2017, at 10:28, Nicholas Hodgkinson <nik.hodgkinson@collectivehealth.com> wrote:
> > >
> > > Ok, I'm not sure how helpful this is and I'm working on getting some more information, but here's some preliminary data:
> > >
> > > Process tree (`ps axjf`):
> > > 1 2391 2391 2391 ? -1 Ssl 999 0:13 /usr/bin/python /usr/local/bin/airflow scheduler -n 10
> > > 2391 2435 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2436 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2437 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2438 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2439 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2440 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2441 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2442 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2443 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2444 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2454 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2456 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2457 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2458 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2459 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2460 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2461 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2462 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2463 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2464 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2465 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > > 2391 2466 2391 2391 ? -1 Z 999 0:00 \_ [/usr/bin/python] <defunct>
> > >
> > > # gdb python 2391
> > > Reading symbols from python...Reading symbols from /usr/lib/debug//usr/bin/python2.7...done.
> > > done.
> > > Attaching to program: /usr/bin/python, process 2391
> > > Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/ld-2.19.so...done.
> > > done.
> > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > 0x00007f0c1bbb9670 in ?? ()
> > > (gdb) bt
> > > #0 0x00007f0c1bbb9670 in ?? ()
> > > #1 0x00007f0c1bf1a000 in ?? ()
> > > #2 0x00007f0c12099b45 in ?? ()
> > > #3 0x00000000032dbe00 in ?? ()
> > > #4 0x0000000000000000 in ?? ()
> > > (gdb) py-bt
> > > (gdb) py-list
> > > Unable to locate python frame
> > >
> > > I know that's not super helpful, but it's information; I've also tried pyrasite, but got nothing from it of any use. This problem occurs for me very often, and I'm happy to provide a modified environment in which to capture info if anyone has a suggestion. For now I need to restart my process and get my jobs running again.
> > >
> > > -N
> > > nik.hodgkin...@collectivehealth.com
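One option for capturing more useful state from a hang like the one above, not mentioned in the thread itself, is to have the scheduler process register a stack-dumping signal handler ahead of time, so a Python-level traceback can be taken even when gdb's py-bt finds no frame. A minimal sketch, assuming the faulthandler backport is installed on Python 2.7 (it is part of the standard library from Python 3.3) and that the snippet is imported by a module the scheduler loads at startup, for example airflow_local_settings.py if your version picks that up; the module name and the choice of SIGUSR1 are illustrative assumptions:

# faulthandler_setup.py -- illustrative sketch; import this from a module the
# scheduler process loads at startup (e.g. airflow_local_settings.py).
import signal
import sys

import faulthandler  # pip install faulthandler on Python 2.7; stdlib on 3.3+

# On SIGUSR1, write the traceback of every Python thread to stderr without
# killing the process:  kill -USR1 <scheduler pid>
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

With that in place, the next time the scheduler goes quiet a single kill -USR1 against the parent PID yields Python stacks to attach alongside the strace output and logs requested further down the thread.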
> > > On Sun, Mar 26, 2017 at 7:48 AM, Gerard Toonstra <gtoons...@gmail.com> wrote:
> > >
> > >>> By the way, I remember that the scheduler would only spawn one or three processes, but I may be wrong. Right now when I start, it spawns 7 separate processes for the scheduler (8 total), with some additional ones spawned when the dag file processor starts.
> > >>>
> > >> These other processes were executor processes. Hopefully, with the tips below, someone who's getting this error regularly can attach and dump the thread stack, and we'll see what's going on.
> > >>
> > >> Rgds,
> > >>
> > >> Gerard
> > >>
> > >>> On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
> > >>>
> > >>>> In case you *think* you have encountered a scheduler *hang*, please provide an strace on the parent process, provide process list output that shows defunct scheduler processes, and provide *all* logging (main logs, scheduler processing log, task logs), preferably in debug mode (settings.py). Also show memory limits, cpu count and airflow.cfg.
> > >>>>
> > >>>> Thanks
> > >>>> Bolke
> > >>>>
> > >>>>> On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbr...@gmail.com> wrote:
> > >>>>>
> > >>>>> Please specify what "stop doing its job" means. It doesn't log anything anymore? If it does, the scheduler hasn't died and hasn't stopped.
> > >>>>>
> > >>>>> B.
> > >>>>>
> > >>>>>> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmag...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> We encountered the same kind of problem with the scheduler, which stopped doing its job even after rebooting. I thought changing the start date or the state of a task instance might be to blame, but I've never been able to pinpoint the problem either.
> > >>>>>>
> > >>>>>> We are using celery and docker, if it helps.
> > >>>>>>
> > >>>>>> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin <bdbr...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> We are running *without* num runs for over a year (and never have). It is a very elusive issue which has not been reproducible.
> > >>>>>>>
> > >>>>>>> I'd like more info on this, but it needs to be very elaborate, even to the point of access to the system exposing the behavior.
> > >>>>>>>
> > >>>>>>> Bolke
> > >>>>>>>
> > >>>>>>> Sent from my iPhone
> > >>>>>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
> > >>>>>>>>
> > >>>>>>>> We literally have a cron job that restarts the scheduler every 30 min. Num runs didn't work consistently in rc4; sometimes it would restart itself, and sometimes we'd end up with a few zombie scheduler processes and things would get stuck. Also running locally, without celery.
> > >>>>>>>>
> > >>>>>>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> We have max runs set and still hit this. Our solution is dumber: monitoring log output, and killing the scheduler if it stops emitting. Works like a charm.
> > >>>>>>>>>
> > >>>>>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Some solutions to this problem are restarting the scheduler frequently or some sort of monitoring on the scheduler. We have set up a dag that pings cronitor <https://cronitor.io/> (a dead man's snitch type of service) every 10 minutes, and the snitch pages you when the scheduler dies and does not send a ping to it.
> > >>>>>>>>>>
> > >>>>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> We use celery and run into it from time to time.
> > >>>>>>>>>>>>
> > >>>>>>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying cause...
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards
> > >>>>>>>>>>>
> > >>>>>>>>>>> ap
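The canary approach described above (a dag that pings a dead man's snitch) can be sketched roughly as below, written against the 1.8-era API. The dag id, the 10-minute interval, and the placeholder ping URL are illustrative assumptions, not taken from the thread; cronitor and similar services issue a unique URL per monitor.

# scheduler_canary.py -- rough sketch of a dead man's snitch canary DAG.
# The dag id, interval, and ping URL below are placeholders.
from datetime import datetime, timedelta

import requests

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def ping_snitch():
    # Replace with the unique ping URL issued by your monitoring service.
    requests.get("https://cronitor.link/<your-monitor-id>/complete", timeout=10)


dag = DAG(
    dag_id="scheduler_canary",
    start_date=datetime(2017, 3, 1),
    schedule_interval=timedelta(minutes=10),
    catchup=False,  # available from 1.8; drop on 1.7, which has no catchup argument
)

PythonOperator(task_id="ping", python_callable=ping_snitch, dag=dag)

If the scheduler stalls, the task stops being scheduled, the pings stop, and the snitch service pages you; this checks the scheduler end to end rather than merely confirming that the process is still alive.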