What may be helpful for diving into this a bit more is "pyrasite". You need
gdb installed on the machine, but then you can attach to a running process
and use Python "payloads" to investigate what's going on, for example
dumping the stack trace of each thread:

http://pyrasite.readthedocs.io/en/latest/Shell.html

http://pyrasite.readthedocs.io/en/latest/Payloads.html
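
Just to make the idea concrete, a minimal payload could look roughly like
the sketch below (standard library only; the output path is an arbitrary
choice, and pyrasite's own dump_stacks payload does something similar). You
would run it with something like "pyrasite <pid> dump_threads.py":

    # dump_threads.py -- minimal payload sketch: write a stack trace for
    # every thread of the target process to a file (the injected code's
    # stdout belongs to the target process, so a file is easier to find).
    import sys
    import traceback

    with open('/tmp/airflow_scheduler_stacks.txt', 'w') as out:
        for thread_id, frame in sys._current_frames().items():
            out.write('\n--- thread %s ---\n' % thread_id)
            traceback.print_stack(frame, file=out)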

Obviously this is only useful if the process is still around; it won't help
in cases where the thread has already died.
There are some potential issues depending on your platform:

http://pyrasite.readthedocs.io/en/latest/Installing.html

Additionally, check for database deadlocks in your setup.
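
For that last point, a quick way to spot blocked queries -- this sketch
assumes a Postgres metadata database and psycopg2 installed, and the
connection parameters are placeholders for your setup -- is something like:

    # Rough sketch: list ungranted locks and the queries waiting on them.
    import psycopg2

    conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT l.pid, l.locktype, l.mode, a.state, a.query
            FROM pg_locks l
            JOIN pg_stat_activity a ON a.pid = l.pid
            WHERE NOT l.granted
        """)
        for row in cur.fetchall():
            print(row)
    conn.close()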

If we can confirm that a thread doesn't simply disappear or become a
zombie, then we can develop a payload together that attempts to identify
the cause.


By the way, I remember the scheduler only spawning one or three processes,
but I may be wrong. Right now when I start it, it spawns 7 separate
processes for the scheduler (8 in total), with some additional ones spawned
when the DAG file processor starts.
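
For reference, a quick way to see what the scheduler actually spawns --
assuming psutil is installed; the PID below is a placeholder for the real
"airflow scheduler" process -- is something like:

    # Rough sketch: list the child processes of the main scheduler process.
    import psutil

    scheduler_pid = 12345  # placeholder, replace with the real PID
    parent = psutil.Process(scheduler_pid)
    print("parent:", " ".join(parent.cmdline()))
    children = parent.children(recursive=True)
    for child in children:
        print(child.pid, child.status(), " ".join(child.cmdline()))
    print("total children:", len(children))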

I'm going for a walk now in beautiful weather, but I intend to research
this further. Can anyone confirm that it should indeed be that many, and
explain why that happens?

Rgds,

Gerard



On Sun, Mar 26, 2017 at 3:21 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

> In case you *think* you have encountered a scheduler *hang*, please provide
> a strace on the parent process, provide process list output that shows
> defunct scheduler processes, and provide *all* logging (main logs,
> scheduler processing log, task logs), preferably in debug mode
> (settings.py). Also show memory limits, cpu count and airflow.cfg.
>
> Thanks
> Bolke
>
>
> > On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbr...@gmail.com> wrote:
> >
> > Please specify what “stop doing its job” means. It doesn’t log anything
> anymore? If it does, the scheduler hasn’t died and hasn’t stopped.
> >
> > B.
> >
> >
> >> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmag...@gmail.com> wrote:
> >>
> >> We encountered the same kind of problem with the scheduler that stopped
> >> doing its job even after rebooting. I thought changing the start date or
> >> the state of a task instance might be to blame but I've never been able
> to
> >> pinpoint the problem either.
> >>
> >> We are using celery and docker if it helps.
> >>
> >> Le sam. 25 mars 2017 à 01:53, Bolke de Bruin <bdbr...@gmail.com> a
> écrit :
> >>
> >>> We are running *without* num runs for over a year (and never have). It
> is
> >>> a very elusive issue which has not been reproducible.
> >>>
> >>> I like more info on this but it needs to be very elaborate even to the
> >>> point of access to the system exposing the behavior.
> >>>
> >>> Bolke
> >>>
> >>> Sent from my iPhone
> >>>
> >>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
> >>>>
> >>>> We literally have a cron job that restarts the scheduler every 30 min.
> >>> Num
> >>>> runs didn't work consistently in rc4, sometimes it would restart
> itself
> >>> and
> >>>> sometimes we'd end up with a few zombie scheduler processes and things
> >>>> would get stuck. Also running locally, without celery.
> >>>>
> >>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
> >>>>>
> >>>>> We have max runs set and still hit this. Our solution is dumber:
> >>>>> monitoring log output, and kill the scheduler if it stops emitting.
> >>> Works
> >>>>> like a charm.
> >>>>>
> >>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com
> >
> >>>>> wrote:
> >>>>>>
> >>>>>> Some solutions to this problem is restarting the scheduler
> frequently
> >>> or
> >>>>>> some sort of monitoring on the scheduler. We have set up a dag that
> >>> pings
> >>>>>> cronitor <https://cronitor.io/> (a dead man's snitch type of
> service)
> >>>>> every
> >>>>>> 10 minutes and the snitch pages you when the scheduler dies and does
> >>> not
> >>>>>> send a ping to it.
> >>>>>>
> >>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <
> >>> aphill...@qrmedia.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> We use celery and run into it from time to time.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
> >>>>>>> cause...
> >>>>>>>
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> ap
> >>>>>>>
> >>>>>
> >>>
> >
>
>
