We have been running *without* num runs for over a year (and have never used it). It is a very elusive issue that we have not been able to reproduce.
I would like more info on this, but it needs to be very detailed, even to the point of access to a system that exhibits the behavior.

Bolke

Sent from my iPhone

> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
>
> We literally have a cron job that restarts the scheduler every 30 min. Num
> runs didn't work consistently in rc4, sometimes it would restart itself and
> sometimes we'd end up with a few zombie scheduler processes and things
> would get stuck. Also running locally, without celery.
>
>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
>>
>> We have max runs set and still hit this. Our solution is dumber:
>> monitoring log output, and kill the scheduler if it stops emitting. Works
>> like a charm.
>>
>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com> wrote:
>>>
>>> Some solutions to this problem is restarting the scheduler frequently or
>>> some sort of monitoring on the scheduler. We have set up a dag that pings
>>> cronitor <https://cronitor.io/> (a dead man's snitch type of service) every
>>> 10 minutes and the snitch pages you when the scheduler dies and does not
>>> send a ping to it.
>>>
>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com>
>>> wrote:
>>>
>>>>> We use celery and run into it from time to time.
>>>>>
>>>>
>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
>>>> cause...
>>>>
>>>> Regards
>>>>
>>>> ap
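
For reference, the log-watchdog workaround quoted above (kill the scheduler if it stops emitting) could look roughly like the sketch below. It is only an illustration, not what anyone in this thread actually runs: the function names, the 300-second default and the idea of calling it from cron are all assumptions, and it relies on a supervisor (cron, systemd, or similar) to start the scheduler again after it has been killed.

```python
import os
import signal
import time


def log_is_stale(log_path, max_silence_s, now=None):
    """Return True if log_path has not been written to for max_silence_s seconds."""
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(log_path)
    except OSError:
        # A missing or unreadable log is treated as "not emitting".
        return True
    return (now - last_write) > max_silence_s


def check_scheduler(log_path, pid, max_silence_s=300, kill=os.kill):
    """Send SIGTERM to pid if the log went quiet; return True if a signal was sent.

    The kill argument is injectable (hypothetical convenience) so the check
    can be exercised without terminating a real process.
    """
    if log_is_stale(log_path, max_silence_s):
        kill(pid, signal.SIGTERM)
        return True
    return False
```

Run from cron every few minutes with the scheduler's log path and PID, this only covers the "stopped emitting" case. It does not explain why the scheduler hangs, and it will not help if the scheduler keeps logging while stuck.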