In case you *think* you have encountered a scheduler *hang*, please provide an 
strace of the parent process, process list output that shows defunct 
scheduler processes, and *all* logging (main logs, scheduler processing 
log, task logs), preferably in debug mode (settings.py). Also include your 
memory limits, CPU count and airflow.cfg.
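
For reference, a rough sketch of how that could be gathered on Linux (not an 
official tool; it assumes strace, pgrep and coreutils' timeout are installed, 
a single "airflow scheduler" parent process, and enough privileges to ptrace it):

    import subprocess

    def collect_scheduler_diagnostics(strace_seconds=60):
        # Full process tree, so defunct scheduler children are visible.
        with open("ps_forest.txt", "w") as f:
            f.write(subprocess.run(["ps", "-ef", "--forest"],
                                   capture_output=True, text=True).stdout)

        # Oldest matching process, i.e. the parent scheduler PID.
        pid = subprocess.run(["pgrep", "-of", "airflow scheduler"],
                             capture_output=True, text=True).stdout.strip()

        # Attach strace for a while: -f follows children, -tt timestamps calls.
        subprocess.run(["timeout", str(strace_seconds), "strace", "-f", "-tt",
                        "-p", pid, "-o", "scheduler_strace.txt"])

    collect_scheduler_diagnostics()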

Thanks
Bolke


> On 25 Mar 2017, at 18:16, Bolke de Bruin <bdbr...@gmail.com> wrote:
> 
> Please specify what “stop doing its job” means. It doesn’t log anything 
> anymore? If it does, the scheduler hasn’t died and hasn’t stopped.
> 
> B.
> 
> 
>> On 24 Mar 2017, at 18:20, Gael Magnan <gaelmag...@gmail.com> wrote:
>> 
>> We encountered the same kind of problem: the scheduler stopped doing its
>> job, even after rebooting. I thought changing the start date or the state
>> of a task instance might be to blame, but I've never been able to pinpoint
>> the problem either.
>> 
>> We are using celery and docker, if that helps.
>> 
>> On Sat, 25 Mar 2017 at 01:53, Bolke de Bruin <bdbr...@gmail.com> wrote:
>> 
>>> We have been running *without* num runs for over a year (and never have
>>> used it). It is a very elusive issue which has not been reproducible.
>>> 
>>> I'd like more info on this, but it needs to be very elaborate, even to the
>>> point of access to the system exposing the behavior.
>>> 
>>> Bolke
>>> 
>>> Sent from my iPhone
>>> 
>>>> On 24 Mar 2017, at 16:04, Vijay Ramesh <vi...@change.org> wrote:
>>>> 
>>>> We literally have a cron job that restarts the scheduler every 30 min.
>>>> Num runs didn't work consistently in rc4: sometimes it would restart
>>>> itself and sometimes we'd end up with a few zombie scheduler processes
>>>> and things would get stuck. Also running locally, without celery.
>>>> 
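>>>> (For illustration only: a minimal Python 3 sketch of that kind of
>>>> periodic-restart wrapper, assuming the stock "airflow scheduler" CLI;
>>>> this is not the actual cron job, just the equivalent idea.)
>>>> 
>>>>     import subprocess
>>>> 
>>>>     RESTART_INTERVAL = 30 * 60  # seconds, matching the 30-minute cadence
>>>> 
>>>>     while True:
>>>>         scheduler = subprocess.Popen(["airflow", "scheduler"])
>>>>         try:
>>>>             # Give it one interval; it may also exit (or crash) on its own.
>>>>             scheduler.wait(timeout=RESTART_INTERVAL)
>>>>         except subprocess.TimeoutExpired:
>>>>             scheduler.terminate()   # recycle the scheduler process
>>>>             scheduler.wait()
>>>> 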
>>>>> On Mar 24, 2017 16:02, <lro...@quartethealth.com> wrote:
>>>>> 
>>>>> We have max runs set and still hit this. Our solution is dumber:
>>>>> monitoring log output and killing the scheduler if it stops emitting.
>>>>> Works like a charm.
>>>>> 
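>>>>> (A bare-bones sketch of such a watchdog, for illustration; it assumes
>>>>> Python 3 and pkill, and the log path and silence threshold below are
>>>>> made up:)
>>>>> 
>>>>>     import os, subprocess, time
>>>>> 
>>>>>     LOG_FILE = "/var/log/airflow/scheduler.log"   # hypothetical path
>>>>>     MAX_SILENCE = 5 * 60                          # seconds without output
>>>>> 
>>>>>     while True:
>>>>>         if time.time() - os.path.getmtime(LOG_FILE) > MAX_SILENCE:
>>>>>             # Scheduler went quiet: kill it so a supervisor restarts it.
>>>>>             subprocess.run(["pkill", "-f", "airflow scheduler"])
>>>>>         time.sleep(60)
>>>>> 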
>>>>>> On Mar 24, 2017, at 5:50 PM, F. Hakan Koklu <fhakan.ko...@gmail.com> wrote:
>>>>>> 
>>>>>> Some solutions to this problem are restarting the scheduler frequently
>>>>>> or some sort of monitoring on the scheduler. We have set up a dag that
>>>>>> pings cronitor <https://cronitor.io/> (a dead man's snitch type of
>>>>>> service) every 10 minutes, and the snitch pages you when the scheduler
>>>>>> dies and stops sending pings.
>>>>>> 
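>>>>>> (A hypothetical version of that heartbeat DAG, using 1.8-era imports;
>>>>>> the ping URL is a placeholder for whatever your monitor gives you:)
>>>>>> 
>>>>>>     from datetime import datetime, timedelta
>>>>>> 
>>>>>>     import requests
>>>>>>     from airflow import DAG
>>>>>>     from airflow.operators.python_operator import PythonOperator
>>>>>> 
>>>>>>     PING_URL = "https://cronitor.example/<your-monitor-id>"  # placeholder
>>>>>> 
>>>>>>     def ping():
>>>>>>         # Only fires when the scheduler is alive; silence triggers the page.
>>>>>>         requests.get(PING_URL, timeout=10)
>>>>>> 
>>>>>>     dag = DAG(
>>>>>>         dag_id="scheduler_heartbeat",
>>>>>>         start_date=datetime(2017, 1, 1),
>>>>>>         schedule_interval=timedelta(minutes=10),
>>>>>>     )
>>>>>> 
>>>>>>     PythonOperator(task_id="ping_cronitor", python_callable=ping, dag=dag)
>>>>>> 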
>>>>>>> On Fri, Mar 24, 2017 at 1:49 PM, Andrew Phillips <aphill...@qrmedia.com> wrote:
>>>>>>> 
>>>>>>>> We use celery and run into it from time to time.
>>>>>>>> 
>>>>>>> 
>>>>>>> Bang goes my theory ;-) At least, assuming it's the same underlying
>>>>>>> cause...
>>>>>>> 
>>>>>>> Regards
>>>>>>> 
>>>>>>> ap
>>>>>>> 
>>>>> 
>>> 
> 
