Re: Scheduler getting stuck - request for details

Bolke de Bruin Wed, 07 Sep 2016 14:19:02 -0700

> On 7 Sep 2016, at 21:37, Jeff Balogh <[email protected]> wrote:
> 
> On Wed, Sep 7, 2016 at 12:17 PM, Bolke de Bruin <[email protected]> wrote:
>> Ah this is the more interesting case. Are you getting tasks into SCHEDULED 
>> and then the scheduler itself gets stuck? Or do the workers not execute 
>> anything anymore?
> 
> The tasks are put into the SCHEDULED state but they don't make it to a
> worker. This isn't deterministic. With our patch to clean up orphans,
> a task may flap in SCHEDULED a few times but eventually it makes it to
> a worker.


Ok. So are you implying that the executor is not picking up the tasks, or that 
the queue is losing tasks? Are you able to find out what Redis is doing when a 
'scheduled' task is flapping, i.e. does it receive the task at all? Btw, what 
happened before the SCHEDULED state was added?

> 
> The scheduler and workers are otherwise running fine. We've been
> running with the same celery/redis setup for a year.
> 
>> How do you run your scheduler? With num_runs?
> 
> We don't use num_runs. We restart the scheduler when we deploy new code.
> 
>> A later patch checks for these “orphaned_tasks” at scheduler start up.
> 
> We check for the orphans at the top of the scheduler loop, so on every run.

Ok, we moved away from this for performance reasons. Depending on the solution 
to the issue above, we might need to run the check on every run after all.
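
For reference, the per-run orphan check discussed here could look roughly like
the sketch below. It is only an illustration of the idea (set every task that
is still SCHEDULED back to None at the top of the loop), not the actual patch
and not Airflow's API: the TaskInstance class, the SCHEDULED constant and both
function names are hypothetical stand-ins, and the real code would work on
database rows rather than in-memory objects.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Hypothetical stand-in for Airflow's SCHEDULED state value.
SCHEDULED = "scheduled"


@dataclass
class TaskInstance:
    """Hypothetical minimal stand-in for a task instance record."""
    task_id: str
    state: Optional[str]


def reset_orphaned_tasks(task_instances: Iterable[TaskInstance]) -> List[str]:
    """Set state to None for every task still SCHEDULED; return their ids."""
    reset_ids = []
    for ti in task_instances:
        if ti.state == SCHEDULED:
            ti.state = None  # makes the task eligible for scheduling again
            reset_ids.append(ti.task_id)
    return reset_ids


def scheduler_loop_once(task_instances: List[TaskInstance]) -> List[str]:
    """One pass of a (hypothetical) scheduler loop."""
    # Top of the loop: clear anything left in SCHEDULED by the previous pass.
    orphans = reset_orphaned_tasks(task_instances)
    # ... normal scheduling and queuing would follow here ...
    return orphans
```

Running this on every pass is cheap in the sketch but means a scan of all
SCHEDULED tasks per loop in a real deployment, which is the performance cost
mentioned above.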

> 
>> In other words can you provide some more information :-).
>> 
>> Bolke
>> 
>>> On 7 Sep 2016, at 20:08, Jeff Balogh <[email protected]> wrote:
>>> 
>>> Ah yep, we're on 
>>> https://github.com/apache/incubator-airflow/commits/54b361d2a.
>>> 
>>> On Wed, Sep 7, 2016 at 10:13 AM, Bolke de Bruin <[email protected]> wrote:
>>>> Hi Jeff,
>>>> 
>>>> That is kind of impossible for 1.7.1.3 as the SCHEDULED state was 
>>>> introduced after release. Are you sure you are on 1.7.1.3 and not on 
>>>> master?
>>>> 
>>>> Bolke
>>>> 
>>>>> On 7 Sep 2016, at 18:37, Jeff Balogh <[email protected]> wrote:
>>>>> 
>>>>> When we bumped to 1.7.1.3 we found that tasks would go into the new
>>>>> SCHEDULED state and get stuck there. We haven't determined why this
>>>>> happens.
>>>>> 
>>>>> We put a hacky patch into our scheduler that sets state to None for
>>>>> any tasks that are SCHEDULED at the beginning of the schedule loop.
>>>>> 
>>>>> Name: airflow
>>>>> Version: 1.7.1.3
>>>>> Name: celery
>>>>> Version: 3.1.23
>>>>> Name: kombu
>>>>> Version: 3.0.35
>>>>> 
>>>>> redis_version:2.6.13
>>>>> 
>>>>> On Sun, Sep 4, 2016 at 6:34 AM, Bolke de Bruin <[email protected]> wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> We have had some reports on this list and sometimes on Jira that the 
>>>>>> scheduler sometimes seems to get stuck. I would like to track down this 
>>>>>> issue, but until now much of the reporting has been a bit light on the 
>>>>>> details.
>>>>>> 
>>>>>> First and foremost I am assuming that getting “stuck” is only happening 
>>>>>> when using a CeleryExecutor. To further track down the issue I would 
>>>>>> like to know the following
>>>>>> 
>>>>>> - Airflow version (pip show airflow)
>>>>>> - Celery version (pip show celery)
>>>>>> - Kombu version (pip show kombu)
>>>>>> 
>>>>>> - Redis version (if applicable)
>>>>>> - RabbitMQ version (if applicable)
>>>>>> 
>>>>>> - Sanitized airflow configuration
>>>>>> - Sanitized broker configuration
>>>>>> 
>>>>>> If possible supply, preferably debug, logs of broker, scheduler and 
>>>>>> worker.
>>>>>> 
>>>>>> Thanks!
>>>>>> Bolke
>>>>>> 
>>>> 
>>
