Is this all with Celery? AFAIK Lyft runs with Celery? Also, if I remember correctly, the Google guys had a fix for this, but that hasn't been upstreamed yet?
With Celery, tasks do get "lost" after a while with a certain setting (I'm on a phone so I don't have it handy; we do set a higher default). Can you check when those tasks got into "scheduled" and what the time difference is with "now"?

B.

Sent from my iPhone

> On 31 Jul 2019, at 20:56, James Meickle <jmeic...@quantopian.com.invalid> wrote:
>
> Ash:
>
> We definitely don't run thousands of tasks. Looks like it's closer to 300
> per execution date (and then retries), if I'm using the TI browser right.
>
> In my case, I found 21 tasks in "scheduled" state after 1 day of not
> restarting. One of our hourly "canary" DAGs got included in the pile-up -
> so it didn't run that hour as expected, and I got paged. (But it wasn't just
> canary tasks; the other 20 tasks were all real and important workflows that
> were not getting scheduled.)
>
> If we do change the scheduler to have a "cleanup" step within the loop
> instead of pre/post loop, I'd suggest we should:
> - Make the time between cleanups a configurable parameter
> - Log what cleanup steps are being taken and how long they take
> - Add new statsd metrics around cleanups (like "number of orphans reset"),
>   to help us understand when and why this happens
>
>> On Wed, Jul 31, 2019 at 1:25 PM Tao Feng <fengta...@gmail.com> wrote:
>>
>> Late to the game, as I wasn't aware of the `run_duration` option being
>> removed. But I just want to point out that Lyft's setup is very similar to
>> James's: we run the scheduler for a fixed interval instead of an infinite
>> loop and let runit/supervisor restart the scheduler process. This is to
>> solve: 1. orphaned tasks not getting cleaned up successfully when the
>> scheduler runs in an infinite loop; 2. making sure stale/deleted DAGs get
>> cleaned up properly (
>> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
>> ).
>>
>> I think if we go ahead with removing this option and let the scheduler run
>> in an infinite loop, we need to change the scheduler loop to handle the
>> cleanup process, if that hasn't been done already.
>>
>>> On Wed, Jul 31, 2019 at 10:10 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>>>
>>> Thanks for testing this out James, shame to discover we still have
>>> problems in that area. Do you have an idea of how many tasks per day we
>>> are talking about here?
>>>
>>> Your cluster schedules quite a large number of tasks over the day (in the
>>> 1k-10k range?), right?
>>>
>>> I'd say whatever causes a task to become orphaned _while_ the scheduler
>>> is still running is the actual bug, and running the orphan detection more
>>> often may just be replacing one patch (the run duration) with another one
>>> (running the orphan detection more often than just at start-up).
>>>
>>> -ash
>>>
>>>> On 31 Jul 2019, at 16:43, James Meickle <jmeic...@quantopian.com.INVALID>
>>>> wrote:
>>>>
>>>> In my testing of 1.10.4rc3, I discovered that we were getting hit by a
>>>> process leak bug (which Ash has since fixed in 1.10.4rc4). This process
>>>> leak was minimal impact for most users, but was exacerbated in our case
>>>> by using "run_duration" to restart the scheduler every 10 minutes.
>>>>
>>>> To mitigate that issue while remaining on the RC, we removed the use of
>>>> "run_duration", since it is deprecated as of master anyway:
>>>> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
>>>>
>>>> Unfortunately, testing on our cluster (1.10.4rc3 plus no "run_duration")
>>>> has revealed that while the process leak issue was mitigated, we're now
>>>> facing issues with orphaned tasks. These tasks are marked as "scheduled"
>>>> by the scheduler, but _not_ successfully queued in Celery even after
>>>> multiple scheduler loops. Around 24h after the last restart, we start
>>>> having enough stuck tasks that the system starts paging and requires a
>>>> manual restart.
>>>>
>>>> Rather than generic "scheduler instability", this specific issue was one
>>>> of the reasons why we'd originally added the scheduler restart. But it
>>>> appears that on master, the orphaned task detection code still only runs
>>>> on scheduler start despite removing "run_duration":
>>>> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
>>>>
>>>> Rather than immediately filing an issue, I wanted to inquire a bit more
>>>> about why this orphan detection code is only run on scheduler start,
>>>> whether it would be safe to send in a PR to run it more often (maybe a
>>>> tunable parameter?), and whether there's some other configuration issue
>>>> with Celery (in our case, backed by AWS ElastiCache) that would cause us
>>>> to see orphaned tasks frequently.
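
For Bolke's question above about when those tasks got into "scheduled" and how far behind "now" they are, here is a minimal sketch of the kind of check that can be run against the Airflow metadata database, assuming 1.10.x. Note that the task_instance table has no column recording exactly when a task entered the "scheduled" state, so execution_date is only a rough proxy:

    # Rough sketch, not production code: list task instances stuck in "scheduled"
    # and how far behind "now" they are. Assumes Airflow 1.10.x, run from a Python
    # shell on a machine that has the Airflow config and metadata DB available.
    # There is no column recording exactly when the state flipped to "scheduled",
    # so execution_date is only a rough proxy for how long a task has been waiting.
    from airflow.models import TaskInstance
    from airflow.utils import timezone
    from airflow.utils.db import create_session
    from airflow.utils.state import State

    with create_session() as session:
        stuck = (
            session.query(TaskInstance)
            .filter(TaskInstance.state == State.SCHEDULED)
            .order_by(TaskInstance.execution_date)
            .all()
        )
        now = timezone.utcnow()
        for ti in stuck:
            print(ti.dag_id, ti.task_id, ti.execution_date,
                  "behind now by", now - ti.execution_date)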
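
And to make the "cleanup step within the loop" proposal concrete, a minimal sketch of what a configurable in-loop orphan reset could look like for SchedulerJob._execute_helper. reset_state_for_orphaned_tasks() is the existing BaseJob method that currently runs only once, before the loop starts; the [scheduler] orphan_cleanup_interval option and the "scheduler.orphans_reset" statsd metric are hypothetical names used only for illustration:

    # Sketch only, not a tested patch -- just the shape of the change being discussed.
    # Assumes Airflow 1.10.x internals: SchedulerJob inherits
    # reset_state_for_orphaned_tasks() from BaseJob, and _execute_helper() owns the
    # main scheduling loop. The [scheduler] orphan_cleanup_interval option and the
    # "scheduler.orphans_reset" statsd metric are hypothetical, not existing config.
    import time

    from airflow.configuration import conf
    from airflow.settings import Stats


    def maybe_reset_orphans(scheduler_job, last_cleanup_time):
        """Run orphan cleanup if the configured interval has elapsed.

        Returns the timestamp of the most recent cleanup so the caller can keep
        passing it back in on every pass through the scheduler loop.
        """
        interval = conf.getint('scheduler', 'orphan_cleanup_interval')  # hypothetical key
        if time.time() - last_cleanup_time < interval:
            return last_cleanup_time

        start = time.time()
        # Existing BaseJob method; today it runs only once, before the loop starts.
        num_reset = scheduler_job.reset_state_for_orphaned_tasks()
        Stats.gauge('scheduler.orphans_reset', num_reset)  # hypothetical metric name
        scheduler_job.log.info("Reset %s orphaned task instance(s) in %.2fs",
                               num_reset, time.time() - start)
        return time.time()

The existing loop would call last_cleanup = maybe_reset_orphans(self, last_cleanup) once per pass, which covers all three of James's suggestions: the interval is configurable, each cleanup is logged with its duration, and the number of orphans reset is emitted as a metric.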