Is this all with Celery? AFAIK Lyft runs with Celery? Also, if I remember correctly, the Google guys had a fix for this, but that hasn't been upstreamed yet?
With Celery, tasks do get "lost" after a while with a certain setting (I'm on a phone so I don't have it handy; we do set a higher default). Can you check when those tasks got into "scheduled" and what the time difference is with "now"?

B.

Sent from my iPhone

> On 31 Jul 2019, at 20:56, James Meickle <jmeic...@quantopian.com.invalid> wrote:
>
> Ash:
>
> We definitely don't run thousands of tasks. Looks like it's closer to 300
> per execution date (and then retries), if I'm using the TI browser right.
>
> In my case, I found 21 tasks in "scheduled" state after 1 day of not
> restarting. One of our hourly "canary" DAGs got included in the pile-up -
> so it didn't run that hour as expected, and I got paged. (But it wasn't just
> canary tasks; the other 20 tasks were all real and important workflows that
> were not getting scheduled.)
>
> If we do change the scheduler to have a "cleanup" step within the loop
> instead of pre/post loop, I'd suggest we should:
> - Make the time between cleanups a configurable parameter
> - Log what cleanup steps are being taken and how long they take
> - Add new statsd metrics around cleanups (like "number of orphans reset"),
>   to help us understand when and why this happens
>
>> On Wed, Jul 31, 2019 at 1:25 PM Tao Feng <fengta...@gmail.com> wrote:
>>
>> Late to the game, as I wasn't aware of the `run_duration` option being
>> removed. But I just want to point out that Lyft's setup is very similar to
>> James's: we run the scheduler for a fixed interval instead of an infinite
>> loop and let runit/supervisor restart the scheduler process. This is to
>> solve: 1. orphaned tasks not getting cleaned up successfully when the
>> scheduler runs in an infinite loop; 2. making sure stale/deleted DAGs get
>> cleaned up properly (
>> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
>> ).
>>
>> I think if we go ahead with removing this option and let the scheduler run
>> in an infinite loop, we need to change the scheduler loop to handle the
>> cleanup process, if that hasn't been done already.
>>
>>> On Wed, Jul 31, 2019 at 10:10 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>>>
>>> Thanks for testing this out James, shame to discover we still have
>>> problems in that area. Do you have an idea of how many tasks per day we
>>> are talking about here?
>>>
>>> Your cluster schedules quite a large number of tasks over the day (in the
>>> 1k-10k range?), right?
>>>
>>> I'd say whatever causes a task to become orphaned _while_ the scheduler
>>> is still running is the actual bug, and running the orphan detection more
>>> often may just be replacing one patch (the run duration) with another one
>>> (running the orphan detection more often than just at start-up).
>>>
>>> -ash
>>>
>>>> On 31 Jul 2019, at 16:43, James Meickle <jmeic...@quantopian.com.INVALID>
>>>> wrote:
>>>>
>>>> In my testing of 1.10.4rc3, I discovered that we were getting hit by a
>>>> process leak bug (which Ash has since fixed in 1.10.4rc4). This process
>>>> leak was minimal impact for most users, but was exacerbated in our case
>>>> by using "run_duration" to restart the scheduler every 10 minutes.
>>>>
>>>> To mitigate that issue while remaining on the RC, we removed the use of
>>>> "run_duration", since it is deprecated as of master anyway:
>>>> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
>>>>
>>>> Unfortunately, testing on our cluster (1.10.4rc3 plus no "run_duration")
>>>> has revealed that while the process leak issue was mitigated, we're now
>>>> facing issues with orphaned tasks. These tasks are marked as "scheduled"
>>>> by the scheduler, but _not_ successfully queued in Celery even after
>>>> multiple scheduler loops. Around 24h after the last restart, we start
>>>> having enough stuck tasks that the system starts paging and requires a
>>>> manual restart.
>>>>
>>>> Rather than generic "scheduler instability", this specific issue was one
>>>> of the reasons why we'd originally added the scheduler restart. But it
>>>> appears that on master, the orphaned task detection code still only runs
>>>> on scheduler start despite removing "run_duration":
>>>> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
>>>>
>>>> Rather than immediately filing an issue, I wanted to inquire a bit more
>>>> about why this orphan detection code is only run on scheduler start,
>>>> whether it would be safe to send in a PR to run it more often (maybe a
>>>> tunable parameter?), and whether there's some other configuration issue
>>>> with Celery (in our case, backed by AWS ElastiCache) that would cause us
>>>> to see orphaned tasks frequently.
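
For Bolke's question above about when those tasks got into "scheduled" and how far behind "now" they are, here is a minimal sketch of the kind of check that can be run against the Airflow metadata database, assuming 1.10.x. Note that the task_instance table has no column recording exactly when a task entered the "scheduled" state, so execution_date is only a rough proxy:

    # Rough sketch, not production code: list task instances stuck in "scheduled"
    # and how far behind "now" they are. Assumes Airflow 1.10.x, run from a Python
    # shell on a machine that has the Airflow config and metadata DB available.
    # There is no column recording exactly when the state flipped to "scheduled",
    # so execution_date is only a rough proxy for how long a task has been waiting.
    from airflow.models import TaskInstance
    from airflow.utils import timezone
    from airflow.utils.db import create_session
    from airflow.utils.state import State

    with create_session() as session:
        stuck = (
            session.query(TaskInstance)
            .filter(TaskInstance.state == State.SCHEDULED)
            .order_by(TaskInstance.execution_date)
            .all()
        )
        now = timezone.utcnow()
        for ti in stuck:
            print(ti.dag_id, ti.task_id, ti.execution_date,
                  "behind now by", now - ti.execution_date)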
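
And to make the "cleanup step within the loop" proposal concrete, a minimal sketch of what a configurable in-loop orphan reset could look like for SchedulerJob._execute_helper. reset_state_for_orphaned_tasks() is the existing BaseJob method that currently runs only once, before the loop starts; the [scheduler] orphan_cleanup_interval option and the "scheduler.orphans_reset" statsd metric are hypothetical names used only for illustration:

    # Sketch only, not a tested patch -- just the shape of the change being discussed.
    # Assumes Airflow 1.10.x internals: SchedulerJob inherits
    # reset_state_for_orphaned_tasks() from BaseJob, and _execute_helper() owns the
    # main scheduling loop. The [scheduler] orphan_cleanup_interval option and the
    # "scheduler.orphans_reset" statsd metric are hypothetical, not existing config.
    import time

    from airflow.configuration import conf
    from airflow.settings import Stats


    def maybe_reset_orphans(scheduler_job, last_cleanup_time):
        """Run orphan cleanup if the configured interval has elapsed.

        Returns the timestamp of the most recent cleanup so the caller can keep
        passing it back in on every pass through the scheduler loop.
        """
        interval = conf.getint('scheduler', 'orphan_cleanup_interval')  # hypothetical key
        if time.time() - last_cleanup_time < interval:
            return last_cleanup_time

        start = time.time()
        # Existing BaseJob method; today it runs only once, before the loop starts.
        num_reset = scheduler_job.reset_state_for_orphaned_tasks()
        Stats.gauge('scheduler.orphans_reset', num_reset)  # hypothetical metric name
        scheduler_job.log.info("Reset %s orphaned task instance(s) in %.2fs",
                               num_reset, time.time() - start)
        return time.time()

The existing loop would call last_cleanup = maybe_reset_orphans(self, last_cleanup) once per pass, which covers all three of James's suggestions: the interval is configurable, each cleanup is logged with its duration, and the number of orphans reset is emitted as a metric.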