Oh thanks for pointing this out, I just did a round of review on that PR. While we have people's attention around backfill on this thread, I'd love to introduce the new term "scheduler catchup" as something distinct from `backfill`, at least until we get a single code path for both operations.
Max

On Tue, Dec 6, 2016 at 8:56 AM, Bolke de Bruin <[email protected]> wrote:
> There is a PR out for not having backfills at all specified at the dag level
> as well.
>
> *From: *Maxime Beauchemin <[email protected]>
> *Sent: *Tuesday, December 6, 2016 16:54
> *To: *[email protected]
> *CC: *[email protected]
> *Subject: *Re: Performance: backfill --mark_success
>
> > The backfill `mark_success` logic could really be optimized by not relying
> > on `airflow run --mark_success` and instead altering the database state
> > directly, rather than actually triggering tasks and relying on the backfill
> > logic. Simply identify the set of task instances in scope, and merge
> > (upsert) a `success` state to the db directly.
> >
> > To accelerate it as it is today, though, you can reduce some of the
> > heartbeat settings (`job_heartbeat_sec`). It's usually desirable to
> > have this setting lower in dev (say 5 seconds) than in production (30-60
> > seconds).
> >
> > I suggest putting better defaults in place for `heartrate`, individually
> > configurable for the different types of jobs in `jobs.py`.
> >
> > Max
> >
> > On Mon, Dec 5, 2016 at 12:39 PM, Laura Lorenz <[email protected]> wrote:
> >
> > > This is not that helpful of a message, but I also was having a problem with
> > > `airflow backfill -m` on Airflow version 1.7.0 with it going super slow. In
> > > the end I got around the necessity in that specific case, thinking that it
> > > was broken in 1.7.0 re (
> > > https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee),
> > > but now that I am writing this and triangulating 1.7.0's release and that
> > > gitter comment, it doesn't make sense. I'll give it another go.
> > >
> > > On Sun, Dec 4, 2016 at 7:50 AM, Vikas Malhotra <[email protected]> wrote:
> > >
> > > > Hello Harish,
> > > >
> > > > Based on our understanding of Python Multiprocessing, a task instance gets
> > > > a record in the underlying database after there is an explicit call to
> > > > airflow from that library (using Local Executor). So, I might be wrong, but
> > > > you won't find a record in the database until and unless that task instance
> > > > has been initiated. I might be wrong in our assumptions and would love to be
> > > > corrected if that's the case.
> > > >
> > > > We have been using the latest only operator and it seems to be working well
> > > > for skipping tasks if they are not current (basically avoiding backfill by
> > > > marking all tasks below the latest only operator as skipped). It's present
> > > > in the master branch as of now and I would recommend you look at that
> > > > operator for backfill.
> > > >
> > > > Thanks!
> > > > Vikas
> > > >
> > > > On Dec 4, 2016 5:23 AM, "harish singh" <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We have been running Airflow in our production for over 8-9 months now.
> > > > I know there is a separate thread in place for Airflow 2.0.
> > > > But I was not sure if any of the prior versions has this fixed. If not, I
> > > > will add this to the other email thread for 2.0.
> > > >
> > > > When I run airflow backfill with "-m" (Mark jobs as succeeded without
> > > > running them),
> > > > is there a way to optimize this call?
> > > >
> > > > For example:
> > > > airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m
> > > >
> > > > Here, I am running backfill for a month (from 1st Nov to 1st Dec).
> > > > Essentially, marking the jobs as succeeded without running them.
> > > >
> > > > It has been more than an hour and the backfill has managed to reach only
> > > > up to 2nd Nov.
> > > > This seems to be very slow when there is no need to even run the tasks.
> > > >
> > > > I am running Airflow 1.7.0:
> > > > These are my related configuration settings:
> > > >
> > > > parallelism = 50
> > > > dag_concurrency = 20
> > > > max_active_runs_per_dag = 8
> > > >
> > > > Also, I have around 9 Dags running (all Hourly). The other 8 dags are
> > > > running as scheduled with start_date of 2016-11-01T00:00:00
> > > >
> > > > My question is, since I am only marking the jobs as "succeeded"
> > > > without running them,
> > > > can this be done in 1 sql query, instead of on a per hour, per task basis?
> > > > Maybe find out all the TaskInstances that need to be marked succeeded
> > > > and then just run a sql?
> > > >
> > > > I may not be aware of a lot of things here and it is very possible I am
> > > > assuming a lot of things incorrectly.
> > > > Please feel free to correct me.
> > > >
> > > > Thanks,
> > > > Harish
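
The idea raised in this thread, scoping the task instances for a date range and writing a `success` state in one database operation instead of once per task per hour, could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses an in-memory SQLite database and a simplified stand-in `task_instance` table, and the `mark_success` helper and the task names `extract` and `load` are hypothetical, not Airflow's actual code or schema.

```python
import sqlite3
from datetime import datetime, timedelta


def mark_success(conn, dag_id, task_ids, start, end, step=timedelta(hours=1)):
    """Upsert a 'success' row for every (task, execution date) from start to end inclusive.

    Hypothetical helper: it only illustrates the bulk-upsert idea.
    """
    rows = []
    current = start
    while current <= end:
        for task_id in task_ids:
            rows.append((task_id, dag_id, current.isoformat(), "success"))
        current += step
    # One transaction and one batched statement, no tasks are triggered.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO task_instance "
            "(task_id, dag_id, execution_date, state) VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Simplified stand-in table (an assumption, not the real Airflow schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    "task_id TEXT, dag_id TEXT, execution_date TEXT, state TEXT, "
    "PRIMARY KEY (task_id, dag_id, execution_date))"
)
# A pre-existing row in another state, to show that the upsert overwrites it.
conn.execute(
    "INSERT INTO task_instance VALUES "
    "('extract', 'TEST_DAG', '2016-11-01T00:00:00', 'failed')"
)

# One day of an hourly DAG with two tasks.
n = mark_success(
    conn,
    "TEST_DAG",
    ["extract", "load"],
    datetime(2016, 11, 1, 0),
    datetime(2016, 11, 1, 23),
)
succeeded = conn.execute(
    "SELECT COUNT(*) FROM task_instance WHERE state = 'success'"
).fetchone()[0]
print(n, succeeded)
```

Against a real deployment the same scoping would go through Airflow's own models and metadata database, and the upsert syntax depends on the backend, so treat the SQL here as a placeholder for that step.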
