This is not a very helpful message, but I also had a problem with `airflow backfill -m` running extremely slowly on Airflow 1.7.0. In the end I worked around the need for it in that specific case, thinking that it was broken in 1.7.0 (re: https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee), but now that I am writing this and comparing 1.7.0's release date with that gitter comment, that explanation doesn't make sense. I'll give it another go.

On Sun, Dec 4, 2016 at 7:50 AM, Vikas Malhotra <[email protected]> wrote:

> Hello Harish,
>
> Based on our understanding of Python multiprocessing, a task instance gets
> a record in the underlying database only after there is an explicit call to
> airflow from that library (using the Local Executor). So, I might be wrong,
> but you won't find a record in the database until that task instance has
> been initiated. I might be wrong in our assumptions and would love to be
> corrected if that's the case.
>
> We have been using the latest only operator and it seems to be working well
> for skipping tasks if they are not current (basically avoiding backfill by
> marking all tasks below the latest only operator as skipped). It's present
> in the master branch as of now and I would recommend you look at that
> operator for backfill.
>
> Thanks!
> Vikas
>
> On Dec 4, 2016 5:23 AM, "harish singh" <[email protected]> wrote:
>
> Hi all,
>
> We have been running Airflow in production for over 8-9 months now.
> I know there is a separate thread in place for Airflow 2.0,
> but I was not sure if any of the prior versions has this fixed. If not, I
> will add this to the other email thread for 2.0.
>
> When I run airflow backfill with "-m" (mark jobs as succeeded without
> running them), is there a way to optimize this call?
>
> For example:
> airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m
>
> Here, I am running backfill for a month (from 1st Nov to 1st Dec),
> essentially marking the jobs as succeeded without running them.
>
> It has been more than an hour and the backfill has only managed to reach
> 2nd Nov. This seems very slow when there is no need to even run the tasks.
>
> I am running Airflow 1.7.0.
> These are my related configuration settings:
>
> parallelism = 50
> dag_concurrency = 20
> max_active_runs_per_dag = 8
>
> Also, I have around 9 DAGs running (all hourly). The other 8 DAGs are
> running as scheduled with a start_date of 2016-11-01T00:00:00.
>
> My question is, since I am only marking the jobs as "succeeded"
> without running them, can this be done with one SQL query, instead of on a
> per-hour, per-task basis? Maybe find all the TaskInstances that need to be
> marked succeeded and then just run one SQL statement?
>
> I may not be aware of a lot of things here, and it is very possible that I
> am assuming a lot of things incorrectly.
> Please feel free to correct me.
>
> Thanks,
> Harish
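
To make the single-query idea from the question concrete, here is a minimal sketch of what a bulk "mark as succeeded" could look like. It is not what `airflow backfill -m` does and it is not a tested recipe for a real Airflow metadata database: it runs against an in-memory SQLite table that is only loosely modeled on a task instance table, and the column layout, the task names (`extract`, `load`), and the `mark_success` helper are all assumptions made for illustration. It also assumes the rows already exist; per Vikas's note, rows for task instances that were never initiated may be missing and would have to be inserted rather than updated.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical, simplified stand-in for a task instance table.
# The real Airflow schema has more columns; this layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    " dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT,"
    " PRIMARY KEY (dag_id, task_id, execution_date))"
)

start = datetime(2016, 11, 1)
end = datetime(2016, 12, 1)
task_ids = ["extract", "load"]  # hypothetical task names

# Seed one row per task per hourly execution date, with no state yet.
rows = []
ts = start
while ts <= end:
    for task_id in task_ids:
        rows.append(("TEST_DAG", task_id, ts.isoformat(), None))
    ts += timedelta(hours=1)
conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?)", rows)


def mark_success(conn, dag_id, start, end):
    """Mark every existing task instance in [start, end] as succeeded
    with a single UPDATE, and return the number of rows changed."""
    cur = conn.execute(
        "UPDATE task_instance SET state = 'success' "
        "WHERE dag_id = ? AND execution_date BETWEEN ? AND ?",
        (dag_id, start.isoformat(), end.isoformat()),
    )
    conn.commit()
    return cur.rowcount


updated = mark_success(conn, "TEST_DAG", start, end)
print(updated)  # 721 hourly dates (both endpoints included) x 2 tasks = 1442
```

Even in this toy form, an hourly DAG over that one-month window has 721 execution dates per task, which gives a sense of how much work a per-hour, per-task loop has to do. Writing to the metadata database behind Airflow's back skips whatever bookkeeping the CLI does, so I would only try something like this on a copy of the database first.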
