`job_heartbeat_sec` is a configuration parameter (airflow.cfg) that sets the default for how long jobs wait between "cycles". Lowering it in your dev environment will make backfills go faster.
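
As a rough illustration of the override, here is a minimal sketch that parses a sample config fragment and lowers the value for dev use. The `[scheduler]` section name, the starting value of 30, and the helper `with_dev_heartbeat` are assumptions made for this example; check them against your own airflow.cfg.

```python
import configparser

# Illustrative fragment of airflow.cfg. The section name and the value are
# assumptions for this sketch, not copied from a real installation.
SAMPLE_CFG = """
[scheduler]
job_heartbeat_sec = 30
"""


def with_dev_heartbeat(cfg_text, seconds=5):
    """Return a parsed config with job_heartbeat_sec lowered for dev use.

    Hypothetical helper: it only edits the in-memory copy, never a file.
    """
    # interpolation=None so '%' characters in other settings are left alone
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    parser.set("scheduler", "job_heartbeat_sec", str(seconds))
    return parser


dev_cfg = with_dev_heartbeat(SAMPLE_CFG)
print(dev_cfg.getint("scheduler", "job_heartbeat_sec"))
```

The same change can of course be made by editing the file by hand; the point is only that the dev and production values can differ.
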
Max

On Tue, Dec 6, 2016 at 1:44 PM, harish singh <[email protected]> wrote:

> I am working around this with a not-so-pretty, hacky solution.
> Instead of "backfill -m" for the dag, I am using the "-t" flag and
> marking success only the 1st task of my pipeline. Once the backfill is
> complete, I used the UI to "Mark success" all "Future" and "Downstream"
> tasks.
>
> Max,
> I am not sure I clearly understood the point about an individual
> "heartrate" per job.
> Can "job_heartbeat_sec" be specified on a per-task basis?
>
> Thanks,
> Harish
>
> On Tue, Dec 6, 2016 at 9:31 AM, Maxime Beauchemin <
> [email protected]> wrote:
>
>> Oh thanks for pointing this out, I just did a round of review on that PR.
>>
>> While we have people's attention around backfill on this thread, I'd love
>> to introduce the new term "scheduler catchup" as something distinct from
>> `backfill`, at least until we get a single code path for both operations.
>>
>> Max
>>
>> On Tue, Dec 6, 2016 at 8:56 AM, Bolke de Bruin <[email protected]> wrote:
>>
>>> There is a PR out for not having backfills at all specified at the dag
>>> level as well.
>>>
>>> *From: *Maxime Beauchemin <[email protected]>
>>> *Sent: *Tuesday, December 6, 2016 16:54
>>> *To: *[email protected]
>>> *CC: *[email protected]
>>> *Subject: *Re: Performance: backfill --mark_success
>>>
>>> The backfill `mark_success` logic could really be optimized by not
>>> relying on `airflow run --mark_success`, and instead altering the
>>> database state directly, without actually triggering tasks at all or
>>> relying on the backfill logic. Simply scope the set of task instances,
>>> and merge (upsert) a `success` state to the db directly.
>>>
>>> To accelerate it as it is today, though, you can reduce some of the
>>> heartbeat configurations (job_heartbeat_sec). It's usually desirable to
>>> have this setting lower in dev (say 5 seconds) than in production (30-60
>>> seconds).
>>>
>>> I suggest that a better default for `heartrate`, individually
>>> configurable for the different types of jobs, be put in place in
>>> `jobs.py`.
>>>
>>> Max
>>>
>>> On Mon, Dec 5, 2016 at 12:39 PM, Laura Lorenz <[email protected]>
>>> wrote:
>>>
>>> > This is not that helpful of a message, but I also was having a problem
>>> > with `airflow backfill -m` on Airflow version 1.7.0 with it going super
>>> > slow. In the end I got around the necessity in that specific case,
>>> > thinking that it was broken in 1.7.0 re (
>>> > https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee
>>> > ), but now that I am writing this and triangulating 1.7.0's release and
>>> > that gitter comment, it doesn't make sense. I'll give it another go.
>>> >
>>> > On Sun, Dec 4, 2016 at 7:50 AM, Vikas Malhotra <[email protected]>
>>> > wrote:
>>> >
>>> > > Hello Harish,
>>> > >
>>> > > Based on our understanding of Python multiprocessing, a task instance
>>> > > gets a record in the underlying database after there is an explicit
>>> > > call to airflow from that library (using Local Executor). So, I might
>>> > > be wrong, but you won't find a record in the database until that task
>>> > > instance has been initiated. I would love to be corrected if our
>>> > > assumptions are wrong.
>>> > >
>>> > > We have been using the latest only operator and it seems to be working
>>> > > well for skipping tasks if they are not current (basically avoiding
>>> > > backfill by marking all tasks below the latest only operator as
>>> > > skipped). It's present in the master branch as of now and I would
>>> > > recommend you look at that operator for backfill.
>>> > >
>>> > > Thanks!
>>> > > Vikas
>>> > >
>>> > > On Dec 4, 2016 5:23 AM, "harish singh" <[email protected]> wrote:
>>> > >
>>> > > Hi all,
>>> > >
>>> > > We have been running Airflow in production for over 8-9 months now.
>>> > > I know there is a separate thread in place for Airflow 2.0.
>>> > > But I was not sure if any of the prior versions has this fixed. If
>>> > > not, I will add this to the other email thread for 2.0.
>>> > >
>>> > > When I run airflow backfill with "-m" (mark jobs as succeeded without
>>> > > running them), is there a way to optimize this call?
>>> > >
>>> > > For example:
>>> > > airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m
>>> > >
>>> > > Here, I am running backfill for a month (from 1st Nov to 1st Dec),
>>> > > essentially marking the jobs as succeeded without running them.
>>> > >
>>> > > It has been more than an hour and the backfill has managed to reach
>>> > > only up to 2nd Nov.
>>> > > This seems to be very slow when there is no need to even run the
>>> > > tasks.
>>> > >
>>> > > I am running Airflow 1.7.0.
>>> > > These are my related configuration settings:
>>> > >
>>> > > parallelism = 50
>>> > > dag_concurrency = 20
>>> > > max_active_runs_per_dag = 8
>>> > >
>>> > > Also, I have around 9 dags running (all hourly). The other 8 dags are
>>> > > running as scheduled with a start_date of 2016-11-01T00:00:00.
>>> > >
>>> > > My question is: since I am only marking the jobs as "succeeded"
>>> > > without running them, can this be done in 1 SQL query, instead of on
>>> > > a per-hour, per-task basis?
>>> > > Maybe find all the TaskInstances that need to be marked succeeded
>>> > > and then just run one SQL statement?
>>> > >
>>> > > I may not be aware of a lot of things here, and it is very possible
>>> > > that I am assuming a lot of things incorrectly.
>>> > > Please feel free to correct me.
>>> > >
>>> > > Thanks,
>>> > > Harish
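
The idea raised in the thread (scope the task instances, then upsert a `success` state in one batch instead of per hour, per task) can be sketched roughly as below. This is only an illustration under stated assumptions: the table is a simplified, hypothetical stand-in held in an in-memory SQLite database, not Airflow's real schema, and the helper `mark_success` and the task names `extract` and `load` are invented for the example.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical, simplified stand-in for a task-instance table. This is NOT
# Airflow's real schema and the sketch never touches an Airflow database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    " dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT,"
    " PRIMARY KEY (dag_id, task_id, execution_date))"
)


def mark_success(conn, dag_id, task_ids, execution_dates):
    """Upsert a 'success' row for every (task, date) pair in one transaction."""
    rows = [
        (dag_id, task_id, date, "success")
        for date in execution_dates
        for task_id in task_ids
    ]
    with conn:  # commits once for the whole range, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO task_instance VALUES (?, ?, ?, ?)", rows
        )
    return len(rows)


# One day of hourly runs for a two-task dag (task names are made up).
start = datetime(2016, 11, 1)
dates = [(start + timedelta(hours=h)).isoformat() for h in range(24)]
written = mark_success(conn, "TEST_DAG", ["extract", "load"], dates)
print(written)
```

Because the write is an upsert keyed on (dag_id, task_id, execution_date), running it again over the same range leaves the row count unchanged.
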
