Re: Performance: backfill --mark_success

Maxime Beauchemin Tue, 06 Dec 2016 08:54:29 -0800

The backfill `mark_success` logic could really be optimized. Instead of
relying on `airflow run --mark_success`, which goes through the backfill
logic and actually triggers tasks, it could alter the database state
directly: simply determine the set of task instances in scope, and merge
(upsert) a `success` state to the db directly.
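
To make the idea concrete, here is a rough sketch of a bulk upsert. It is
not Airflow code: the `task_instance` table is a simplified, hypothetical
stand-in, `mark_success` is a made-up helper, and sqlite3 with
`INSERT OR REPLACE` is just the easiest way to show an upsert in a
self-contained script. A real version would go through Airflow's own
models and session.

```python
import sqlite3
from datetime import datetime, timedelta


def mark_success(conn, dag_id, task_ids, start, end, step=timedelta(hours=1)):
    """Upsert a 'success' row for every (task, execution date) in scope."""
    rows = []
    current = start
    while current <= end:
        for task_id in task_ids:
            rows.append((task_id, dag_id, current.isoformat(), "success"))
        current += step
    # One transaction, one batched statement, no tasks triggered.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO task_instance "
            "(task_id, dag_id, execution_date, state) VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Hypothetical, simplified schema (assumption, not Airflow's real table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    " task_id TEXT, dag_id TEXT, execution_date TEXT, state TEXT,"
    " PRIMARY KEY (task_id, dag_id, execution_date))"
)
# A pre-existing row in another state gets overwritten by the upsert.
conn.execute(
    "INSERT INTO task_instance VALUES "
    "('extract', 'TEST_DAG', '2016-11-01T00:00:00', 'failed')"
)

count = mark_success(
    conn,
    "TEST_DAG",
    ["extract", "load"],  # hypothetical task ids
    datetime(2016, 11, 1),
    datetime(2016, 11, 1, 23),
)
```

The point is only that the work becomes one batched write per backfill
instead of one process per task instance.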


To accelerate it as it is today, though, you can reduce the heartbeat
configuration (`job_heartbeat_sec`). It's usually desirable to have this
setting lower in dev (say 5 seconds) than in production (30-60 seconds).
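
For illustration, a dev setup might carry something like the fragment
below in `airflow.cfg`. The `[scheduler]` section placement is my
assumption here, so check it against your own config file; the snippet
only parses a sample string with the standard library and does not touch
Airflow.

```python
import configparser

# Sample fragment only; the section name is an assumption to verify
# against your own airflow.cfg.
SAMPLE_CFG = """
[scheduler]
job_heartbeat_sec = 5
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

# Value in seconds, as suggested above for a dev environment.
heartbeat = parser.getint("scheduler", "job_heartbeat_sec")
```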

I suggest putting better defaults in place for `heartrate`, individually
configurable for the different types of jobs in `jobs.py`.

Max

On Mon, Dec 5, 2016 at 12:39 PM, Laura Lorenz <[email protected]>
wrote:

> This is not that helpful of a message, but I also was having a problem with
> `airflow backfill -m` on Airflow version 1.7.0 with it going super slow. In
> the end I got around the necessity in that specific case, thinking that it
> was broken in 1.7.0 re (
> https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee),
> but now that I am writing this and triangulating 1.7.0's release and that
> gitter comment, it doesn't make sense. I'll give it another go.
>
> On Sun, Dec 4, 2016 at 7:50 AM, Vikas Malhotra <[email protected]> wrote:
>
> > Hello Harish,
> >
> > Based on our understanding of Python multiprocessing, a task instance
> > gets a record in the underlying database only after there is an explicit
> > call to airflow from that library (using the Local Executor). So you
> > won't find a record in the database unless that task instance has been
> > initiated. I might be wrong in our assumptions and would love to be
> > corrected if that's the case.
> >
> > We have been using the latest only operator and it seems to be working
> > well for skipping tasks if they are not current (basically avoiding
> > backfill by marking all tasks below the latest only operator as skipped).
> > It's present in the master branch as of now, and I would recommend you
> > look at that operator for backfill.
> >
> > Thanks!
> > Vikas
> >
> > On Dec 4, 2016 5:23 AM, "harish singh" <[email protected]> wrote:
> >
> > Hi all,
> >
> > We have been running Airflow in our production for over 8-9 months now.
> > I know there is a separate thread in place for Airflow 2.0.
> > But I was not sure if any of the prior version has this fixed.  If not, I
> > will add this to the other email thread for 2.0.
> >
> > When I run airflow backfill with "-m"  (Mark jobs as succeeded without
> > running them) ,
> > is there a way to optimize this call?
> >
> > For example:
> > airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m
> >
> > Here, I am running backfill for a month (from 1st Nov to 1st Dec).
> > Essentially, Marking the jobs as succeeded without running them.
> >
> > It has been more than an hour and the backfill has managed to reach only
> > up to 2nd Nov.
> > This seems to be very slow when there is no need to even run the tasks.
> >
> >
> > I am running Airflow 1.7.0:
> > These are my related configuration settings:
> >
> > parallelism = 50
> > dag_concurrency = 20
> > max_active_runs_per_dag = 8
> >
> > Also, I have around 9 Dags running (all Hourly). The other 8 dags are
> > running as scheduled with start_date of 2016-11-01T00:00:00
> >
> > My question is, since I am only marking the jobs as "succeeded"
> > without running them,
> > can this be done in 1 SQL query, instead of on a per-hour, per-task basis?
> > Maybe find all the TaskInstances that need to be marked as succeeded
> > and then just run one SQL statement?
> >
> > I may not be aware of a lot of things here, and it is very possible
> > I am assuming a lot of things incorrectly.
> > Please feel free to correct me.
> >
> >
> > Thanks,
> > Harish
> >
>
