Re: Performance: backfill --mark_success

Maxime Beauchemin Tue, 06 Dec 2016 09:31:23 -0800

Oh thanks for pointing this out, I just did a round of review on that PR.

While we have people's attention around backfill on this thread, I'd love
to introduce the new term "scheduler catchup" as something distinct from
`backfill`, at least until we get a single code path for both operations.


Max

On Tue, Dec 6, 2016 at 8:56 AM, Bolke de Bruin <[email protected]> wrote:

> There is also a PR out for not having backfills at all, specified at the
> DAG level.
>
> *From: *Maxime Beauchemin <[email protected]>
> *Sent: *Tuesday, 6 December 2016 16:54
> *To: *[email protected]
> *CC: *[email protected]
> *Subject: *Re: Performance: backfill --mark_success
>
> The backfill `mark_success` logic could really be optimized by not relying
> on `airflow run --mark_success`: alter the database state directly instead
> of actually triggering tasks and relying on the backfill logic. Simply
> determine the set of task instances in scope, and merge (upsert) a
> `success` state to the db directly.
>
> To accelerate it as it is today, though, you can lower some of the
> heartbeat settings (job_heartbeat_sec). It's usually desirable to have
> this setting lower in dev (say 5 seconds) than in production (30-60
> seconds).
>
> I suggest putting better defaults for `heartrate` in place, individually
> configurable for the different types of jobs in `jobs.py`.
>
> Max
>
> On Mon, Dec 5, 2016 at 12:39 PM, Laura Lorenz <[email protected]> wrote:
>
> > This is not that helpful of a message, but I was also having a problem
> > with `airflow backfill -m` on Airflow version 1.7.0, with it going super
> > slow. In the end I got around the necessity in that specific case,
> > thinking that it was broken in 1.7.0 (see
> > https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee),
> > but now that I am writing this and triangulating 1.7.0's release and that
> > gitter comment, it doesn't make sense. I'll give it another go.
> >
> > On Sun, Dec 4, 2016 at 7:50 AM, Vikas Malhotra <[email protected]> wrote:
> >
> > > > Hello Harish,
> > > >
> > > > Based on our understanding of Python multiprocessing, a task instance
> > > > gets a record in the underlying database only after there is an
> > > > explicit call to airflow from that library (using the Local Executor).
> > > > So, I might be wrong, but you won't find a record in the database
> > > > unless that task instance has been initiated. I might be wrong in our
> > > > assumptions and would love to be corrected if that's the case.
> > > >
> > > > We have been using the latest only operator and it seems to be working
> > > > well for skipping tasks if they are not current (basically avoiding
> > > > backfill by marking all tasks below the latest only operator as
> > > > skipped). It's present in the master branch as of now, and I would
> > > > recommend that you look at that operator for backfill.
> > > >
> > > > Thanks!
> > > > Vikas
> > > >
> > > > On Dec 4, 2016 5:23 AM, "harish singh" <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We have been running Airflow in production for 8-9 months now.
> > > > I know there is a separate thread in place for Airflow 2.0,
> > > > but I was not sure whether any prior version has this fixed. If not,
> > > > I will add this to the other email thread for 2.0.
> > > >
> > > > When I run airflow backfill with "-m" (mark jobs as succeeded without
> > > > running them), is there a way to optimize this call?
> > > >
> > > > For example:
> > > > airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m
> > > >
> > > > Here, I am running a backfill for a month (from 1st Nov to 1st Dec),
> > > > essentially marking the jobs as succeeded without running them.
> > > >
> > > > It has been more than an hour and the backfill has only managed to
> > > > reach 2nd Nov.
> > > > This seems very slow when there is no need to even run the tasks.
> > > >
> > > > I am running Airflow 1.7.0.
> > > > These are my related configuration settings:
> > > >
> > > > parallelism = 50
> > > > dag_concurrency = 20
> > > > max_active_runs_per_dag = 8
> > > >
> > > > Also, I have around 9 DAGs running (all hourly). The other 8 DAGs are
> > > > running as scheduled with a start_date of 2016-11-01T00:00:00.
> > > >
> > > > My question is: since I am only marking the jobs as "succeeded"
> > > > without running them, can this be done in one SQL query, instead of on
> > > > a per-hour, per-task basis? Maybe find all the TaskInstances that need
> > > > to be marked succeeded and then just run one SQL statement?
> > > >
> > > > I may not be aware of a lot of things here, and it is very possible
> > > > that I am assuming a lot of things incorrectly.
> > > > Please feel free to correct me.
> > > >
> > > > Thanks,
> > > > Harish
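
For illustration, the direct-database approach discussed in this thread (determine the task instances in scope, then upsert a `success` state for all of them at once) could be sketched roughly as below. This is not Airflow's implementation: it assumes a simplified, hypothetical `task_instance` table in an in-memory SQLite database, made-up task IDs (`extract`, `load`), an hourly schedule, and an inclusive end date. The helper names `hourly_dates` and `mark_success` are invented for this sketch.

```python
import sqlite3
from datetime import datetime, timedelta


def hourly_dates(start, end):
    """Return every hourly execution date from start to end, inclusive."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(hours=1)
    return dates


def mark_success(conn, dag_id, task_ids, start, end):
    """Upsert a 'success' state for every (task, execution date) in scope."""
    rows = [
        (dag_id, task_id, date.isoformat(), "success")
        for date in hourly_dates(start, end)
        for task_id in task_ids
    ]
    with conn:  # a single transaction covers the whole date range
        conn.executemany(
            "INSERT OR REPLACE INTO task_instance "
            "(dag_id, task_id, execution_date, state) VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Hypothetical, simplified stand-in for the real metadata table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    " dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT,"
    " PRIMARY KEY (dag_id, task_id, execution_date))"
)
# An existing row in another state, to show that the upsert overwrites it.
conn.execute(
    "INSERT INTO task_instance VALUES "
    "('TEST_DAG', 'extract', '2016-11-01T00:00:00', 'failed')"
)

count = mark_success(
    conn,
    "TEST_DAG",
    ["extract", "load"],  # made-up task IDs
    datetime(2016, 11, 1),
    datetime(2016, 12, 1),
)
```

The point of the sketch is that the whole month is written in one batched statement and one transaction, with no task processes started and no heartbeats to wait on. A real change would have to go through Airflow's own models and respect whatever other columns and invariants the metadata database has.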
