Re: Performance: backfill --mark_success

harish singh Tue, 06 Dec 2016 13:45:43 -0800

I am working around this with a not-so-pretty, hacky solution.
Instead of "backfill -m" for the whole DAG, I am using the "-t" flag and
marking success only for the first task of my pipeline. Once the backfill is
complete, I use the UI to "Mark success" on all "Future" and "Downstream" tasks.


Max,
I am not sure I clearly understood the point about an individual "heartrate" per job.
Can "job_heartbeat_sec" be specified on a per-task basis?

Thanks,
Harish

On Tue, Dec 6, 2016 at 9:31 AM, Maxime Beauchemin <[email protected]> wrote:

> Oh thanks for pointing this out, I just did a round of review on that PR.
>
> While we have people's attention on backfill in this thread, I'd love
> to introduce the new term "scheduler catchup" as something distinct from
> `backfill`, at least until we get a single code path for both operations.
>
> Max
>
> On Tue, Dec 6, 2016 at 8:56 AM, Bolke de Bruin <[email protected]> wrote:
>
>> There is also a PR out for specifying, at the DAG level, that no backfills
>> should happen at all.
>>
>>
>>
>> *From: *Maxime Beauchemin <[email protected]>
>> *Sent: *Tuesday, 6 December 2016 16:54
>> *To: *[email protected]
>> *CC: *[email protected]
>> *Subject: *Re: Performance: backfill --mark_success
>>
>>
>>
>> The backfill `mark_success` logic could really be optimized: instead of
>> relying on `airflow run --mark_success`, which triggers tasks and depends on
>> the backfill logic, it could alter the database state directly without
>> triggering any tasks at all. Simply determine the set of task instances in
>> scope, and merge (upsert) a `success` state to the db directly.
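As a rough illustration of the direct-upsert idea quoted above, here is a minimal sketch using an in-memory SQLite table. The table layout, the `mark_success` helper and the task names are assumptions made for this example; this is not Airflow's actual schema or code, and `INSERT OR REPLACE` is SQLite-specific.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
# Simplified stand-in for a task-instance table; NOT Airflow's actual schema.
conn.execute(
    """
    CREATE TABLE task_instance (
        dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT,
        PRIMARY KEY (dag_id, task_id, execution_date)
    )
    """
)
# One pre-existing row, to show that the upsert overwrites its state.
conn.execute(
    "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
    ("TEST_DAG", "extract", "2016-11-01T00:00:00", "failed"),
)


def mark_success(conn, dag_id, task_ids, start, end, interval):
    """Upsert a 'success' row for every task and execution date in scope."""
    rows = []
    current = start
    while current <= end:
        for task_id in task_ids:
            rows.append((dag_id, task_id, current.isoformat(), "success"))
        current += interval
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR REPLACE INTO task_instance VALUES (?, ?, ?, ?)", rows
        )
    return len(rows)


# Hypothetical task names; one day of hourly runs, both endpoints included.
n = mark_success(
    conn, "TEST_DAG", ["extract", "load"],
    datetime(2016, 11, 1), datetime(2016, 11, 2), timedelta(hours=1),
)
```

The point is only that no task is triggered: the set of rows is computed up front and written in a single batch.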
>>
>>
>>
>> To accelerate it as it is today, though, you can lower some of the
>> heartbeat settings (job_heartbeat_sec). It's usually desirable to have this
>> setting lower in dev (say 5 seconds) than in production (30-60 seconds).
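For reference, a sketch of where this setting would sit. I am assuming the `[scheduler]` section of airflow.cfg here; the value is just the dev-style figure mentioned above, and the snippet only parses an inline fragment rather than a real config file.

```python
import configparser

# Fragment in the style of airflow.cfg. The [scheduler] section name is an
# assumption; 5 is the dev-style value suggested in the thread.
DEV_CFG = """
[scheduler]
job_heartbeat_sec = 5
"""

parser = configparser.ConfigParser()
parser.read_string(DEV_CFG)
job_heartbeat_sec = parser.getint("scheduler", "job_heartbeat_sec")
```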
>>
>>
>>
>> I suggest putting better defaults for `heartrate` in place, individually
>> configurable for the different types of jobs in `jobs.py`.
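To make the per-job-type idea concrete (it is about types of jobs, not individual tasks), a hypothetical sketch follows. The classes below are stand-ins written for this example and the numbers are invented; nothing here is the actual `jobs.py` code.

```python
class BaseJob:
    # Hypothetical shared default, in seconds.
    heartrate = 5.0

    def __init__(self, heartrate=None):
        # An explicit argument overrides the class-level default.
        if heartrate is not None:
            self.heartrate = heartrate


class BackfillJob(BaseJob):
    heartrate = 1.0   # hypothetical: poll quickly, e.g. when only marking success


class SchedulerJob(BaseJob):
    heartrate = 30.0  # hypothetical: slower, production-style value


defaults = {
    cls.__name__: cls().heartrate
    for cls in (BaseJob, BackfillJob, SchedulerJob)
}
```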
>>
>>
>>
>> Max
>>
>>
>>
>> On Mon, Dec 5, 2016 at 12:39 PM, Laura Lorenz <[email protected]> wrote:
>>
>>
>>
>> > This is not that helpful of a message, but I was also having a problem
>> > with `airflow backfill -m` on Airflow version 1.7.0 going super slow. In
>> > the end I got around the need for it in that specific case, thinking that
>> > it was broken in 1.7.0 re
>> > (https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee),
>> > but now that I am writing this and triangulating 1.7.0's release and that
>> > gitter comment, it doesn't make sense. I'll give it another go.
>>
>> >
>>
>> > On Sun, Dec 4, 2016 at 7:50 AM, Vikas Malhotra <[email protected]> wrote:
>>
>> >
>>
>> > > Hello Harish,
>>
>> > >
>>
>> > > Based on our understanding of Python multiprocessing, a task instance
>> > > gets a record in the underlying database only after there is an explicit
>> > > call to airflow from that library (using the Local Executor). So you
>> > > won't find a record in the database unless that task instance has been
>> > > initiated. I might be wrong in our assumptions and would love to be
>> > > corrected if that's the case.
>>
>> > >
>>
>> > > We have been using the latest-only operator and it seems to be working
>> > > well for skipping tasks if they are not current (basically avoiding
>> > > backfill by marking all tasks below the latest-only operator as skipped).
>> > > It's present in the master branch as of now, and I would recommend you
>> > > look at that operator for backfill.
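A simplified sketch of the kind of check such an operator can make: a run counts as "latest" only if the current time falls inside the schedule window right after the period it covers. The function name and the fixed-interval arithmetic are assumptions for illustration, not the operator's actual implementation.

```python
from datetime import datetime, timedelta


def is_latest_run(execution_date, interval, now):
    """Return True when `now` falls in the window right after this run's period."""
    left = execution_date + interval   # end of the period this run covers
    right = left + interval            # end of the following period
    return left < now <= right


hourly = timedelta(hours=1)
now = datetime(2016, 12, 4, 7, 50)

# The 06:00 run covers 06:00-07:00, so at 07:50 it is the latest run.
latest = is_latest_run(datetime(2016, 12, 4, 6, 0), hourly, now)
# A run from 1 Nov is far behind; its downstream tasks would be skipped.
stale = is_latest_run(datetime(2016, 11, 1, 0, 0), hourly, now)
```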
>>
>> > >
>>
>> > > Thanks!
>>
>> > > Vikas
>>
>> > >
>>
>> > > On Dec 4, 2016 5:23 AM, "harish singh" <[email protected]> wrote:
>>
>> > >
>>
>> > > Hi all,
>>
>> > >
>>
>> > > We have been running Airflow in production for 8-9 months now.
>> > > I know there is a separate thread in place for Airflow 2.0,
>> > > but I was not sure whether any of the prior versions has this fixed. If
>> > > not, I will add this to the other email thread for 2.0.
>>
>> > >
>>
>> > > When I run airflow backfill with "-m" (mark jobs as succeeded without
>> > > running them), is there a way to optimize this call?
>>
>> > >
>>
>> > > For example:
>>
>> > > airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m
>>
>> > >
>>
>> > > Here, I am running backfill for a month (from 1st Nov to 1st Dec),
>> > > essentially marking the jobs as succeeded without running them.
>> > >
>> > > It has been more than an hour and the backfill has only managed to reach
>> > > 2nd Nov. This seems very slow when there is no need to even run the
>> > > tasks.
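A quick back-of-the-envelope check of those numbers, assuming TEST_DAG runs hourly and both endpoints are included (neither is stated explicitly above), and taking the observed pace as roughly one day of schedule per hour of wall-clock time:

```python
from datetime import datetime, timedelta

start = datetime(2016, 11, 1)
end = datetime(2016, 12, 1)
interval = timedelta(hours=1)  # assumption: TEST_DAG is scheduled hourly

# Execution dates from start to end, counting both endpoints.
n_runs = int((end - start) / interval) + 1

# Observed pace: about 24 hourly runs (1 Nov to 2 Nov) per wall-clock hour.
runs_per_wall_hour = 24
est_hours = n_runs / runs_per_wall_hour
```

Under those assumptions the range holds 721 execution dates, and the whole month would take about 30 hours at the observed pace.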
>>
>> > >
>>
>> > >
>>
>> > > I am running Airflow 1.7.0. These are my related configuration settings:
>> > >
>> > > parallelism = 50
>> > > dag_concurrency = 20
>> > > max_active_runs_per_dag = 8
>>
>> > >
>>
>> > > Also, I have around 9 DAGs running (all hourly). The other 8 DAGs are
>> > > running as scheduled with a start_date of 2016-11-01T00:00:00.
>>
>> > >
>>
>> > > My question is: since I am only marking the jobs as "succeeded" without
>> > > running them, can this be done in one SQL query instead of on a per-hour,
>> > > per-task basis? Maybe find all the TaskInstances that need to be marked
>> > > succeeded and then just run one SQL statement?
>> > >
>> > > I may not be aware of a lot of things here, and it is very possible I am
>> > > assuming a lot of things incorrectly. Please feel free to correct me.
>>
>> > >
>>
>> > >
>>
>> > > Thanks,
>>
>> > > Harish
>>
>> > >
>>
>> >
>>
>>
>>
>
>
