Re: Performance: backfill --mark_success

Maxime Beauchemin Wed, 07 Dec 2016 09:52:48 -0800

`job_heartbeat_sec` is a configuration parameter (in airflow.cfg) that sets
the default for how long jobs wait between "cycles". Lowering it in your dev
environment will make backfills go faster.


Max
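
A minimal sketch of the setting discussed above. It assumes `job_heartbeat_sec` lives in the `[scheduler]` section of airflow.cfg, as in the default Airflow 1.x configuration file; check your own file. The helper name `read_job_heartbeat` and the 30-second fallback are hypothetical, chosen only for illustration, and the fragment is parsed with Python's standard `configparser` rather than Airflow's own configuration module.

```python
import configparser

# Hypothetical dev-environment fragment of airflow.cfg. The [scheduler]
# section placement is an assumption based on Airflow 1.x defaults.
DEV_CFG = """
[scheduler]
job_heartbeat_sec = 5
"""


def read_job_heartbeat(cfg_text, fallback=30.0):
    """Return job_heartbeat_sec in seconds, or `fallback` if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getfloat("scheduler", "job_heartbeat_sec", fallback=fallback)


if __name__ == "__main__":
    # A lower value means less waiting between job cycles.
    print(read_job_heartbeat(DEV_CFG))
```

A value around 5 seconds matches the dev setting suggested in this thread; production values of 30-60 seconds are the ones mentioned for comparison.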

On Tue, Dec 6, 2016 at 1:44 PM, harish singh <[email protected]>
wrote:

> I am working around this with a not-so-pretty, hacky solution.
> Instead of "backfill -m" for the whole DAG, I am using the "-t" flag and
> marking success on only the first task of my pipeline. Once the backfill is
> complete, I use the UI to "Mark success" on all "Future" and "Downstream"
> tasks.
>
> Max,
> I am not sure I clearly understood the point about an individual
> "heartrate" per job. Can "job_heartbeat_sec" be specified on a per-task
> basis?
>
> Thanks,
> Harish
>
> On Tue, Dec 6, 2016 at 9:31 AM, Maxime Beauchemin <
> [email protected]> wrote:
>
>> Oh thanks for pointing this out, I just did a round of review on that PR.
>>
>> While we have people's attention around backfill on this thread, I'd love
>> to introduce the new term "scheduler catchup" as something distinct from
>> `backfill`, at least until we get a single code path for both operations.
>>
>> Max
>>
>> On Tue, Dec 6, 2016 at 8:56 AM, Bolke de Bruin <[email protected]> wrote:
>>
>>> There is also a PR out for specifying, at the DAG level, that no
>>> backfills happen at all.
>>>
>>> *From: *Maxime Beauchemin <[email protected]>
>>> *Sent: *Tuesday, 6 December 2016 16:54
>>> *To: *[email protected]
>>> *CC: *[email protected]
>>> *Subject: *Re: Performance: backfill --mark_success
>>>
>>> The backfill `mark_success` logic could really be optimized by not relying
>>> on `airflow run --mark_success`: alter the database state directly
>>> instead of actually triggering tasks and relying on the backfill
>>> logic. Simply determine the set of task instances in scope, and merge
>>> (upsert) a `success` state to the db directly.
>>>
>>> To accelerate it as it is today, though, you can reduce some of the
>>> heartbeat configurations (job_heartbeat_sec). It's usually desirable to
>>> have this setting lower in dev (say 5 seconds) than in production (30-60
>>> seconds).
>>>
>>> I suggest that better defaults for `heartrate`, individually configurable,
>>> be put in place for the different types of jobs in `jobs.py`.
>>>
>>> Max
>>>
>>> On Mon, Dec 5, 2016 at 12:39 PM, Laura Lorenz <[email protected]>
>>> wrote:
>>>
>>> > This is not that helpful of a message, but I was also having a problem
>>> > with `airflow backfill -m` on Airflow version 1.7.0 going super slow. In
>>> > the end I got around the need for it in that specific case, thinking that
>>> > it was broken in 1.7.0 (re:
>>> > https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee
>>> > ), but now that I am writing this and triangulating 1.7.0's release
>>> > against that gitter comment, it doesn't make sense. I'll give it another
>>> > go.
>>> >
>>> > On Sun, Dec 4, 2016 at 7:50 AM, Vikas Malhotra <[email protected]>
>>> > wrote:
>>> >
>>> > > Hello Harish,
>>> > >
>>> > > Based on our understanding of Python multiprocessing, a task instance
>>> > > gets a record in the underlying database after there is an explicit
>>> > > call to airflow from that library (using the Local Executor). So, I
>>> > > might be wrong, but you won't find a record in the database until that
>>> > > task instance has been initiated. I might be wrong in our assumptions
>>> > > and would love to be corrected if that's the case.
>>> > >
>>> > > We have been using the latest only operator and it seems to be working
>>> > > well for skipping tasks if they are not current (basically avoiding
>>> > > backfill by marking all tasks below the latest only operator as
>>> > > skipped). It's present in the master branch as of now and I would
>>> > > recommend you look at that operator for backfill.
>>> > >
>>> > > Thanks!
>>> > > Vikas
>>> > >
>>> > > On Dec 4, 2016 5:23 AM, "harish singh" <[email protected]> wrote:
>>> > >
>>> > > Hi all,
>>> > >
>>> > > We have been running Airflow in production for 8-9 months now.
>>> > > I know there is a separate thread in place for Airflow 2.0,
>>> > > but I was not sure whether any prior version has this fixed. If not, I
>>> > > will add this to the other email thread for 2.0.
>>> > >
>>> > > When I run airflow backfill with "-m" (mark jobs as succeeded without
>>> > > running them), is there a way to optimize this call?
>>> > >
>>> > > For example:
>>> > > airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m
>>> > >
>>> > > Here, I am running backfill for a month (from 1st Nov to 1st Dec),
>>> > > essentially marking the jobs as succeeded without running them.
>>> > >
>>> > > It has been more than an hour and the backfill has only managed to
>>> > > reach 2nd Nov.
>>> > > This seems very slow when there is no need to even run the tasks.
>>> > >
>>> > > I am running Airflow 1.7.0.
>>> > > These are my related configuration settings:
>>> > >
>>> > > parallelism = 50
>>> > > dag_concurrency = 20
>>> > > max_active_runs_per_dag = 8
>>> > >
>>> > > Also, I have around 9 DAGs running (all hourly). The other 8 DAGs are
>>> > > running as scheduled with a start_date of 2016-11-01T00:00:00.
>>> > >
>>> > > My question is: since I am only marking the jobs as "succeeded"
>>> > > without running them, can this be done with one SQL query, instead of
>>> > > on a per-hour, per-task basis? Maybe find all the TaskInstances that
>>> > > need to be marked succeeded and then just run one SQL statement?
>>> > >
>>> > > I may not be aware of a lot of things here, and it is very possible I
>>> > > am assuming a lot of things incorrectly.
>>> > > Please feel free to correct me.
>>> > >
>>> > > Thanks,
>>> > > Harish
>>> > >
>>> >
>>>
>>
>>
>
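
The idea raised in this thread, scoping the task instances for a date range and merging (upserting) a `success` state directly instead of triggering each task, can be sketched roughly as below. This is only an illustration under loud assumptions: the `task_instance` table here is a simplified, hypothetical stand-in for Airflow's metadata schema, it uses an in-memory SQLite database with `INSERT OR REPLACE` as the upsert, the task ids `t1` and `t2` and the function names are invented, and the end date is treated as inclusive. It is not how `airflow backfill -m` is implemented.

```python
import sqlite3
from datetime import datetime, timedelta


def execution_dates(start, end, step):
    """Yield schedule points from start to end (inclusive), `step` apart."""
    current = start
    while current <= end:
        yield current
        current += step


def mark_success(conn, dag_id, task_ids, start, end, step):
    """Upsert a 'success' row for every (task, execution date) in scope."""
    rows = [
        (task_id, dag_id, dt.isoformat(), "success")
        for dt in execution_dates(start, end, step)
        for task_id in task_ids
    ]
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR REPLACE INTO task_instance "
            "(task_id, dag_id, execution_date, state) VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Hypothetical, simplified table; the real metadata table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    " task_id TEXT, dag_id TEXT, execution_date TEXT, state TEXT,"
    " PRIMARY KEY (task_id, dag_id, execution_date))"
)
# A pre-existing row in another state, to show that the upsert overwrites it.
conn.execute(
    "INSERT INTO task_instance VALUES "
    "('t1', 'TEST_DAG', '2016-11-01T00:00:00', 'failed')"
)

marked = mark_success(
    conn,
    "TEST_DAG",
    ["t1", "t2"],
    datetime(2016, 11, 1),
    datetime(2016, 12, 1),
    timedelta(hours=1),
)
print(marked)
```

For an hourly DAG over the month in the example command, the scope is 721 execution dates per task, all written in a single transaction rather than one job cycle per task instance.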
