This may not be much help, but I also had a problem with
`airflow backfill -m` running very slowly on Airflow 1.7.0. In the end I
worked around needing it in that specific case, thinking that it was
broken in 1.7.0 (see
https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee),
but now that I am writing this and comparing the 1.7.0 release date with
that gitter comment, that explanation doesn't make sense. I'll give it
another go.

On Sun, Dec 4, 2016 at 7:50 AM, Vikas Malhotra <[email protected]>
wrote:

> Hello Harish,
>
> Based on our understanding of Python multiprocessing, a task instance gets
> a record in the underlying database only after there is an explicit call to
> airflow from that library (using the Local Executor). So you won't find a
> record in the database until that task instance has been initiated. I might
> be wrong in these assumptions and would love to be corrected if that's the
> case.
>
> We have been using the latest-only operator and it seems to be working well
> for skipping tasks that are not current (basically avoiding backfill by
> marking all tasks below the latest-only operator as skipped). It is present
> in the master branch as of now, and I would recommend looking at that
> operator for backfill.
>
> Thanks!
> Vikas
>
> On Dec 4, 2016 5:23 AM, "harish singh" <[email protected]> wrote:
>
> Hi all,
>
> We have been running Airflow in production for 8-9 months now.
> I know there is a separate thread in place for Airflow 2.0,
> but I was not sure whether any prior version already has this fixed. If
> not, I will add this to the other email thread for 2.0.
>
> When I run airflow backfill with "-m" (mark jobs as succeeded without
> running them),
> is there a way to optimize this call?
>
> For example:
> airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m
>
> Here, I am running backfill for a month (from 1st Nov to 1st Dec),
> essentially marking the jobs as succeeded without running them.
>
> It has been more than an hour and the backfill has only reached
> 2nd Nov.
> This seems very slow when there is no need to even run the tasks.
>
>
> I am running Airflow 1.7.0.
> These are my related configuration settings:
>
> parallelism = 50
> dag_concurrency = 20
> max_active_runs_per_dag = 8
>
> Also, I have around 9 DAGs running (all hourly). The other 8 DAGs are
> running as scheduled with a start_date of 2016-11-01T00:00:00.
>
> My question is: since I am only marking the jobs as "succeeded"
> without running them,
> can this be done in one SQL query, instead of on a per-hour, per-task basis?
> Maybe find all the TaskInstances that need to be marked succeeded
> and then just run a single SQL statement?
>
> I may not be aware of a lot of things here, and it is very possible that I
> am assuming a lot of things incorrectly.
> Please feel free to correct me.
>
>
> Thanks,
> Harish
>

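For what it's worth, here is a rough sketch of the "one SQL query" idea
Harish describes, using Python's built-in sqlite3 module. The table and
column names are assumptions made up for illustration and only loosely
modelled on a task-instance table; this is not Airflow's actual schema or
code path. It also only updates rows that already exist, so given Vikas's
point above, records that have not been created yet would still need to be
inserted separately.

```python
import sqlite3


def mark_succeeded(conn, dag_id, start, end):
    """Mark every existing row for dag_id in [start, end] as 'success'.

    Uses one UPDATE statement instead of one statement per task and hour.
    Returns the number of rows that were updated.
    """
    cur = conn.execute(
        "UPDATE task_instance SET state = 'success' "
        "WHERE dag_id = ? AND execution_date >= ? AND execution_date <= ?",
        (dag_id, start, end),
    )
    conn.commit()
    return cur.rowcount


# Hypothetical, simplified table (NOT the real Airflow schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    "dag_id TEXT, task_id TEXT, execution_date TEXT, state TEXT)"
)
rows = [
    ("TEST_DAG", "t1", "2016-11-01T00:00:00", None),
    ("TEST_DAG", "t1", "2016-11-01T01:00:00", "failed"),
    ("TEST_DAG", "t1", "2016-12-02T00:00:00", None),   # outside the range
    ("OTHER_DAG", "t1", "2016-11-01T00:00:00", None),  # different DAG
]
conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?)", rows)

# ISO-8601 timestamps stored as text compare correctly as plain strings.
updated = mark_succeeded(
    conn, "TEST_DAG", "2016-11-01T00:00:00", "2016-12-01T00:00:00"
)
```

The range filter is inclusive on both ends, which matches how I read the
-s/-e dates in the example command, but that is my assumption too.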