[ 
https://issues.apache.org/jira/browse/AIRFLOW-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

belgacea updated AIRFLOW-3335:
------------------------------
    Description: 
I'm using Airflow to schedule Spark jobs and I wanted to be able to `backfill`
a large time range (to catch up dags that are far behind their schedules). I
used the `backfill` command with the `mark_success` argument and I thought all
the dagruns would be marked as successful in a second, but Airflow seems to
mark dagruns one by one (with some parallelization, using the
`parallelism`/`dag_concurrency` configuration). Each dagrun takes approximately
2 seconds to be marked as successful, and this makes the backfill process
really slow for a large time range (or for small `time intervals`).
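
To give a rough idea of the scale, here is a back-of-the-envelope sketch of how
the cost adds up. It only uses the ~2 seconds per dagrun mentioned above; the
function names, the example date range, the concurrency value of 16, and the
assumption that time grows linearly with the number of dagruns (with both ends
of the range included) are illustrative and not taken from Airflow itself.

```python
import math
from datetime import datetime, timedelta


def count_dag_runs(start_date, end_date, interval):
    """Count schedule points in [start_date, end_date], both ends included (assumption)."""
    if end_date < start_date:
        return 0
    # timedelta // timedelta yields an int number of whole intervals.
    return (end_date - start_date) // interval + 1


def estimate_mark_success_seconds(start_date, end_date, interval,
                                  seconds_per_run=2.0, concurrent_runs=1):
    """Rough estimate: dagruns are marked in batches of `concurrent_runs`.

    seconds_per_run=2.0 is the figure observed in this report;
    concurrent_runs is a hypothetical stand-in for the effective parallelism.
    """
    runs = count_dag_runs(start_date, end_date, interval)
    batches = math.ceil(runs / concurrent_runs)
    return batches * seconds_per_run


if __name__ == "__main__":
    start = datetime(2018, 1, 1)
    end = datetime(2018, 12, 31, 23)
    hourly = timedelta(hours=1)
    print(count_dag_runs(start, end, hourly))
    print(estimate_mark_success_seconds(start, end, hourly))
    print(estimate_mark_success_seconds(start, end, hourly, concurrent_runs=16))
```

Under these assumptions, one year of hourly dagruns is 8760 runs, which would
take several hours when marked one at a time and still many minutes with 16
dagruns processed concurrently.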

Is there a way to speed up `mark_success` backfilling? And is there a way to
tell the Airflow scheduler to backfill dags with a single instance per task
covering the specified backfill time range (`start_date` + `end_date`) and
then mark all dagruns within the time range as successful?

Note: The dag I tried to backfill does not use `depends_on_past`.

  was:
I'm using Airflow to schedule Spark jobs and I wanted to be able to `backfill`
a large time range (to catch up dags that are far behind their schedules). I
used the `backfill` command with the `mark_success` argument and I was thinking
that all dagruns would be marked as succeeded in a second, but Airflow seems to
mark dagruns one by one (with some parallelization, using the
`parallelism`/`dag_concurrency` configuration). Each dagrun takes approximately
2 seconds to be marked as successful, and this makes the backfill process
really slow for a large time range (or for small `time intervals`).

Is there a way to speed up `mark_success` backfilling? And is there a way to
tell the Airflow scheduler to backfill dags with a single instance per task
covering the specified backfill time range (`start_date` + `end_date`) and
then mark all dagruns within the time range as successful?

Note: The dag I tried to backfill does not use `depends_on_past`.


> Bulk backfill & faster mark_success
> -----------------------------------
>
>                 Key: AIRFLOW-3335
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3335
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: backfill
>            Reporter: belgacea
>            Priority: Major
>              Labels: features, performance
>
> I'm using Airflow to schedule Spark jobs and I wanted to be able to
> `backfill` a large time range (to catch up dags that are far behind their
> schedules). I used the `backfill` command with the `mark_success` argument
> and I thought all the dagruns would be marked as successful in a second, but
> Airflow seems to mark dagruns one by one (with some parallelization, using
> the `parallelism`/`dag_concurrency` configuration). Each dagrun takes
> approximately 2 seconds to be marked as successful, and this makes the
> backfill process really slow for a large time range (or for small
> `time intervals`).
> Is there a way to speed up `mark_success` backfilling? And is there a way to
> tell the Airflow scheduler to backfill dags with a single instance per task
> covering the specified backfill time range (`start_date` + `end_date`) and
> then mark all dagruns within the time range as successful?
> Note: The dag I tried to backfill does not use `depends_on_past`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
