[ https://issues.apache.org/jira/browse/AIRFLOW-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Nichols updated AIRFLOW-5820:
----------------------------------
    Description: 
I am new to Airflow. I made a simple task in a trivial DAG. It takes 0.004 
seconds to fill the DagBag, and the task takes only 3 seconds to run.

Max concurrency must be set to 1 since my task hits a public API with a rate 
limit in effect.

I set it up to backfill 3 years of data, so I need to run the task ~1000 times 
in sequence. This should take ~3000 seconds.

Unfortunately, Airflow spends 3 seconds running the task, and then waits around 
40 seconds before starting the next day of the backfill. So more than 90% of 
the time is Airflow spinning, and the job takes more than 10x longer than 
required.

I think there should be a way to make backfill jobs run quickly, one after 
another, in this very simple case I have described. There is simply not 40 
seconds' worth of necessary compute to do between tasks. 

  was:
I am new to Airflow. I made a simple task in a trivial DAG. It takes 0.004 
seconds to fill the DagBag, and the task takes only 3 seconds to run.

Max concurrency must be set to 1 since my task hits a public API with a rate 
limit in effect.

I set it up to backfill 3 years of data; so I need to run the task ~1000 times 
in sequence. This should take ~3000 seconds.

Unfortunately, Airflow spends 3 seconds running the task, and then waits around 
40 seconds before starting the next day of the backfill. So more than 90% of 
the time is Airflow spinning, and the job takes ~10x longer than required.

I think there should be a way to make backfill jobs run quickly, one after 
another, in this very simple case I have described. There is simply not 40 
seconds worth of necessary compute to do between tasks. 


> Long delay between individual tasks in a large backfill
> -------------------------------------------------------
>
>                 Key: AIRFLOW-5820
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5820
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: backfill
>    Affects Versions: 1.10.5
>         Environment: Ubuntu 18
>            Reporter: Eric Nichols
>            Priority: Major
>
> I am new to Airflow. I made a simple task in a trivial DAG. It takes 0.004 
> seconds to fill the DagBag, and the task takes only 3 seconds to run.
> Max concurrency must be set to 1 since my task hits a public API with a rate 
> limit in effect.
> I set it up to backfill 3 years of data, so I need to run the task ~1000 
> times in sequence. This should take ~3000 seconds.
> Unfortunately, Airflow spends 3 seconds running the task, and then waits 
> around 40 seconds before starting the next day of the backfill. So more than 
> 90% of the time is Airflow spinning, and the job takes more than 10x longer 
> than required.
> I think there should be a way to make backfill jobs run quickly, one after 
> another, in this very simple case I have described. There is simply not 40 
> seconds' worth of necessary compute to do between tasks. 
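
The timing figures in the description can be checked with some quick arithmetic. The sketch below uses only the rounded numbers from the report (about 1000 runs, 3 seconds per task, roughly 40 seconds of idle time before each next run). The variable names are illustrative and are not part of Airflow.

```python
# Back-of-the-envelope check of the timings reported in the issue.
# All inputs are the rounded figures from the description, not measurements.
runs = 1000        # approximate number of daily runs in a 3-year backfill
task_seconds = 3   # reported run time of a single task
gap_seconds = 40   # reported idle time before the next run starts

ideal_total = runs * task_seconds                    # no idle time between runs
observed_total = runs * (task_seconds + gap_seconds)
idle_fraction = gap_seconds / (task_seconds + gap_seconds)
slowdown = observed_total / ideal_total

print(f"ideal:    {ideal_total} s")
print(f"observed: {observed_total} s (about {observed_total / 3600:.1f} hours)")
print(f"idle:     {idle_fraction:.0%} of wall-clock time")
print(f"slowdown: {slowdown:.1f}x")
```

With these rounded inputs the backfill takes about 43000 seconds (roughly 12 hours) instead of 3000, which is consistent with the report of more than 90% idle time and a slowdown of more than 10x.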



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
