[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darren Weber reassigned AIRFLOW-5218:
-------------------------------------

    Assignee: Darren Weber

> AWS Batch Operator - status polling too often, esp. for high concurrency
> ------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5218
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: aws, contrib
>    Affects Versions: 1.10.4
>            Reporter: Darren Weber
>            Assignee: Darren Weber
>            Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. Until that feature is 
> available, the operator always falls back to its backoff routine for the 
> status checks on the batch job (described as exponential, although the delay 
> in the formula below grows quadratically). Unfortunately, when the 
> concurrency of Airflow jobs is very high (hundreds of tasks), this fallback 
> polling hits the AWS Batch API too hard; the API throttles the requests and 
> returns an error, which fails the Airflow task simply because the status is 
> polled too frequently.
> Check the output from the retry algorithm: over the first 10 retries, the 
> status of an AWS Batch job is checked at a rate of approx. 1 check/sec. When 
> an Airflow instance is running tens or hundreds of concurrent batch jobs, 
> this hits the API too frequently and fails the Airflow task (it also keeps 
> a worker occupied with busy work).
> {code:python}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1600000000000001,
>  1.25,
>  1.36,
>  1.4900000000000002,
>  1.6400000000000001,
>  1.81,
>  2.0,
>  2.21,
>  2.4400000000000004,
>  2.6900000000000004,
>  2.9600000000000004,
>  3.25,
>  3.5600000000000005,
>  3.8900000000000006,
>  4.24,
>  4.61]{code}
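
To put a number on the rate described above, the delays can be summed and scaled by the number of concurrent tasks. This is only a rough sketch: `poll_delay` and `aggregate_rate` are hypothetical names, the formula is the one quoted in the listing, and the estimate assumes all tasks start polling at about the same time.

```python
def poll_delay(retries: int) -> float:
    """Seconds to wait before the next status check (formula from the listing)."""
    return 1 + pow(retries * 0.1, 2)


# Delays before each of the first 10 status checks of a single task.
delays = [poll_delay(r) for r in range(10)]

# Total time spent on those 10 checks, and the resulting per-task rate.
elapsed = sum(delays)
per_task_rate = len(delays) / elapsed


def aggregate_rate(concurrent_tasks: int) -> float:
    """Approximate status requests per second across all tasks.

    Assumption: every task is in its first 10 polls at the same time.
    """
    return concurrent_tasks * per_task_rate


if __name__ == "__main__":
    print(f"10 polls take about {elapsed:.2f} s per task")
    print(f"100 tasks issue about {aggregate_rate(100):.0f} requests/s")
```

Under these assumptions a single task polls a little under once per second, so the request rate grows roughly linearly with the number of concurrent tasks.
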
> Possible solutions include an initial sleep (say 60 sec?) right after 
> issuing the request, so that the batch job has some time to spin up. The job 
> progresses through several phases before it gets to the RUNNING state, and 
> polling separately for each phase of that sequence might help. Since batch 
> jobs tend to be long-running jobs (rather than near-real-time jobs), it 
> might help to issue less frequent polls once the job is in the RUNNING 
> state. Something on the order of tens of seconds might be reasonable for 
> batch jobs? Maybe the class could expose a parameter for the rate of polling 
> (or a callable)?
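
One way the suggestions above could fit together is sketched below. It is not the operator's actual code: `wait_for_job`, `next_delay`, and all parameter names and default values are hypothetical, and `get_status` stands in for whatever call returns the job's current status string. The sketch sleeps once before the first poll, polls less often once the job is RUNNING, adds random jitter so that concurrent tasks do not poll in lockstep, and accepts a callable for the polling rate.

```python
import random
import time
from typing import Callable

# AWS Batch job states in which the job has finished.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def next_delay(status: str,
               startup_interval: float = 10.0,
               running_interval: float = 30.0,
               jitter: float = 0.2) -> float:
    """Seconds to wait before the next poll (all defaults are assumptions).

    Polls less often once the job is RUNNING, and spreads the delay by
    +/- `jitter` so concurrent tasks drift apart.
    """
    base = running_interval if status == "RUNNING" else startup_interval
    return base * random.uniform(1 - jitter, 1 + jitter)


def wait_for_job(get_status: Callable[[], str],
                 initial_delay: float = 60.0,
                 delay_fn: Callable[[str], float] = next_delay,
                 sleep: Callable[[float], None] = time.sleep,
                 max_polls: int = 1000) -> str:
    """Poll `get_status` until the job reaches a terminal state."""
    # Give the job time to spin up before the first status check.
    sleep(initial_delay)
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(delay_fn(status))
    raise TimeoutError(f"job not finished after {max_polls} polls")


if __name__ == "__main__":
    # Simulated job; no real waiting, the sleep function only prints.
    fake = iter(["SUBMITTED", "RUNNABLE", "RUNNING", "SUCCEEDED"])
    print(wait_for_job(lambda: next(fake),
                       sleep=lambda s: print(f"sleep {s:.1f}s")))
```

Passing `sleep` and `delay_fn` as arguments keeps the loop testable without real waiting and lets a caller supply a custom polling schedule.
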



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
