[ https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907749#comment-16907749 ]

Darren Weber commented on AIRFLOW-5218:
---------------------------------------

Even bumping the backoff factor from {{0.1}} to {{0.3}} might help, e.g.
{code}
from datetime import datetime
from time import sleep

for retries in range(10):
    delay = 1 + pow(retries * 0.3, 2)
    print(f"{datetime.now()}: sleeping for {delay}")
    sleep(delay)
2019-08-14 18:52:01.688705: sleeping for 1.0
2019-08-14 18:52:02.690385: sleeping for 1.09
2019-08-14 18:52:03.781384: sleeping for 1.3599999999999999
2019-08-14 18:52:05.144492: sleeping for 1.8099999999999998
2019-08-14 18:52:06.956547: sleeping for 2.44
2019-08-14 18:52:09.401454: sleeping for 3.25
2019-08-14 18:52:12.652212: sleeping for 4.239999999999999
2019-08-14 18:52:16.897060: sleeping for 5.41
2019-08-14 18:52:22.313692: sleeping for 6.76
2019-08-14 18:52:29.082087: sleeping for 8.29
{code}
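The same quadratic backoff can be factored into a small helper. This is only a sketch; the function name {{batch_backoff_delay}} and the optional jitter are my own additions, not anything in the Airflow or botocore code. Jitter helps desynchronize hundreds of concurrent tasks so their status polls do not hit the API in lockstep:
{code:python}
import random


def batch_backoff_delay(tries: int, factor: float = 0.3, jitter: bool = True) -> float:
    """Return the sleep (in seconds) before the next status poll.

    Quadratic backoff: 1 + (tries * factor)^2, optionally with up to
    10% random jitter added so concurrent tasks spread out their polls.
    """
    delay = 1 + pow(tries * factor, 2)
    if jitter:
        delay += random.uniform(0, delay * 0.1)
    return delay
{code}
With {{factor=0.3}} and no jitter, the first 10 delays sum to roughly 36 seconds, versus about 13 seconds for the current {{factor=0.1}} over the same 10 retries.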

> AWS Batch Operator - status polling too often, esp. for high concurrency
> ------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5218
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: aws, contrib
>    Affects Versions: 1.10.4
>            Reporter: Darren Weber
>            Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years; see
> - https://github.com/boto/botocore/pull/1307
> - https://github.com/broadinstitute/cromwell/issues/4303
> This is a curious case of premature optimization.  In the meantime, the 
> fallback is the exponential backoff routine for the status checks on the 
> batch job.  Unfortunately, when the concurrency of Airflow tasks is very 
> high (hundreds of tasks), this fallback polling hits the AWS Batch API too 
> hard; the AWS API throttles it and throws an error, which fails the Airflow 
> task simply because the status was polled too frequently.
> Check the output from the retry algorithm, e.g. within the first 10 retries, 
> the status of an AWS Batch job is checked about 10 times at a rate of 
> approx. 1 retry/sec.  When an Airflow instance is running tens or hundreds 
> of concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it keeps a worker occupied with busy work).
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)]
> Out[4]: 
> [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1600000000000001,
>  1.25,
>  1.36,
>  1.4900000000000002,
>  1.6400000000000001,
>  1.81,
>  2.0,
>  2.21,
>  2.4400000000000004,
>  2.6900000000000004,
>  2.9600000000000004,
>  3.25,
>  3.5600000000000005,
>  3.8900000000000006,
>  4.24,
>  4.61]
> Possible solutions are to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up.  
> The job progresses through several phases before it reaches the RUNNING 
> state, and polling tuned to each phase of that sequence might help.  Since 
> batch jobs tend to be long-running (rather than near-real-time), it might 
> help to poll less frequently once the job is in the RUNNING state.  
> Something on the order of tens of seconds might be reasonable for batch 
> jobs?  Maybe the class could expose a parameter for the rate of polling (or 
> a callable)?
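The phased-polling idea quoted above could be sketched roughly as follows. Everything here is hypothetical: {{get_job_status}} stands in for a real {{boto3}} {{describe_jobs}} call, and the parameter names ({{initial_delay}}, {{running_poll_interval}}, {{spinup_poll_interval}}) are illustrative, not Airflow's actual API:
{code:python}
import time


def wait_for_batch_job(get_job_status,
                       initial_delay: float = 60.0,
                       running_poll_interval: float = 30.0,
                       spinup_poll_interval: float = 5.0) -> str:
    """Poll a batch job until it reaches a terminal state.

    get_job_status: zero-arg callable returning the job's current status
    string (e.g. "SUBMITTED", "RUNNABLE", "RUNNING", "SUCCEEDED", "FAILED").
    """
    # Give the job time to spin up before the first poll.
    time.sleep(initial_delay)
    while True:
        status = get_job_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        if status == "RUNNING":
            # Long-running phase: poll slowly to stay under API limits.
            time.sleep(running_poll_interval)
        else:
            # Spin-up phases: poll a bit faster to catch transitions.
            time.sleep(spinup_poll_interval)
{code}
Exposing the three intervals (or a callable computing the next delay) as operator parameters would let users tune the trade-off between status latency and API load.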



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)