[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907749#comment-16907749
 ] 

Darren Weber edited comment on AIRFLOW-5218 at 8/15/19 2:04 AM:
----------------------------------------------------------------

Even bumping the backoff factor from `0.1` to `0.3` might help, e.g.
{code:python}
from datetime import datetime
from time import sleep

for retries in range(10):
    pause = 1 + pow(retries * 0.3, 2)
    print(f"{datetime.now()}: retry ({retries:04d}) sleeping for {pause:6.2f} sec")
    sleep(pause)

2019-08-14 19:02:58.745923: retry (0000) sleeping for 1.00 sec
2019-08-14 19:02:59.747635: retry (0001) sleeping for 1.09 sec
2019-08-14 19:03:00.840129: retry (0002) sleeping for 1.36 sec
2019-08-14 19:03:02.202734: retry (0003) sleeping for 1.81 sec
2019-08-14 19:03:04.015686: retry (0004) sleeping for 2.44 sec
2019-08-14 19:03:06.458972: retry (0005) sleeping for 3.25 sec
2019-08-14 19:03:09.713452: retry (0006) sleeping for 4.24 sec
2019-08-14 19:03:13.954253: retry (0007) sleeping for 5.41 sec
2019-08-14 19:03:19.368445: retry (0008) sleeping for 6.76 sec
2019-08-14 19:03:26.135600: retry (0009) sleeping for 8.29 sec

{code}
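
To put a number on the difference, here is a minimal sketch that sums the delays produced by the same `1 + pow(retries * factor, 2)` formula for both factors. The helper name `backoff_delays` is made up for illustration and is not part of the operator.

```python
def backoff_delays(factor, retries=10):
    """Delays in seconds from the 1 + (retries * factor)^2 formula above."""
    return [1 + pow(n * factor, 2) for n in range(retries)]


for factor in (0.1, 0.3):
    delays = backoff_delays(factor)
    # Total time spent sleeping across the first 10 status polls.
    print(f"factor={factor}: {len(delays)} polls over {sum(delays):.2f} sec")
```

With a factor of `0.3` the first 10 polls are spread over roughly 36 seconds, compared with roughly 13 seconds for `0.1`.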


was (Author: dazza):
Even bumping the backoff factor from `0.1` to `0.3` might help, e.g.
{code}
from datetime import datetime
from time import sleep

In [18]: for i in [1 + pow(retries * 0.3, 2) for retries in range(10)]: 
    ...:     print(f"{datetime.now()}: sleeping for {i}") 
    ...:     sleep(i) 
    ...:
2019-08-14 18:52:01.688705: sleeping for 1.0
2019-08-14 18:52:02.690385: sleeping for 1.09
2019-08-14 18:52:03.781384: sleeping for 1.3599999999999999
2019-08-14 18:52:05.144492: sleeping for 1.8099999999999998
2019-08-14 18:52:06.956547: sleeping for 2.44
2019-08-14 18:52:09.401454: sleeping for 3.25
2019-08-14 18:52:12.652212: sleeping for 4.239999999999999
2019-08-14 18:52:16.897060: sleeping for 5.41
2019-08-14 18:52:22.313692: sleeping for 6.76
2019-08-14 18:52:29.082087: sleeping for 8.29
{code}

> AWS Batch Operator - status polling too often, esp. for high concurrency
> ------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5218
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: aws, contrib
>    Affects Versions: 1.10.4
>            Reporter: Darren Weber
>            Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, the 
> fallback is the exponential backoff routine for the status checks on the 
> batch job. Unfortunately, when the concurrency of Airflow jobs is very high 
> (hundreds of tasks), this fallback polling hits the AWS Batch API too hard 
> and the AWS API throttle throws an error, which fails the Airflow task 
> simply because the status is polled too frequently.
> Check the output from the retry algorithm below: within the first 10 
> retries, the status of an AWS Batch job is checked at a rate of 
> approximately 1 poll/sec. When an Airflow instance is running tens or 
> hundreds of concurrent batch jobs, this hits the API too frequently and 
> crashes the Airflow task (plus it keeps a worker occupied with busy work).
> {code:python}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1600000000000001,
>  1.25,
>  1.36,
>  1.4900000000000002,
>  1.6400000000000001,
>  1.81,
>  2.0,
>  2.21,
>  2.4400000000000004,
>  2.6900000000000004,
>  2.9600000000000004,
>  3.25,
>  3.5600000000000005,
>  3.8900000000000006,
>  4.24,
>  4.61]{code}
> Possible solutions are to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through several phases before it gets to the RUNNING 
> state, and polling for each phase of that sequence might help. Since batch 
> jobs tend to be long-running jobs (rather than near-real-time jobs), it 
> might help to issue less frequent polls when it's in the RUNNING state. 
> Something on the order of tens of seconds might be reasonable for batch 
> jobs? Maybe the class 
> could expose a parameter for the rate of polling (or a callable)?
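
The suggestion in the description above (an initial sleep, plus a polling rate that callers can override with a callable) might look roughly like the sketch below. Everything here is hypothetical: `poll_until_done`, `default_delay`, `get_status`, and the parameter names are illustrative and are not the operator's actual API. `get_status` stands in for whatever call returns the AWS Batch job status string.

```python
from time import sleep


def default_delay(retries):
    """Hypothetical default: the quadratic backoff with a 0.3 factor."""
    return 1 + pow(retries * 0.3, 2)


def poll_until_done(get_status, initial_delay=60.0, delay_fn=default_delay,
                    max_retries=100, sleep_fn=sleep):
    """Poll get_status() until the job reaches a terminal state.

    get_status: assumed callable returning an AWS Batch job status string.
    delay_fn:   callable mapping the retry count to a pause in seconds.
    sleep_fn:   injectable so the loop can be exercised without waiting.
    """
    # Give the batch job time to spin up before the first status check.
    sleep_fn(initial_delay)
    for retries in range(max_retries):
        status = get_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        sleep_fn(delay_fn(retries))
    raise TimeoutError(f"job still not finished after {max_retries} polls")
```

A caller that prefers a flat rate on the order of tens of seconds could pass something like `delay_fn=lambda retries: 30`.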



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
