[ 
https://issues.apache.org/jira/browse/AIRFLOW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on AIRFLOW-5889 started by Darren Weber.
---------------------------------------------
> AWS Batch Operator - API request limits should not fail a task
> --------------------------------------------------------------
>
>                 Key: AIRFLOW-5889
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5889
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: aws, contrib
>    Affects Versions: 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6
>            Reporter: Darren Weber
>            Assignee: Darren Weber
>            Priority: Major
>              Labels: AWS, aws-batch
>             Fix For: 2.0.0, 1.10.7
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available; 
> the pull request for it has not been merged in years. See
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, the 
> operator falls back to the exponential backoff routine for the status 
> checks on the batch job. Unfortunately, when the concurrency of Airflow jobs 
> is very high (hundreds of tasks), this fallback polling hits the AWS Batch API 
> too hard and the AWS API throttle raises an error. That error fails the 
> Airflow task simply because the status is polled too frequently. Airflow 
> then retries the task while the original batch job is still running, which 
> results in duplicate batch jobs. An exception raised for an AWS API 
> throttle limit should not fail the task; it should only pause the polling 
> and then retry the job status poll.
> This is an example of an API throttle exception:
> {code:java}
> An error occurred (TooManyRequestsException) when calling the DescribeJobs 
> operation
> (reached max retries: 4): Too Many Requests
> {code}
> This exception should be handled while waiting for a job to complete; it must 
> not result in a job retry.
> Reduced polling rates help 
> (https://issues.apache.org/jira/browse/AIRFLOW-5218), but additional 
> exception handling in the polling function is required. Within the exception 
> handling code, a random pause in the polling routine could help to alleviate 
> the API throttle limits. Maybe the class could expose a parameter for the 
> rate of polling (or a callable)?
> Another consideration is the possible use of something like the sensor-poke 
> approach, with rescheduling, so that the polling process does not occupy a 
> worker for the full duration of a batch job, e.g.
>  - [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
> If a rescheduling approach is adopted, similar API throttle considerations 
> apply.
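
The exception handling proposed in the description could look roughly like the sketch below. It is not the operator's actual code: `poll_job_status`, `ThrottleError`, `get_status`, `is_throttle`, `delay` and `sleep` are hypothetical names chosen for illustration, and the 1 to 10 second random pause is an arbitrary assumption. The idea is only that a throttle error pauses the poll and tries again, while any other error still propagates and fails the task.

```python
import random
import time


class ThrottleError(Exception):
    """Hypothetical stand-in for an AWS API throttle error."""


def poll_job_status(get_status, is_throttle, max_polls=100,
                    delay=lambda: random.uniform(1.0, 10.0),
                    sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal state.

    get_status  -- callable returning the current job status string
    is_throttle -- callable deciding whether an exception is a throttle error
    delay       -- callable returning the pause (seconds) between polls,
                   so the polling rate is configurable by the caller
    """
    terminal = {"SUCCEEDED", "FAILED"}  # terminal AWS Batch job states
    for _ in range(max_polls):
        try:
            status = get_status()
        except Exception as err:
            if not is_throttle(err):
                raise  # genuine errors still fail the task
            sleep(delay())  # throttled: pause, then poll again
            continue
        if status in terminal:
            return status
        sleep(delay())  # job still in progress: wait before the next poll
    raise TimeoutError("job did not reach a terminal state within max_polls")
```

With boto3, `is_throttle` might check that the exception is a `botocore.exceptions.ClientError` whose `response["Error"]["Code"]` equals `"TooManyRequestsException"`, which matches the error shown in the description. Passing `delay` as a callable is one way to expose the polling rate as a parameter.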



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
