[
https://issues.apache.org/jira/browse/AIRFLOW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on AIRFLOW-5889 started by Darren Weber.
---------------------------------------------
> AWS Batch Operator - API request limits should not fail a task
> --------------------------------------------------------------
>
> Key: AIRFLOW-5889
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5889
> Project: Apache Airflow
> Issue Type: Bug
> Components: aws, contrib
> Affects Versions: 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6
> Reporter: Darren Weber
> Assignee: Darren Weber
> Priority: Major
> Labels: AWS, aws-batch
> Fix For: 2.0.0, 1.10.7
>
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available;
> the pull request that would add it has gone unmerged for years, see
> - [https://github.com/boto/botocore/pull/1307]
> - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. In the meantime, the
> operator falls back to the exponential backoff routine for the status
> checks on the batch job. Unfortunately, when the concurrency of Airflow jobs
> is very high (hundreds of tasks), this fallback polling hits the AWS Batch
> API too hard and the AWS API throttle raises an error, which fails the
> Airflow task simply because the status is polled too frequently. Airflow
> then retries the task while the original batch job is still running, which
> results in duplicate batch jobs. An exception raised for an AWS API throttle
> limit should not fail the task; it should only pause the polling for job
> status and then retry the job status poll.
> This is an example of an API throttle exception:
> {code:java}
> An error occurred (TooManyRequestsException) when calling the DescribeJobs operation (reached max retries: 4): Too Many Requests
> {code}
> This exception should be handled while waiting for a job to complete; it must
> not result in a retry of the job.
> Reduced polling rates help
> (https://issues.apache.org/jira/browse/AIRFLOW-5218), but additional
> exception handling in the polling function is required. Within the exception
> handling code, a random pause on the polling routine could help to alleviate
> the API throttle limits. Maybe the class could expose a parameter for the
> rate of polling (or a callable)?
> Another consideration is the possible use of something like the sensor-poke
> approach, with rescheduling, so that the polling process does not occupy a
> worker for the full duration of a batch job, e.g.
> - [https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
> If a rescheduling approach is adopted, the same API throttle considerations
> apply.
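
The throttle handling proposed in the description could be sketched roughly
as below. This is only an illustration under stated assumptions, not the
operator's actual code: `poll_job_status`, `is_throttle_error`, `get_status`
and the delay parameters are hypothetical names, and the throttle error is
assumed to surface as an exception carrying a botocore-style `response` dict
with the `TooManyRequestsException` code shown in the error above.

```python
import random
import time

# Error codes treated as throttling. "TooManyRequestsException" comes from the
# error message quoted in the issue; other codes could be added (assumption).
THROTTLE_CODES = {"TooManyRequestsException"}

# Terminal AWS Batch job states.
TERMINAL_STATES = ("SUCCEEDED", "FAILED")


def is_throttle_error(exc):
    """Return True if exc carries a botocore-style throttle error code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") in THROTTLE_CODES


def poll_job_status(get_status, max_polls=100, base_delay=1.0, max_delay=60.0,
                    sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal state.

    get_status is a hypothetical callable that wraps the DescribeJobs call
    and returns the job status string. A throttle error pauses the polling
    and retries the poll; any other exception is re-raised unchanged.
    """
    throttled = 0
    for _ in range(max_polls):
        try:
            status = get_status()
        except Exception as exc:
            if not is_throttle_error(exc):
                raise
            throttled += 1
            # Random pause that grows with consecutive throttle errors.
            ceiling = min(max_delay, base_delay * 2 ** throttled)
            sleep(random.uniform(base_delay, ceiling))
            continue
        throttled = 0
        if status in TERMINAL_STATES:
            return status
        sleep(base_delay)
    raise TimeoutError("job did not reach a terminal state")
```

The `base_delay` and `max_delay` arguments stand in for the polling-rate
parameter suggested in the description; passing `sleep` as an argument keeps
the pause replaceable, for example in tests.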
--
This message was sent by Atlassian Jira
(v8.3.4#803005)