[
https://issues.apache.org/jira/browse/AIRFLOW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Darren Weber updated AIRFLOW-5889:
----------------------------------
Description:
The AWS Batch Operator attempts to use a boto3 feature that is not available
and has not been merged in years, see
- [https://github.com/boto/botocore/pull/1307]
- see also [https://github.com/broadinstitute/cromwell/issues/4303]
This is a curious case of premature optimization. In the meantime, the
fallback is the exponential backoff routine for the status checks on the
batch job. Unfortunately, when the concurrency of Airflow jobs is very high
(hundreds of tasks), this fallback polling hits the AWS Batch API too hard and
the AWS API throttle returns an error, which fails the Airflow task simply
because the status is polled too frequently. Airflow then retries the task
while the original batch job is still running, which results in duplicate
batch jobs. An exception raised for an AWS API throttle limit should not fail
the task; it should only pause the polling and then retry the job status
poll.
Reduced polling rates help
(https://issues.apache.org/jira/browse/AIRFLOW-5218), but additional exception
handling in the polling function is required. Within the exception handling
code, a random pause on the polling routine could help to alleviate the API
throttle limits. Maybe the class could expose a parameter for the rate of
polling (or a callable)?
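A minimal sketch of the exception handling described above, written without
any AWS dependency so it can run anywhere. The names `ThrottleError`,
`poll_job_status` and all parameters are hypothetical and not part of the
operator; with boto3 the throttle would presumably be detected by inspecting
the error code on botocore's `ClientError` instead.

```python
import random
import time


class ThrottleError(Exception):
    """Hypothetical stand-in for an AWS API throttling error."""


def poll_job_status(get_status, max_polls=100, base_delay=10.0,
                    max_jitter=30.0, sleep=time.sleep):
    """Poll until the job is SUCCEEDED or FAILED.

    A throttle error pauses the polling for a random interval and then
    retries the status check; it never fails the caller.
    """
    for _ in range(max_polls):
        try:
            status = get_status()
        except ThrottleError:
            # Random pause so concurrent tasks do not retry in lockstep.
            sleep(base_delay + random.uniform(0, max_jitter))
            continue
        if status in ("SUCCEEDED", "FAILED"):
            return status
        # Job still in progress: wait the regular interval and poll again.
        sleep(base_delay)
    raise TimeoutError("job did not reach a terminal state within max_polls")
```

Passing `sleep` and `get_status` as arguments is only a convenience for
testing; `base_delay` plays the role of the polling-rate parameter suggested
above.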
Another consideration is possible use of something like the sensor-poke
approach, with rescheduling, so that the polling process does not occupy a
worker for the full duration of a batch job, e.g.
-
[https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
If a rescheduling approach is adopted, similar API throttle considerations
apply.
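As a rough illustration of the poke-style check, the sketch below performs a
single status request and returns immediately, so a rescheduling sensor could
release its worker between checks. It deliberately does not import Airflow;
`poke`, `ThrottleError` and `BatchJobFailed` are hypothetical names, and
wiring this into a real sensor class is left open.

```python
class ThrottleError(Exception):
    """Hypothetical stand-in for an AWS API throttling error."""


class BatchJobFailed(Exception):
    """Hypothetical error raised when the batch job itself fails."""


def poke(get_status):
    """Check the job once.

    Returns True when the job has succeeded and False when it should be
    checked again later.
    """
    try:
        status = get_status()
    except ThrottleError:
        # A throttled check is not a task failure; try again on the next poke.
        return False
    if status == "FAILED":
        raise BatchJobFailed("batch job reported FAILED")
    return status == "SUCCEEDED"
```

The throttle case is treated the same as "not done yet", which is one way to
carry the throttle considerations over to the rescheduling approach.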
was:
The AWS Batch Operator attempts to use a boto3 feature that is not available
and has not been merged in years, see
- [https://github.com/boto/botocore/pull/1307]
- see also [https://github.com/broadinstitute/cromwell/issues/4303]
This is a curious case of premature optimization. In the meantime, the
fallback is the exponential backoff routine for the status checks on the
batch job. Unfortunately, when the concurrency of Airflow jobs is very high
(hundreds of tasks), this fallback polling hits the AWS Batch API too hard and
the AWS API throttle returns an error, which fails the Airflow task simply
because the status is polled too frequently.
Check the output from the retry algorithm: within the first 10 retries, the
status of an AWS batch job is checked about 10 times, at a rate of
approximately 1 retry/sec. When an Airflow instance is running tens or
hundreds of concurrent batch jobs, this hits the API too frequently and
crashes the Airflow task (and it occupies a worker with too much busy work).
{code:python}
In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)]
Out[4]:
[1.0,
1.01,
1.04,
1.09,
1.1600000000000001,
1.25,
1.36,
1.4900000000000002,
1.6400000000000001,
1.81,
2.0,
2.21,
2.4400000000000004,
2.6900000000000004,
2.9600000000000004,
3.25,
3.5600000000000005,
3.8900000000000006,
4.24,
4.61]{code}
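To make the load concrete, the delays above can be summed to estimate the
aggregate request rate. This is a back-of-the-envelope sketch; it assumes
that every task starts polling at the same time and that each poll is exactly
one API request, and the function names are hypothetical.

```python
def poll_delays(n_polls):
    """Delays in seconds between polls, using the formula quoted above."""
    return [1 + pow(retries * 0.1, 2) for retries in range(n_polls)]


def aggregate_request_rate(n_tasks, n_polls=10):
    """Average requests per second across n_tasks over the first n_polls."""
    elapsed = sum(poll_delays(n_polls))
    return n_tasks * n_polls / elapsed


# The first 10 polls of one task take about 12.85 seconds in total.
print(round(sum(poll_delays(10)), 2))
# 100 concurrent tasks average roughly 78 requests per second.
print(round(aggregate_request_rate(100), 1))
```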
One possible solution is to introduce an initial sleep (say 60 sec?) right
after issuing the request, so that the batch job has some time to spin up. The
job progresses through several phases before it gets to the RUNNING state, and
polling for each phase of that sequence might help. Since batch jobs tend to
be long-running jobs (rather than near-real-time jobs), it might help to issue
less frequent polls when the job is in the RUNNING state. Something on the
order of tens of seconds might be reasonable for batch jobs? Maybe the class
could expose a parameter for the rate of polling (or a callable)?
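One way to picture the "callable" idea is a function that maps the last known
job status to the next polling delay. The sketch below is only an
illustration: `next_poll_delay` and its parameters are hypothetical, the 60
second initial wait comes from the suggestion above, and the other defaults
are arbitrary guesses.

```python
import random


def next_poll_delay(status, initial_wait=60.0, startup_interval=10.0,
                    running_interval=30.0, jitter=5.0):
    """Return the number of seconds to wait before the next status poll.

    status is None right after the job is submitted; otherwise it is the
    last status reported for the job (e.g. "RUNNABLE" or "RUNNING").
    """
    if status is None:
        base = initial_wait        # give the job time to spin up
    elif status == "RUNNING":
        base = running_interval    # long-running phase: poll less often
    else:
        base = startup_interval    # phases before RUNNING
    # Small random offset so many tasks do not poll at the same instant.
    return base + random.uniform(0, jitter)
```

An operator could accept such a callable as an argument, so that users tune
the polling rate without changing the operator itself.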
Another option is to use something like the sensor-poke approach, with
rescheduling, e.g.
-
[https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
> AWS Batch Operator - API request limits should not fail a task
> --------------------------------------------------------------
>
> Key: AIRFLOW-5889
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5889
> Project: Apache Airflow
> Issue Type: Improvement
> Components: aws, contrib
> Affects Versions: 1.10.4
> Reporter: Darren Weber
> Assignee: Darren Weber
> Priority: Major
> Fix For: 2.0.0, 1.10.6