jvegreg commented on issue #48001:
URL: https://github.com/apache/airflow/issues/48001#issuecomment-2747479219
This sits on the fine line between bug, edge-case oversight, and feature request.
See below for a more detailed explanation with an example.
This is our retries configuration (taken from our Terraform configuration, so
it may look a bit unusual):
```terraform
botocore_config = {
  retries = {
    mode               = "standard"
    max_attempts       = 8
    total_max_attempts = 8
  }
}
```
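For readers who don't use Terraform: as far as we can tell, the block above renders to the plain retries dictionary below, which is what ends up passed to botocore's `Config(retries=...)`.

```python
# Plain-Python equivalent of the Terraform retries block above.
# "standard" mode makes botocore retry throttled and transient
# transport/5xx errors with backoff, up to the attempt limit.
botocore_retries = {
    "mode": "standard",
    "max_attempts": 8,
    # total_max_attempts also counts the initial call and takes
    # precedence over max_attempts when both are set.
    "total_max_attempts": 8,
}

print(botocore_retries["mode"])  # standard
```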
First, take a look at the log of one of the failures we are facing:
```
[2025-03-24, 06:00:40 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 763, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 523, in execute
    self._start_task()
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 627, in _start_task
    raise EcsOperatorError(failures, response)
airflow.providers.amazon.aws.exceptions.EcsOperatorError: {'tasks': [], 'failures': [{'reason': 'Capacity is unavailable at this time. Please try again later or in a different availability zone'}], 'ResponseMetadata': {'RequestId': 'aa8dc53f-298b-40cb-9ab2-e1666dc3c674', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'aa8dc53f-298b-40cb-9ab2-e1666dc3c674', 'content-type': 'application/x-amz-json-1.1', 'content-length': '135', 'date': 'Mon, 24 Mar 2025 06:00:40 GMT'}, 'RetryAttempts': 0}}
```
The task is failing to start due to a lack of available capacity: `Capacity is
unavailable at this time. Please try again later or in a different availability
zone`. This is something we think Airflow should retry automatically; the
error even says `Please try again later`, after all. Note that the response shows
`'HTTPStatusCode': 200` and `'RetryAttempts': 0`: the API call itself succeeded
and the failure is only reported in the response body, so the botocore retries
configured above never kick in. Any retry therefore has to happen at the
Airflow level. But when the `EcsOperatorError` is processed, this failure
`reason` is not covered by the `should_retry` method:
```python
def should_retry(exception: Exception):
    """Check if exception is related to ECS resource quota (CPU, MEM)."""
    if isinstance(exception, EcsOperatorError):
        return any(
            quota_reason in failure["reason"]
            for quota_reason in ["RESOURCE:MEMORY", "RESOURCE:CPU"]
            for failure in exception.failures
        )
    return False
```
We think this is the reason Airflow is not automatically retrying the task as
it should. We are not familiar enough with the Airflow code base to be sure
whether this is the actual cause, or whether the failure to retry stems from
something else.
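To make the mismatch concrete, here is a minimal self-contained sketch of the current behaviour, using a stand-in `EcsOperatorError` class so it runs without the Amazon provider installed:

```python
# Stand-in for airflow.providers.amazon.aws.exceptions.EcsOperatorError,
# so this sketch does not require the Amazon provider.
class EcsOperatorError(Exception):
    def __init__(self, failures, response):
        super().__init__(str(failures))
        self.failures = failures
        self.response = response


def should_retry(exception: Exception):
    """Current behaviour: only quota-related failure reasons are retried."""
    if isinstance(exception, EcsOperatorError):
        return any(
            quota_reason in failure["reason"]
            for quota_reason in ["RESOURCE:MEMORY", "RESOURCE:CPU"]
            for failure in exception.failures
        )
    return False


quota_error = EcsOperatorError([{"reason": "RESOURCE:MEMORY"}], {})
capacity_error = EcsOperatorError(
    [{"reason": "Capacity is unavailable at this time. Please try again "
                "later or in a different availability zone"}],
    {},
)

print(should_retry(quota_error))     # True: quota failures are retried
print(should_retry(capacity_error))  # False: capacity failures are not
```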
Do you think something like the following would fix it?
```python
def should_retry(exception: Exception):
    """Check if exception is related to ECS resource quota (CPU, MEM) or capacity."""
    if isinstance(exception, EcsOperatorError):
        return any(
            quota_reason in failure["reason"]
            for quota_reason in [
                "RESOURCE:MEMORY",
                "RESOURCE:CPU",
                "Capacity is unavailable at this time",
            ]
            for failure in exception.failures
        )
    return False
```
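Running the proposed version against the same failure payload shows the capacity reason now matching (again a self-contained sketch with a stand-in exception class, not the provider code itself):

```python
# Stand-in for the provider's EcsOperatorError, for illustration only.
class EcsOperatorError(Exception):
    def __init__(self, failures, response):
        super().__init__(str(failures))
        self.failures = failures
        self.response = response


def should_retry(exception: Exception):
    """Proposed behaviour: capacity shortages are also treated as retryable."""
    if isinstance(exception, EcsOperatorError):
        return any(
            quota_reason in failure["reason"]
            for quota_reason in [
                "RESOURCE:MEMORY",
                "RESOURCE:CPU",
                "Capacity is unavailable at this time",
            ]
            for failure in exception.failures
        )
    return False


capacity_error = EcsOperatorError(
    [{"reason": "Capacity is unavailable at this time. Please try again "
                "later or in a different availability zone"}],
    {},
)

print(should_retry(capacity_error))  # True: the capacity failure now retries
```

One caveat: this matches a substring of a human-readable message, which would silently stop working if AWS ever rewords it, so matching a shorter stable prefix such as `Capacity is unavailable` might be more robust.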
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]