jvegreg commented on issue #48001:
URL: https://github.com/apache/airflow/issues/48001#issuecomment-2747479219

   This sits on the fine line between a bug, an edge-case oversight, and a 
feature request. See below for a more detailed explanation with an example.
   
   This is our retries configuration (it comes from our Terraform 
configuration, so it may look a bit odd to you):
   ```terraform
   botocore_config = {
       retries = {
           mode = "standard"
           max_attempts = 8
           total_max_attempts = 8
       }
   }
   ```
   
   First, take a look at the log of one of the failures we are facing:
   
   ```
   [2025-03-24, 06:00:40 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 763, in _execute_task
       result = _execute_callable(context=context, **execute_callable_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
       return ExecutionCallableRunner(
              ^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py", line 252, in run
       return self.func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
       return func(self, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 523, in execute
       self._start_task()
     File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/amazon/aws/operators/ecs.py", line 627, in _start_task
       raise EcsOperatorError(failures, response)
   airflow.providers.amazon.aws.exceptions.EcsOperatorError: {'tasks': [], 'failures': [{'reason': 'Capacity is unavailable at this time. Please try again later or in a different availability zone'}], 'ResponseMetadata': {'RequestId': 'aa8dc53f-298b-40cb-9ab2-e1666dc3c674', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'aa8dc53f-298b-40cb-9ab2-e1666dc3c674', 'content-type': 'application/x-amz-json-1.1', 'content-length': '135', 'date': 'Mon, 24 Mar 2025 06:00:40 GMT'}, 'RetryAttempts': 0}}
   ```
   
   The task fails to start because of a lack of available capacity: `Capacity 
is unavailable at this time. Please try again later or in a different 
availability zone`. We think Airflow should retry this automatically; the 
error even says `Please try again later`, after all.
   
   But when the `EcsOperatorError` is processed, this failure `reason` is not 
covered by the `should_retry` method:
   
   ```python
   def should_retry(exception: Exception):
       """Check if exception is related to ECS resource quota (CPU, MEM)."""
       if isinstance(exception, EcsOperatorError):
           return any(
               quota_reason in failure["reason"]
               for quota_reason in ["RESOURCE:MEMORY", "RESOURCE:CPU"]
               for failure in exception.failures
           )
       return False
   ```
   We think this is the reason Airflow is not automatically retrying the task 
as it should. We are not familiar enough with the Airflow code to be sure 
whether this is the case or whether the failure to retry has another cause.
   
   Do you think something like this would fix it?
   ```python
   def should_retry(exception: Exception):
       """Check if exception is related to ECS resource quota (CPU, MEM) or unavailable capacity."""
       if isinstance(exception, EcsOperatorError):
           return any(
               quota_reason in failure["reason"]
               for quota_reason in [
                   "RESOURCE:MEMORY",
                   "RESOURCE:CPU",
                   "Capacity is unavailable at this time",
               ]
               for failure in exception.failures
           )
       return False
   ```
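
As a quick sanity check, the proposed matching can be exercised outside Airflow. This is only a minimal sketch: `FakeEcsOperatorError` is a hypothetical stand-in for the real `EcsOperatorError` (assumed here to expose a `failures` list, as the snippet above implies), and `RETRYABLE_REASONS` is a name introduced just for illustration. It says nothing about whether the operator actually applies `should_retry` to this code path.

```python
class FakeEcsOperatorError(Exception):
    """Hypothetical stand-in for EcsOperatorError; assumed to expose `.failures`."""

    def __init__(self, failures, message):
        self.failures = failures
        self.message = message
        super().__init__(message)


# Illustrative constant: the two existing quota reasons plus the capacity message.
RETRYABLE_REASONS = (
    "RESOURCE:MEMORY",
    "RESOURCE:CPU",
    "Capacity is unavailable at this time",
)


def should_retry(exception: Exception) -> bool:
    """Return True if any failure reason contains a known retryable substring."""
    if isinstance(exception, FakeEcsOperatorError):
        return any(
            reason in failure.get("reason", "")
            for failure in exception.failures
            for reason in RETRYABLE_REASONS
        )
    return False


# Failure payload copied from the log above.
capacity_error = FakeEcsOperatorError(
    [
        {
            "reason": "Capacity is unavailable at this time. "
            "Please try again later or in a different availability zone"
        }
    ],
    {},
)
# A made-up reason that should not trigger a retry.
unrelated_error = FakeEcsOperatorError([{"reason": "SOME:OTHER:REASON"}], {})

print(should_retry(capacity_error))
print(should_retry(unrelated_error))
print(should_retry(ValueError("not an ECS error")))
```

With this change the capacity failure is treated as retryable, while unrelated failure reasons and other exception types are still rejected.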


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.