o-nikolas commented on code in PR #53990:
URL: https://github.com/apache/airflow/pull/53990#discussion_r2246275473


##########
providers/amazon/src/airflow/providers/amazon/aws/executors/aws_lambda/lambda_executor.py:
##########
@@ -425,17 +426,27 @@ def process_queue(self, queue_url: str):
                         "Successful Lambda invocation for task %s received from SQS queue.", task_key
                     )
                 else:
-                    # In this case the Lambda likely started but failed at run time since we got a non-zero
-                    # return code. We could consider retrying these tasks within the executor, because this _likely_
-                    # means the Airflow task did not run to completion, however we can't be sure (maybe the
-                    # lambda runtime code has a bug and is returning a non-zero when it actually passed?). So
-                    # perhaps not retrying is the safest option.
                     self.fail(task_key)
-                    self.log.error(
-                        "Lambda invocation for task: %s has failed to run with return code %s",
-                        task_key,
-                        return_code,
-                    )
+                    if queue_url == self.dlq_url and return_code is None:
+                        # DLQ failure: AWS Lambda service could not complete the invocation after retries.
+                        # This indicates a Lambda-level failure (timeout, memory limit, crash, etc.)
+                        # where the function was unable to successfully execute to return a result.
+                        self.log.error(
+                            "Lambda invocation for task: %s failed at the service level (from DLQ). Command: %s",
+                            task_key,
+                            command,
+                        )
+                    else:
+                        # In this case the Lambda likely started but failed at run time since we got a non-zero
+                        # return code. We could consider retrying these tasks within the executor, because this _likely_
+                        # means the Airflow task did not run to completion, however we can't be sure (maybe the
+                        # lambda runtime code has a bug and is returning a non-zero when it actually passed?). So
+                        # perhaps not retrying is the safest option.
+                        self.log.error(
+                            "Lambda invocation for task: %s has failed to run with return code %s",
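
The branching in the hunk above can be sketched in isolation. This is a minimal, hedged stand-in, not the executor's real API: `DLQ_URL`, `handle_result`, and `failed_tasks` are hypothetical names, and `failed_tasks.append` stands in for `self.fail(task_key)`.

```python
# Sketch of the DLQ-vs-runtime failure classification from the diff.
# All names here are stand-ins, not the Lambda executor's actual interface.
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("lambda_executor_sketch")

DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/lambda-dlq"  # hypothetical

failed_tasks = []


def handle_result(queue_url, task_key, return_code, command):
    """Classify a failed Lambda invocation the way the diff does."""
    failed_tasks.append(task_key)  # stands in for self.fail(task_key)
    if queue_url == DLQ_URL and return_code is None:
        # Service-level failure: the function never produced a result
        # (timeout, memory limit, crash), so the message landed on the DLQ.
        log.error("Task %s failed at the service level (from DLQ). Command: %s", task_key, command)
        return "service_failure"
    # Runtime failure: the function ran but returned a non-zero code.
    log.error("Task %s failed with return code %s", task_key, return_code)
    return "runtime_failure"
```

Note that the task is marked failed in both branches; only the log message differs, which is what the review comment below is weighing.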

Review Comment:
   Yeah, that's fair, but in that situation the task still gets marked as failed, and that shows up in the Airflow UI, scheduler logs, etc. So users will be well aware it failed and can investigate the cause. I'm just not sure the Lambda executor _also_ needs to raise alarms for those failures; it feels redundant and would mostly just add churn to the logs.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
