o-nikolas commented on code in PR #53990:
URL: https://github.com/apache/airflow/pull/53990#discussion_r2246178113


##########
providers/amazon/src/airflow/providers/amazon/aws/executors/aws_lambda/lambda_executor.py:
##########
@@ -425,17 +426,27 @@ def process_queue(self, queue_url: str):
                         "Successful Lambda invocation for task %s received from SQS queue.", task_key
                     )
                 else:
-                    # In this case the Lambda likely started but failed at run time since we got a non-zero
-                    # return code. We could consider retrying these tasks within the executor, because this _likely_
-                    # means the Airflow task did not run to completion, however we can't be sure (maybe the
-                    # lambda runtime code has a bug and is returning a non-zero when it actually passed?). So
-                    # perhaps not retrying is the safest option.
                     self.fail(task_key)
-                    self.log.error(
-                        "Lambda invocation for task: %s has failed to run with return code %s",
-                        task_key,
-                        return_code,
-                    )
+                    if queue_url == self.dlq_url and return_code == None:
+                        # DLQ failure: AWS Lambda service could not complete the invocation after retries.
+                        # This indicates a Lambda-level failure (timeout, memory limit, crash, etc.)
+                        # where the function was unable to successfully execute to return a result.
+                        self.log.error(
+                            "Lambda invocation for task: %s failed at the service level (from DLQ). Command: %s",

Review Comment:
   The command is pretty large in the AF3 case, so I'm not sure we should log it (it doesn't really have helpful information either). I also think you have some very helpful stuff in the comment that might be nice to include in the error message itself to give the user some context and pointers?
   
   What about:
   
   ```suggestion
                               "DLQ message received: Lambda invocation for task: %s was unable to successfully execute. This likely indicates a Lambda-level failure (timeout, memory limit, crash, etc.).",
   ```



##########
providers/amazon/src/airflow/providers/amazon/aws/executors/aws_lambda/lambda_executor.py:
##########
@@ -425,17 +426,27 @@ def process_queue(self, queue_url: str):
                         "Successful Lambda invocation for task %s received from SQS queue.", task_key
                     )
                 else:
-                    # In this case the Lambda likely started but failed at run time since we got a non-zero
-                    # return code. We could consider retrying these tasks within the executor, because this _likely_
-                    # means the Airflow task did not run to completion, however we can't be sure (maybe the
-                    # lambda runtime code has a bug and is returning a non-zero when it actually passed?). So
-                    # perhaps not retrying is the safest option.
                     self.fail(task_key)
-                    self.log.error(
-                        "Lambda invocation for task: %s has failed to run with return code %s",
-                        task_key,
-                        return_code,
-                    )
+                    if queue_url == self.dlq_url and return_code == None:
+                        # DLQ failure: AWS Lambda service could not complete the invocation after retries.
+                        # This indicates a Lambda-level failure (timeout, memory limit, crash, etc.)
+                        # where the function was unable to successfully execute to return a result.
+                        self.log.error(
+                            "Lambda invocation for task: %s failed at the service level (from DLQ). Command: %s",
+                            task_key,
+                            command,
+                        )
+                    else:
+                        # In this case the Lambda likely started but failed at run time since we got a non-zero
+                        # return code. We could consider retrying these tasks within the executor, because this _likely_
+                        # means the Airflow task did not run to completion, however we can't be sure (maybe the
+                        # lambda runtime code has a bug and is returning a non-zero when it actually passed?). So
+                        # perhaps not retrying is the safest option.
+                        self.log.error(
+                            "Lambda invocation for task: %s has failed to run with return code %s",

Review Comment:
   We should also probably differentiate this one a bit more as well.
   
   ```suggestion
                               "Lambda invocation for task: %s completed but the underlying Airflow task has returned a non-zero exit code %s",
   ```
   
   I'm also wondering if this branch should be log.debug? It's very common for Airflow tasks to fail for a myriad of reasons, and we don't necessarily need an error log in the scheduler logs for each one of those. Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
