Shlomit-B opened a new pull request, #52769:
URL: https://github.com/apache/airflow/pull/52769
### Summary
Adds full support for AWS Systems Manager (SSM) Run Command in Apache
Airflow, including:
- SsmRunCommandOperator — sends commands to resources via SSM
- SsmRunCommandCompletedSensor — waits for command completion using
list_command_invocations.
- Checks the status of all target resources for the given command
invocation.
- It waits until all resources have completed the command successfully.
- If any resource reports a failure state, the sensor will fail
immediately, providing faster and clearer feedback on execution issues.
- SsmRunCommandTrigger — enables deferrable execution
- SsmCommandWaiter — internal waiter for command completion logic
- Unit tests for each component: operator, sensor, trigger, and waiter
- System test that exercises the full flow on a live EC2 instance
### Also included
- Refactored an EC2 task function into ec2_utils.py for reuse between tests
- Added iam.py utility file under system tests with create_iam_role helper
function
### Sensor design choices and rationale
When implementing SsmRunCommandCompletedSensor, I considered several
approaches for handling command status across multiple target resources:
- Checking only one resource’s status was dismissed early on. In most cases,
users expect the sensor to reflect the overall success of the command. If even
one instance fails, it's likely the user wouldn’t want downstream tasks to
continue.
- (Chosen) Fail immediately if any resource fails: This approach provides
faster feedback and aligns with the principle of failing fast. It prevents
downstream tasks from executing when even a single resource didn’t succeed,
which reflects real-world expectations in multi-instance workflows.
This logic improves reliability and better aligns with common use cases for
SSM Run Command in production environments.
### Open question for reviewers
Would it make sense to make the sensor behavior configurable?
For example, by adding a parameter like fail_on_any_failure=True to let
users choose whether to fail immediately or wait for all command invocations to
complete before deciding.
Currently, the sensor always fails as soon as one resource reports failure.
This seems like the safer default, but I’m open to making it
user-configurable if that would better serve broader use cases.
closes: #42619
---
Read the **[Pull Request
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
for more information.
In case of fundamental code changes, an Airflow Improvement Proposal
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
is needed.
In case of a new dependency, check compliance with the [ASF 3rd Party
License Policy](https://www.apache.org/legal/resolved.html#category-x).
In case of backwards incompatible changes please leave a note in a
newsfragment file, named `{pr_number}.significant.rst` or
`{issue_number}.significant.rst`, in
[airflow-core/newsfragments](https://github.com/apache/airflow/tree/main/airflow-core/newsfragments).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]