0x26res opened a new issue, #37120:
URL: https://github.com/apache/airflow/issues/37120
### Apache Airflow Provider(s)
amazon
### Versions of Apache Airflow Providers
I use Airflow to orchestrate AWS Batch jobs. Since AWS Batch is doing the
heavy lifting, and to save resources on Airflow, I was using smart sensors
(in 2.4.3). It looks like this:
```python
with TaskGroup(group_id="job_abc") as group:
    job = BatchOperator(
        task_id="submit_job_abc",
        job_name="job_abc",
        max_retries=0,
        wait_for_completion=False,
    )
    BatchSensor(
        task_id="wait_for_job_abc",
        job_id=job.output,  # type: ignore
        mode="reschedule",
    )
```
Please note that I set `wait_for_completion=False` on the `BatchOperator`
so it only submits the job (fire and forget). This means I can also set
`max_retries=0`, as submitting a job will only fail if there's an issue
validating the job definition.
In the `BatchSensor` I set `max_retries` to 5, which is the default. When the
`BatchSensor` pokes/polls for job completion and the job is being submitted,
starting, or running, it doesn't count that poke as a failed attempt.
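For context, here is a minimal sketch (hypothetical code, not the actual provider implementation) of why a reschedule-mode sensor doesn't burn retries on intermediate states: `poke()` returning `False` simply reschedules the task, and only a raised exception counts against the task's retries.

```python
# Hypothetical sketch of reschedule-mode poke semantics; the real
# BatchSensor logic lives in the amazon provider and differs in detail.
INTERMEDIATE_STATES = {"SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"}


def poke(job_status: str) -> bool:
    """Return True when the job is done; raise only on terminal failure."""
    if job_status == "SUCCEEDED":
        return True
    if job_status in INTERMEDIATE_STATES:
        # Not done yet: the scheduler reschedules the task;
        # no failed attempt is recorded.
        return False
    # Terminal failure (e.g. FAILED): raising is what consumes a retry.
    raise RuntimeError(f"Batch job ended in state {job_status}")
```

This is the behaviour I relied on in 2.4.3: only a genuinely failed job (or an error while checking its status) consumed one of the sensor's retries.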
I'm in the process of upgrading to 2.7.2, where smart sensors are no longer
supported and I should use deferrable operators instead. So I set
`deferrable=True` on the `BatchSensor`:
```python
with TaskGroup(group_id="job_abc") as group:
    job = BatchOperator(
        task_id="submit_job_abc",
        job_name="job_abc",
        max_retries=0,
        wait_for_completion=False,
    )
    BatchSensor(
        task_id="wait_for_job_abc",
        job_id=job.output,  # type: ignore
        mode="reschedule",
        deferrable=True,
    )
```
I've noticed that the interpretation of `max_retries` for the `BatchSensor`
has changed: if the job is in the RUNNABLE, STARTING, or RUNNING state, each
poll counts as a failed attempt:
```
[2024-01-31, 14:20:57 UTC] {waiter_with_logging.py:129} INFO - Batch job
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['RUNNABLE']
[2024-01-31, 14:21:02 UTC] {waiter_with_logging.py:129} INFO - Batch job
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:07 UTC] {waiter_with_logging.py:129} INFO - Batch job
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:12 UTC] {waiter_with_logging.py:129} INFO - Batch job
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
[2024-01-31, 14:21:17 UTC] {waiter_with_logging.py:129} INFO - Batch job
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
airflow.exceptions.AirflowException: Waiter error: max attempts reached
```
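A quick back-of-the-envelope check of the log above shows how fast the deferrable trigger gives up: the waiter polls every `poll_interval` seconds and raises after `max_retries` attempts, regardless of why the job isn't finished yet.

```python
# Assumed values from this report: max_retries defaults to 5, and the
# trigger polls every 5 seconds (matching the ~5s gaps in the log above).
max_retries = 5
poll_interval = 5  # seconds

total_wait = max_retries * poll_interval
print(f"waiter gives up after roughly {total_wait} seconds")
```

Roughly 25 seconds of patience is far shorter than the runtime of almost any real AWS Batch job, so the sensor fails even though nothing is wrong.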
So the previous version did not treat RUNNABLE/STARTING/RUNNING jobs as a
failed attempt; it only counted an attempt as failed if the underlying job
failed or if there was a transient/transport failure when checking the job
status. The new deferrable version counts any poll where the job is not yet
complete as a failure, even though the job itself hasn't failed. In light of
this behaviour change, should `max_retries` be set to 4200 and `poll_interval`
to 30 (from 5), as was done for the
[BatchOperator](https://github.com/apache/airflow/pull/33045)?
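If that is the intended direction, the sensor call would presumably become something like the fragment below (an untested sketch, assuming the deferrable `BatchSensor` exposes the same `max_retries`/`poll_interval` knobs that the `BatchOperator` gained in that PR):

```python
BatchSensor(
    task_id="wait_for_job_abc",
    job_id=job.output,  # type: ignore
    deferrable=True,
    # Assumed knobs, mirroring the BatchOperator change in PR #33045:
    max_retries=4200,   # 4200 attempts * 30s = ~35 hours of polling
    poll_interval=30,   # seconds between polls, instead of 5
)
```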
### Apache Airflow version
2.7.2
### Operating System
linux
### Deployment
Amazon (AWS) MWAA
### Deployment details
_No response_
### What happened
_No response_
### What you think should happen instead
_No response_
### How to reproduce
_No response_
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)