0x26res opened a new issue, #37120:
URL: https://github.com/apache/airflow/issues/37120

   ### Apache Airflow Provider(s)
   
   amazon
   
   ### Versions of Apache Airflow Providers
   
   I use Airflow to orchestrate AWS Batch jobs. Since AWS Batch is doing the 
heavy lifting, and to save resources on Airflow, I was using smart sensors 
(in 2.4.3). It looks like this:
   
   ```python
           with TaskGroup(group_id="job_abc") as group:
               job = BatchOperator(
                   task_id="submit_job_abc",
                   job_name="job_abc",
                   max_retries=0,
                   wait_for_completion=False,
               )
               BatchSensor(
                   task_id="wait_for_job_abc",
                   job_id=job.output,  # type: ignore
                   mode="reschedule",
               )
   ```
   
   Please note that I set `wait_for_completion=False` on the `BatchOperator` 
so it only submits the job (fire and forget). This means I can also set 
`max_retries=0`, as submitting a job will only fail if there's an issue 
validating the job definition.
   
   In the BatchSensor, `max_retries` is 5, which is the default. When the 
BatchSensor pokes/polls for job completion, if the job is submitted, 
starting, or running, it doesn't count that as a failed attempt.
   
   
   
   I'm in the process of updating to 2.7.2, where smart sensors are no longer 
supported and I should use deferrable operators instead. So I set 
`BatchSensor.deferrable` to True:
   
   ```python
           with TaskGroup(group_id="job_abc") as group:
               job = BatchOperator(
                   task_id="submit_job_abc",
                   job_name="job_abc",
                   max_retries=0,
                   wait_for_completion=False,
               )
               BatchSensor(
                   task_id="wait_for_job_abc",
                   job_id=job.output,  # type: ignore
                   mode="reschedule",
                   deferrable=True,
               )
   ```
   
   I've noticed that the interpretation of `max_retries` for the BatchSensor 
has changed. For instance, it now treats a poll that finds the job in the 
RUNNABLE, STARTING or RUNNING state as a failed attempt:
   
   ```
   [2024-01-31, 14:20:57 UTC] {waiter_with_logging.py:129} INFO - Batch job 
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['RUNNABLE']
   [2024-01-31, 14:21:02 UTC] {waiter_with_logging.py:129} INFO - Batch job 
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
   [2024-01-31, 14:21:07 UTC] {waiter_with_logging.py:129} INFO - Batch job 
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
   [2024-01-31, 14:21:12 UTC] {waiter_with_logging.py:129} INFO - Batch job 
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
   [2024-01-31, 14:21:17 UTC] {waiter_with_logging.py:129} INFO - Batch job 
035c18e0-936a-4719-bfa5-cdd372c12a25 not ready yet: ['STARTING']
   airflow.exceptions.AirflowException: Waiter error: max attempts reached
   ```
   
   So the previous version would not count a RUNNABLE/STARTING/RUNNING job as 
a failed attempt; it would only count a failed attempt if the underlying job 
failed or if there was a transient / transport failure when checking the job 
status.
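   
To make the difference concrete, here is a minimal, self-contained sketch of the two counting rules as described in this issue. It is a model only, not the provider's code: the function names, the `MaxAttemptsReached` exception and the `"ERROR"` marker (standing in for a failed status check) are all hypothetical.

```python
from typing import Iterable

# Terminal AWS Batch job states; anything else is treated as "in progress".
TERMINAL = {"SUCCEEDED", "FAILED"}


class MaxAttemptsReached(Exception):
    """Hypothetical stand-in for 'Waiter error: max attempts reached'."""


def wait_counting_every_poll(states: Iterable[str], max_attempts: int) -> str:
    """Deferrable behaviour as observed: every non-terminal poll uses an attempt."""
    attempts = 0
    for state in states:
        if state in TERMINAL:
            return state
        attempts += 1
        if attempts >= max_attempts:
            raise MaxAttemptsReached(f"gave up after {attempts} polls")
    raise RuntimeError("state stream ended before a terminal state")


def wait_counting_only_errors(states: Iterable[str], max_attempts: int) -> str:
    """Previous behaviour as described: only failed status checks use an attempt."""
    errors = 0
    for state in states:
        if state in TERMINAL:
            return state
        if state == "ERROR":  # hypothetical marker for a transient failure
            errors += 1
            if errors >= max_attempts:
                raise MaxAttemptsReached(f"gave up after {errors} errors")
    raise RuntimeError("state stream ended before a terminal state")


# The five polls from the log, followed by an assumed continuation of the job.
polls = ["RUNNABLE", "STARTING", "STARTING", "STARTING", "STARTING",
         "RUNNING", "SUCCEEDED"]
```

With `max_attempts=5`, the first function raises after the fifth poll, matching the log above, while the second keeps polling until the job reaches a terminal state.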
   
   The new version with deferrable will count any poke where the job 
is not completed as a failure (even though the job hasn't failed). In light 
of this change of behaviour, should `max_retries` be set to 4200 and the 
poll_interval to 30 (from 5), as was done for the 
[BatchOperator](https://github.com/apache/airflow/pull/33045)?
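   
As a rough check on what these numbers mean, the sketch below computes the longest a job could stay unfinished before the wait gives up, assuming every poll of an unfinished job uses one attempt (the behaviour shown in the log above). The helper name `max_wait` is hypothetical.

```python
from datetime import timedelta


def max_wait(max_retries: int, poll_interval_seconds: int) -> timedelta:
    """Upper bound on the wait if each poll of an unfinished job uses one attempt."""
    return timedelta(seconds=max_retries * poll_interval_seconds)


current = max_wait(5, 5)        # current defaults: 5 attempts, 5 seconds apart
proposed = max_wait(4200, 30)   # values suggested above

print(current.total_seconds())           # 25.0 seconds
print(proposed.total_seconds() / 3600)   # 35.0 hours
```

Under that assumption, the current defaults give up on any job that takes longer than about 25 seconds to finish, while the suggested values allow about 35 hours.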
   
   ### Apache Airflow version
   
   2.7.2
   
   ### Operating System
   
   linux
   
   ### Deployment
   
   Amazon (AWS) MWAA
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   _No response_
   
   ### What you think should happen instead
   
   _No response_
   
   ### How to reproduce
   
   # 
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
