jmaicher opened a new issue, #40882:
URL: https://github.com/apache/airflow/issues/40882

   ### Apache Airflow version
   
   2.9.3
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   We run airflow with an HA postgres deployment via 
[patroni](https://github.com/patroni/patroni) on k8s. In response to temporary 
network problems, for example, there may be a brief failover to the postgres 
follower, which is in read-only mode. When this happens, write transactions, 
typically heartbeat updates, fail for a brief period with:
   
   > sqlalchemy.exc.InternalError: (psycopg2.errors.ReadOnlySqlTransaction) 
cannot execute UPDATE in a read-only transaction
   
   During such events, which typically resolve within a few seconds at the 
infrastructure level, we observe scheduler crashes and also failing tasks, 
primarily because heartbeat updates fail. Interestingly, the job do not fail 
due to the above error, but instead with:
   
   >  Failed to execute job [...] for task [...] 
((psycopg2.errors.InFailedSqlTransaction) current transaction is aborted, 
commands ignored until end of transaction block)
   
   We traced the problem to the 
[`retry_db_transaction`](https://github.com/apache/airflow/blob/main/airflow/utils/retries.py#L59)
 decorator, which issues retries for `DBAPIError` (and `OperationalError`, 
which is already covered by the former), but only issues a rollback in response 
to `OperationalError`.
   
   The `ReadOnlySqlTransaction` error in our case is wrapped in SQLAlchemy's 
`InternalError`, for which we retry, but do not rollback the transaction. So 
all retries and eventually the whole operation will fail with 
`InFailedSqlTransaction`, regardless of whether the original problem resolved 
at the infrastructure level or not.
   
   ### What you think should happen instead?
   
   The 
[`retry_db_transaction`](https://github.com/apache/airflow/blob/main/airflow/utils/retries.py#L59)
 should recover correctly from these issues.
   
   ### How to reproduce
   
   Start airflow with postgres (e.g. via breeze) and briefly toggle 
[`default_transaction_read_only`](https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-DEFAULT-TRANSACTION-READ-ONLY):
   ```
   postgres=# ALTER SYSTEM SET default_transaction_read_only TO on;
   postgres=# SELECT pg_reload_conf();
   
   postgres=# ALTER SYSTEM SET default_transaction_read_only TO off;
   postgres=# SELECT pg_reload_conf();
   ```
   
   ### Operating System
   
   debian 12 (bookworm)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Custom HA postgres deployment via 
[patroni](https://github.com/patroni/patroni)
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to