potiuk commented on a change in pull request #10933:
URL: https://github.com/apache/airflow/pull/10933#discussion_r488078939
##########
File path: airflow/utils/orm_event_handlers.py
##########
@@ -52,6 +55,40 @@ def set_mysql_timezone(dbapi_connection, connection_record):
cursor.execute("SET time_zone = '+00:00'")
cursor.close()
+        is_mariadb = engine.dialect.server_version_info and "MariaDB" in engine.dialect.server_version_info
+        if idle_session_timeout > 0:
+            if is_mariadb:
+                # https://mariadb.com/kb/en/mariadb-1032-release-notes/#variables
+                if engine.dialect.server_version_info > (10, 3):
+                    setting = "idle_write_transaction_timeout"
+                else:
+                    setting = "idle_readwrite_transaction_timeout"
+
+                @event.listens_for(engine, "connect")
+                def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
+                    cursor = dbapi_connection.cursor()
+                    cursor.execute("SET @@SESSION.%s = %s", (setting, int(idle_session_timeout),))
+                    cursor.close()
+            else:
+                # MySQL doesn't have the equivalent of postgres' idle_in_transaction_session_timeout -- the
+                # best we can do is set the timeout for any idle connection
+                @event.listens_for(engine, "connect")
+                def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
Review comment:
I think the only way to properly test it is to engage our users with some test/staging systems where they can simulate and test with production-like loads. I think the nature of this makes it super difficult to test. The problem is that this is low-level behaviour that we have no way to control from the "higher-layer" logic, and it depends on factors that are pretty much random: how many transactions we run in parallel, whether some other thread was using a connection recently and returned it to the pool, and how much load the database is under in general (i.e. how responsive the DB server is).
I can easily imagine scenarios where - for example - some workers get periodic bursts of timed-out connections across all their threads when other workers start to do a lot of database transactions - because those workers will **just** cross the timeout limit and all of them start failing at once. Or a periodic failure of random tasks without any reason we could track or mitigate. Those are the most difficult things to track down.
I do not have a very concrete scenario that could be easily simulated - it's more from intuition and from experiencing similar problems in the telecom industry years ago. We can try to mitigate it, for example, by adding some randomness around the timeout values and other things. But I think there is a good reason why the default timeout is 8 hrs and not, say, 2 minutes. That makes me a little worried.
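Just to illustrate what I mean by randomness around the timeout values - a minimal sketch (the helper name and the 10% jitter fraction are just illustrative, not a concrete proposal):

```python
import random


def jittered_timeout(base_timeout: int, jitter_fraction: float = 0.1) -> int:
    """Return base_timeout with up to +/- jitter_fraction of random jitter.

    Spreading the per-connection timeouts out a bit means pooled
    connections do not all cross the timeout limit (and start failing)
    at exactly the same moment.
    """
    jitter = base_timeout * jitter_fraction
    return int(base_timeout + random.uniform(-jitter, jitter))
```

Each connection would then get a slightly different session timeout instead of all of them expiring in lockstep.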
We might ask our customers to try this out once the whole thing is merged (including the HA Scheduler). But during the process I would like to understand how critical it is to have this timeout. Maybe we can try some other approaches if all we want to protect against is stale locks. It seems a bit risky to me to potentially change the behaviour of all transactions in order to protect against something that will happen very rarely.
For example, I could easily imagine running a periodic monitor that checks for stale locks and selectively kills only those sessions. That might call for a solution which is much more database-specific, but it could be a less risky, more efficient approach.
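A rough sketch of the monitor idea (the function name is hypothetical, and the `innodb_trx` query in the comment is just one possible MySQL-specific source of the data - not a concrete design):

```python
def stale_transaction_ids(trx_rows, now_epoch, max_idle_seconds):
    """Return the thread ids of transactions open longer than max_idle_seconds.

    trx_rows is an iterable of (mysql_thread_id, trx_started_epoch) tuples,
    e.g. gathered periodically with something like:
        SELECT trx_mysql_thread_id, UNIX_TIMESTAMP(trx_started)
        FROM information_schema.innodb_trx
    The monitor would then selectively kill only the offending sessions
    (e.g. issuing KILL <thread_id>) instead of imposing a session-wide
    timeout on every connection.
    """
    return [
        thread_id
        for thread_id, started_epoch in trx_rows
        if now_epoch - started_epoch > max_idle_seconds
    ]
```

The key difference from the session-timeout approach is that only the sessions actually holding a transaction open too long are affected; well-behaved connections are untouched.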
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]