potiuk commented on a change in pull request #10933:
URL: https://github.com/apache/airflow/pull/10933#discussion_r488078939
##########
File path: airflow/utils/orm_event_handlers.py
##########
@@ -52,6 +55,40 @@ def set_mysql_timezone(dbapi_connection, connection_record):
cursor.execute("SET time_zone = '+00:00'")
cursor.close()
+        is_mariadb = engine.dialect.server_version_info and "MariaDB" in engine.dialect.server_version_info
+        if idle_session_timeout > 0:
+            if is_mariadb:
+                # https://mariadb.com/kb/en/mariadb-1032-release-notes/#variables
+                if engine.dialect.server_version_info > (10, 3):
+                    setting = "idle_write_transaction_timeout"
+                else:
+                    setting = "idle_readwrite_transaction_timeout"
+
+                @event.listens_for(engine, "connect")
+                def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
+                    cursor = dbapi_connection.cursor()
+                    cursor.execute("SET @@SESSION.%s = %s", (setting, int(idle_session_timeout),))
+                    cursor.close()
+            else:
+                # MySQL doesn't have the equivalent of postgres' idle_in_transaction_session_timeout -- the
+                # best we can do is set the timeout for any idle connection
+                @event.listens_for(engine, "connect")
+                def set_mysql_transaction_idle_timeout(dbapi_connection, connection_record):
Review comment:
I think the only way to properly test it is to engage our users with some test/staging systems where they can simulate and test with production-like loads. I think the nature of this makes it super difficult to test. The problem is that this is low-level behaviour that we have no way to control from the "higher-layer" logic, and it depends on factors that are pretty much random: how many transactions we run in parallel, whether some other thread was using a connection recently and returned it to the pool, and how much load the database is under in general (i.e. how responsive the DB server is).
I can easily imagine scenarios where - for example - some workers get periodic bursts of timed-out connections across all their threads when other workers start to do a lot of database transactions - because those workers will **just** cross the timeout limit and all of them start failing at once. Or a periodic failure of random tasks without any reason we could track or mitigate. Those are the most difficult things to track down.
I do not have a very concrete scenario that could be easily simulated - it's more from intuition and from experiencing similar problems in the telecom industry years ago. We can try to mitigate it, for example, by adding some randomness around the timeout values and other things. But I think there is a good reason why the default timeout is 8 hrs and not, say, 2 minutes. That makes me a little worried.
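Just to illustrate what I mean by randomness around the timeout values - a minimal sketch (the helper name and the 10% jitter fraction are just illustrative, not a concrete proposal):

```python
import random


def jittered_timeout(base_timeout: int, jitter_fraction: float = 0.1) -> int:
    """Return base_timeout with up to +/- jitter_fraction of random jitter.

    Spreading the per-connection timeouts out a bit means pooled
    connections do not all cross the timeout limit (and start failing)
    at exactly the same moment.
    """
    jitter = base_timeout * jitter_fraction
    return int(base_timeout + random.uniform(-jitter, jitter))
```

Each connection would then get a slightly different session timeout instead of all of them expiring in lockstep.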
We might ask our customers to try this out once the whole thing is merged (including the HA Scheduler). But during the process I would like to understand how critical it is to have this timeout. Maybe we can try some other approaches if all we want to protect against is stale locks. It seems a bit risky to me to potentially change the behaviour of all transactions in order to protect against something that will happen very rarely.
For example, I could easily imagine running a periodic monitor that checks for stale locks and selectively kills only those sessions. That might call for a solution which is much more database-specific, but it could be a less risky, more efficient approach.
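A rough sketch of the monitor idea (the function name is hypothetical, and the `innodb_trx` query in the comment is just one possible MySQL-specific source of the data - not a concrete design):

```python
def stale_transaction_ids(trx_rows, now_epoch, max_idle_seconds):
    """Return the thread ids of transactions open longer than max_idle_seconds.

    trx_rows is an iterable of (mysql_thread_id, trx_started_epoch) tuples,
    e.g. gathered periodically with something like:
        SELECT trx_mysql_thread_id, UNIX_TIMESTAMP(trx_started)
        FROM information_schema.innodb_trx
    The monitor would then selectively kill only the offending sessions
    (e.g. issuing KILL <thread_id>) instead of imposing a session-wide
    timeout on every connection.
    """
    return [
        thread_id
        for thread_id, started_epoch in trx_rows
        if now_epoch - started_epoch > max_idle_seconds
    ]
```

The key difference from the session-timeout approach is that only the sessions actually holding a transaction open too long are affected; well-behaved connections are untouched.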
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]