wjddn279 opened a new issue, #56879:
URL: https://github.com/apache/airflow/issues/56879

   ### Apache Airflow version
   
   3.1.0
   
   ### If "Other Airflow 2/3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   I deployed Airflow in a Kubernetes environment and observed that the dag-processor was restarting irregularly. Checking the error logs, I found the following (full log attached: [full_log.txt](https://github.com/user-attachments/files/22513572/full_log.txt)):
   
   ```
   File "/home/airflow/.local/lib/python3.11/site-packages/MySQLdb/connections.py", line 280, in query
       _mysql.connection.query(self, query)
   sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (2013, 'Lost connection to server during query')
   [SQL: SELECT dag_priority_parsing_request.id, dag_priority_parsing_request.bundle_name, dag_priority_parsing_request.relative_fileloc
   FROM dag_priority_parsing_request
   WHERE dag_priority_parsing_request.bundle_name IN (%s)]
   [parameters: ('dags-folder',)]
   (Background on this error at: https://sqlalche.me/e/14/e3q8)
   [2025-09-24T17:01:10.882+0900] {settings.py:494} DEBUG - Disposing DB connection pool (PID 7)
   ```
   
   ### What you think should happen instead?
   
   The `_get_priority_files()` function executes a query, during which the MySQL connection is unexpectedly closed. This raises an exception and causes the dag-processor to exit.
   I also identified other exceptions. While those do not lead to termination (since they are covered by retry logic and try/except blocks), they appear to be caused by the same underlying issue: sudden termination of MySQL connections.
   
   By reviewing the queries arriving at MySQL at the times of the errors, I confirmed that the connection was indeed being closed with a `Quit` command.
   
   <img width="2124" height="276" alt="Image" src="https://github.com/user-attachments/assets/21d42381-0ca0-4659-9996-ead07be49df4" />
   
   ### How to reproduce
   
   - deploy airflow 3+
   - backend: mysql
   
   ### Operating System
   
   k8s 
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   - k8s deployment
   - mysql 8.0+ version
   - official helm chart
   
   
   ### Anything else?
   
   ### Why is the Quit signal being sent?
   
   In conclusion, the root cause is that the forked subprocess recreates the existing engine pool object, as performed in [airflow/settings.py#L426-L436](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/settings.py#L426-L436):
   
   1. The subprocess recreates the engine's pool when it starts.
   2. The pool connection objects inherited from the parent process lose their references.
   3. They are garbage-collected, which closes the connections originally established by the parent (after `fork`, the child's file descriptors point at the same sockets, so the close terminates the server-side sessions the parent is still using).
   4. The parent process, unaware that its pool connections have been closed, attempts a query and encounters the error.
   
   To verify this, I added the following code to observe when the initially 
established connections from the parent process are garbage-collected:
   ```
   import logging
   import os
   import weakref
   from datetime import datetime

   from sqlalchemy import event

   log = logging.getLogger(__name__)

   @event.listens_for(engine, "connect")
   def log_connection_finalizers(dbapi_connection, connection_record):
       log.debug(f"[connect] New DB connection established, pid={os.getpid()}")

       # Print when the raw DBAPI connection and SQLAlchemy's pool record are
       # garbage-collected, and in which process (PID).
       weakref.finalize(dbapi_connection, lambda: print(
           f"{datetime.now().isoformat()} dbapi_connection finalized via weakref in os {os.getpid()}"))
       weakref.finalize(connection_record, lambda: print(
           f"{datetime.now().isoformat()} connection_record finalized via weakref in os {os.getpid()}"))
   ```
   The following logs were observed. Their timestamps exactly matched the time when the `Quit` command appeared in the MySQL query log, and the fact that they fired in PID 417 indicates that the pool copied from the parent process was garbage-collected in the child process:
   ```
   2025-09-22T13:41:30.352393 connection_record finalized via weakref in os 417
   2025-09-22T13:41:30.352403 dbapi_connection finalized via weakref in os 417
   ```
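   
   The same mechanism can be reproduced outside Airflow. Below is a minimal, hypothetical sketch (not Airflow code; host and credentials are placeholders) that forks while holding a mysqlclient connection and lets the child drop its inherited copy:
   
   ```
   import gc
   import os

   import MySQLdb

   # Placeholder connection parameters; any reachable MySQL server will do.
   conn = MySQLdb.connect(host="localhost", user="airflow",
                          passwd="airflow", db="airflow")

   pid = os.fork()
   if pid == 0:
       # Child: mimic the pool recreation by dropping the only reference to
       # the inherited connection. mysqlclient's dealloc then calls
       # mysql_close(), which sends COM_QUIT on the socket it shares with
       # the parent, so the server terminates the session.
       del conn
       gc.collect()  # redundant here (no cycle); Airflow's pool objects
                     # presumably sit in cycles, so there the cyclic
                     # collector has to run first
       os._exit(0)

   os.waitpid(pid, 0)
   cur = conn.cursor()
   cur.execute("SELECT 1")  # expected to fail with (2013, 'Lost connection to server during query')
   ```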
   
   ### Why does this issue occur only with MySQL and not with PostgreSQL?
   Based on testing, both backends see the inherited connections garbage-collected in the subprocess under the current code.
   However, in MySQL's case the driver explicitly sends a COM_QUIT command when a connection object is garbage-collected, as shown in [mysqlclient/_mysql.c#L2233-L2243](https://github.com/PyMySQL/mysqlclient/blob/main/src/MySQLdb/_mysql.c#L2233-L2243).
   
   In contrast, PostgreSQL does not appear to close the connection itself; it seems to only close the socket without sending a termination command. (This is the behavior we want.)
   
   Additionally, if the garbage-collection threshold is not reached (such as in small-scale DAG parsing scenarios), garbage collection does not occur, and in such cases the issue does not happen even with MySQL.
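   
   As for a fix, one direction worth evaluating is the pattern SQLAlchemy documents for multiprocessing: instead of recreating the pool outright, have the child replace it via `Engine.dispose(close=False)` (available since SQLAlchemy 1.4.33), which swaps in a fresh pool without calling `close()` on the parent's connections. A hedged sketch with a placeholder engine URL follows; whether this also keeps mysqlclient's dealloc from sending COM_QUIT for the abandoned connection objects would still need to be verified:
   
   ```
   import os

   from sqlalchemy import create_engine

   # Placeholder URL; in Airflow this would be the configured sql_alchemy_conn.
   engine = create_engine("mysql+mysqldb://user:pass@localhost/airflow")

   def _replace_pool_in_child():
       # close=False (SQLAlchemy >= 1.4.33) discards the inherited pool
       # without touching the parent's connections, per SQLAlchemy's
       # "Using Connection Pools with Multiprocessing or os.fork()" docs.
       engine.dispose(close=False)

   os.register_at_fork(after_in_child=_replace_pool_in_child)
   ```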
   
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

