potiuk commented on issue #27100:
URL: https://github.com/apache/airflow/issues/27100#issuecomment-1289468566

   @karakanb  can you please check your history/monitoring to see whether any 
of the components of Airflow (including pgbouncer) restarted around the time 
when it happened? If so, can you please describe the restart events that you 
saw? I am particularly interested in whether there was any scheduler restart. 
Did you attempt to restart the scheduler manually to fix the problem?
   
   From the logs you can see that there are multiple dag file processor 
"fatal" errors but no scheduler restart caused by the outage.
   
   > sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to 
server at "my-db-instance.b.db.ondigitalocean.com" (10.110.0.17), port 25061 
failed: FATAL:  pgbouncer cannot connect to server
   
   I think that one needs looking at @dstandish @ephraimbuddy @ashb @uranusjr  
- I saw other people reporting similar issues when there is a temporary problem 
with the database, and my gut feeling tells me that this is the classic "zombie 
db application" problem - where the application kind of works and keeps its 
connections, but some of the transactions got "completed" status and the 
application "thinks" that the transaction was successful, while the database 
failure prevented it from actually flushing the changes to the disk.
   
   Of course we cannot do much about it on the DB side and in running Airflow, 
but I'd say we should crash the scheduler hard whenever any of its processes or 
subprocesses gets a "FATAL" error like that. Airflow has a built-in mechanism 
to reconcile its state whenever it gets restarted, and if the database has 
problems, it will fail to restart (and will be restarted until the DB is back). 
So if my guess is right, just restarting the scheduler should have eventually 
fixed the problem.
   
   If my guess is right - we can of course tell users to restart the scheduler 
in such cases, but this kind of error might go unnoticed by the user, so it 
would be much better if we detected such fatal errors and simply crashed the 
scheduler when they happen. That would make Airflow self-healing after such 
catastrophic events.
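   
   To make the idea concrete, here is a minimal sketch of "crash on fatal DB 
error". It is not Airflow code: `FATAL_MARKERS`, `is_fatal_db_error` and 
`run_or_crash` are hypothetical names used only for illustration, and matching 
on the "FATAL:" text in the error message is an assumption based on the log 
line quoted above. The real fix would need to hook into the scheduler and its 
subprocesses.

```python
# Hypothetical sketch, not Airflow's implementation.
# Assumption: a fatal database failure can be recognised by the
# "FATAL:" marker in the exception message, as in the quoted log.
FATAL_MARKERS = ("FATAL:",)


def is_fatal_db_error(exc: BaseException) -> bool:
    """Return True if the exception message contains a fatal marker."""
    return any(marker in str(exc) for marker in FATAL_MARKERS)


def run_or_crash(step, exit_code: int = 1):
    """Run one unit of work; exit the whole process on a fatal DB error.

    Exiting lets an external supervisor (systemd, Kubernetes, ...)
    restart the process, so state reconciliation runs on startup.
    Non-fatal errors are re-raised unchanged for normal handling.
    """
    try:
        return step()
    except Exception as exc:
        if is_fatal_db_error(exc):
            raise SystemExit(exit_code) from exc
        raise
```

   With something like this wrapped around the work done by each process, a 
"pgbouncer cannot connect to server" failure would end the process with a 
non-zero exit code instead of leaving it running in an inconsistent state.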
   
   Let me know what you think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
