Thanks for the info. As mentioned, we aim for high availability, and MariaDB/Galera gives us an HA installation of a relational database. Do the supported alternatives provide HA?
I'll look into the matrix; will get back on that.

On Wed, 2021-07-21 at 08:59 +0200, Jarek Potiuk wrote:

BTW. Just to add on top of that - officially we do not have MariaDB support. We only support Postgres, MySQL, and experimentally MSSQL in main. So while MariaDB is mostly compatible with MySQL, using MariaDB is your own decision, and you are quite a bit on your own when it comes to it. In fact we even discourage people from using MariaDB (https://github.com/apache/airflow#requirements) because we do not run tests against any version of MariaDB in our CI.

However - if you want to stick with MariaDB and are committed to it - maybe you could contribute and add support for MariaDB to our CI? Very recently one of our users (now committer) Aneesh added support for MSSQL (https://github.com/apache/airflow/pull/9973), and if you would like to follow in his footsteps, adding MariaDB to the test matrix should not be too difficult.

J.

On Wed, Jul 21, 2021 at 8:48 AM Jarek Potiuk <[email protected]> wrote:

I think Dev@ is a better place (added).

I understand the frustration; these kinds of errors are the worst - it's like Schrödinger's cat, neither dead nor alive until you look at it. My personal view is that whenever situations like this happen, the software should crash hard immediately. You can spend a lot of debugging, frustration, and engineering power trying to work around such situations and recover from them, but there will always be edge cases you did not think of - crashing the software hard in such cases is much better, because your deployment needs to handle restarts anyway, and starting "clean" is much better than trying to clean up while you are running. Especially since most "serious" deployments have a certain redundancy - in our case you can already run multiple schedulers, multiple workers, and multiple webservers, so restarting any of them is not a problem. Recovery can (and usually will) be handled at the higher "deployment" level - docker compose, K8S, or custom scripts should restart such a failed component.

Could you please share with us the errors that are printed in the Airflow logs in such cases - ideally "webserver", "scheduler", and "worker" if you happen to run Celery? I think if we see what's going on, we can investigate why you get this "hanging" case and implement "crash hard" there. If you could open a GitHub issue with all the details (cc me - @potiuk - when you do) at https://github.com/apache/airflow, I am happy to take a look at it. However, I am a bit surprised it happens; my belief is that Airflow WILL crash hard on a metadata DB access problem. The problem might be that Airflow is simply unaware that the connection to the DB is not working.

There might be another case - and it might result from the way the Galera cluster proxy works. This might actually be a matter of timeout configuration in MySQL. If you cannot see any logs in Airflow indicating errors, I think you might have the case that either the connection from Airflow stays in "opening" state for a long time, or an already established connection is simply not killed by the proxy. In this case it is really a question of bad configuration of:

a) the proxy configuration - the proxy, when doing failover, should either transparently move the open/being-established connections or kill them. If they are kept running, the client will "think" that the connection is still alive, send queries there, and possibly wait for answers for quite some time.
I do not know Galera, but I am sure it has some flexibility there and maybe there are some options you can change;

b) the MySQL server configuration - the client can use various techniques to determine whether the server is up, and there are various timeouts you can configure (https://stackoverflow.com/questions/14726789/how-can-i-change-the-default-mysql-connection-timeout-when-connecting-through-py):

+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| connect_timeout            | 10       |
| delayed_insert_timeout     | 300      |
| innodb_lock_wait_timeout   | 50       |
| innodb_rollback_on_timeout | OFF      |
| interactive_timeout        | 28800    |
| lock_wait_timeout          | 31536000 |
| net_read_timeout           | 30       |
| net_write_timeout          | 60       |
| slave_net_timeout          | 3600     |
| wait_timeout               | 28800    |
+----------------------------+----------+

However, I think this configuration has limited impact - it might speed up the actual failover done by the proxy, but I think it will not help much;

c) finally (and THIS is probably what can help you immediately) - you can fine-tune the client configuration. In Airflow you can configure various SQLAlchemy parameters to better handle your deployment: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#core (just look for all parameters starting with sqlalchemy). We use SQLAlchemy to connect to the metadata DB, and it has everything you need to fine-tune your configuration and - for example - set up timeouts for different situations.

In your case you should probably configure `sql_alchemy_connect_args`: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-connect-args - you will find links in our docs to SQLAlchemy that you can follow for examples. This is a simple dictionary of extra parameters that is passed to SQLAlchemy engine initialization. Most likely you simply need to provide a client-controlled timeout on establishing the connection, on running queries, or both. Those parameters depend on the dialect used (MySQL/Postgres), and the available capabilities also differ depending on which library you use to connect to MySQL (the libraries are listed here: https://docs.sqlalchemy.org/en/14/core/engines.html#mysql) - each of them has different parameters, so you need to check which ones apply to your library. However, I think one of { "timeout": N } or { "connect_timeout": N } should work in all of them (a minimal sketch follows below).

There is also one other parameter that might help: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-pool-pre-ping - it defaults to "True", so maybe you have it disabled and that's the root cause. It performs a full database operation on every connection taken from the pool, to make sure the server is responding. This will help in case your proxy accepts the connection but - for whatever reason - it is stuck. Maybe that's the problem you have.

Just to summarize: I think that looking at how your proxy behaves and some simple fine-tuning of the Airflow SQLAlchemy configuration might help (especially if you do not see any obvious errors while you observe the "hangs"). However, if you do see errors in the Airflow logs that do not result in Airflow crashing - please let us know via an issue and we will take a look at that.

J.
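A minimal sketch of the connect-args approach described above, assuming the mysqlclient (MySQLdb) or PyMySQL driver; the module name my_airflow_settings, the timeout values, and the exact keyword names are illustrative only and should be checked against the driver actually in use:

# my_airflow_settings.py - hypothetical module; it must be importable by Airflow
# (e.g. placed on PYTHONPATH). The keyword names are driver-specific:
# connect_timeout / read_timeout / write_timeout are accepted by mysqlclient
# and PyMySQL - check your driver's documentation before relying on them.

# This dictionary is passed verbatim to SQLAlchemy's create_engine(connect_args=...)
CONNECT_ARGS = {
    "connect_timeout": 10,  # fail fast if the proxy/node never completes the handshake
    "read_timeout": 30,     # give up on queries whose responses never arrive
    "write_timeout": 30,    # same for sending data to the server
}

# Then reference the dict by import path in airflow.cfg (Airflow 2.1-era layout):
#
#   [core]
#   sql_alchemy_connect_args = my_airflow_settings.CONNECT_ARGS
#   sql_alchemy_pool_pre_ping = True

With sql_alchemy_pool_pre_ping enabled, SQLAlchemy issues a lightweight test statement each time a pooled connection is checked out, so connections silently dropped during a proxy failover are detected and replaced instead of hanging.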
On Tue, Jul 20, 2021 at 11:19 PM Rolf Fokkens <[email protected]> wrote:

Hi!

Not sure if this is the proper place for this question; if not, please let me know.

We're running Airflow on a MariaDB/Galera cluster, and we're using HAProxy to provide HA connections. Sometimes (mostly due to maintenance) one node is temporarily unavailable, which forces HAProxy to drop connections to this node, after which new connections are passed to another (still running) node. This is quite common, and we use it for other software too. See https://galeracluster.com/library/documentation/ha-proxy.html for more info.

The issue we're running into, however, is that Airflow gets lost in this situation, which to Airflow looks something like a dropped connection. Airflow services seem to be running (they themselves think they are running) but they're just stuck. So they don't do anything, and we don't know it. Aiming at an HA setup, this is definitely not what we want. Colleagues are actually now at the point of disqualifying Airflow. Of course I can provide more details if needed, but I'd like to know first if this is the right place to bring this up.

Best,
Rolf
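For checking the proxy's failover behaviour independently of Airflow, a small standalone sketch along the lines suggested above - SQLAlchemy with the mysqlclient driver is assumed, and the host, port, and credentials are placeholders:

# failover_check.py - hypothetical standalone probe; URL and credentials are placeholders.
import time

from sqlalchemy import create_engine, text

# Same knobs the Airflow settings control: pre-ping detects dead pooled connections,
# connect_args bounds how long the client waits on the proxy.
engine = create_engine(
    "mysql://airflow:airflow@haproxy-host:3306/airflow",
    pool_pre_ping=True,
    connect_args={"connect_timeout": 10, "read_timeout": 30, "write_timeout": 30},
)

while True:
    try:
        with engine.connect() as conn:
            # Trivial query; if the proxy fails over mid-run, this should raise
            # (and succeed on a fresh connection next loop) rather than hang.
            conn.execute(text("SELECT 1"))
        print("db reachable")
    except Exception as exc:
        print(f"db unreachable: {exc}")
    time.sleep(5)

If this loop hangs instead of erroring while a Galera node is taken down, the problem is in the proxy or driver timeouts rather than in Airflow itself.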
