BTW. Just to add on top of that - officially we do not have MariaDB support. We only support Postgres, MySQL and, experimentally in main, MSSQL. So while MariaDB is mostly compatible with MySQL, using it is your own decision and you are largely on your own when it comes to it. In fact we even discourage people from using MariaDB (https://github.com/apache/airflow#requirements) because we do not run tests against any version of MariaDB in our CI.

However - if you want to stick with MariaDB and are committed to it - maybe you could contribute and add support for MariaDB to our CI? Very recently one of our users (now a committer), Aneesh, added support for MSSQL (https://github.com/apache/airflow/pull/9973), and if you would like to follow in his footsteps, adding MariaDB to the test matrix should not be too difficult.

J.

On Wed, Jul 21, 2021 at 8:48 AM Jarek Potiuk <[email protected]> wrote:

> I think Dev@ is a better place (added). I understand the frustration - this kind of error is the worst. It's like Schrödinger's cat - neither dead nor alive until you look at it.
>
> My personal view is that whenever situations like this happen, the software should crash hard immediately. You can spend a lot of debugging, frustration and engineering effort trying to work around such situations and recover, but there will always be edge cases you won't think of. Crashing the software hard in such a case is much better, because in your deployment you need to handle restarts anyway, and starting "clean" is much better than trying to clean up while you are running. Especially since most "serious" deployments have some redundancy - in our case you can already run multiple schedulers, multiple workers and multiple webservers, so restarting any of them is not a problem. Recovery can (and usually will) then be handled at the higher "deployment" level - docker compose, K8S or custom scripts should restart such a failed component.
>
> Could you please share with us the errors that are printed in the Airflow logs in such cases - ideally "webserver", "scheduler" and "worker" if you happen to run Celery? If we see what is going on, we can investigate why you get this "hanging" behaviour and implement "crash hard" there. If you could open a GitHub issue with all the details (cc me - @potiuk - when you do) at https://github.com/apache/airflow, I am happy to take a look at it. However, I am a bit surprised this happens - my belief is that Airflow WILL crash hard on a metadata DB access problem. The problem might be that Airflow is simply unaware that the connection to the DB is not working.
>
> There might be another cause - and it might result from the way the Galera cluster proxy works. This might actually be a matter of timeout configuration in MySQL. If you cannot see any errors in the Airflow logs, I think you might have the case where either a connection from Airflow stays in the "opening" state for a long time, or an already established connection is simply not killed by the proxy. In that case this is really a question of bad configuration of:
>
> a) the proxy - when doing failover, the proxy should either transparently move the open/being-established connections or kill them. If they are kept running, the client will "think" that the connection is still alive, send queries there and possibly wait for answers for quite some time.
> I do not know Galera well, but I am sure there is some flexibility there and maybe there are some options you can change.
>
> b) the MySQL server configuration - the client can use various techniques to determine whether the server is up, and there are various timeouts you can configure (see https://stackoverflow.com/questions/14726789/how-can-i-change-the-default-mysql-connection-timeout-when-connecting-through-py):
>
> +----------------------------+----------+
> | Variable_name              | Value    |
> +----------------------------+----------+
> | connect_timeout            |       10 |
> | delayed_insert_timeout     |      300 |
> | innodb_lock_wait_timeout   |       50 |
> | innodb_rollback_on_timeout | OFF      |
> | interactive_timeout        |    28800 |
> | lock_wait_timeout          | 31536000 |
> | net_read_timeout           |       30 |
> | net_write_timeout          |       60 |
> | slave_net_timeout          |     3600 |
> | wait_timeout               |    28800 |
> +----------------------------+----------+
>
> However, I think this configuration will have limited impact - it might speed up the actual failover done by the proxy, but I do not think it will help much on its own.
>
> c) Finally (and THIS is probably what can help you immediately) - you can fine-tune the client configuration. In Airflow you can configure various SQLAlchemy parameters to better handle your deployment: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#core (just look for all the parameters starting with "sqlalchemy"). We use SQLAlchemy to connect to the metadata DB and it has everything you need to fine-tune your configuration - for example, to set up timeouts for different situations. In your case you should probably configure `sql_alchemy_connect_args`: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-connect-args - you will find links there to the SQLAlchemy documentation with examples. This is simply a dictionary of extra parameters passed to the SQLAlchemy engine initialization. Most likely you just need to provide a client-controlled timeout on establishing a connection, on running a query, or both. Those parameters depend on the dialect used (MySQL/Postgres), but also on which library you use to connect to MySQL (the available libraries are listed here: https://docs.sqlalchemy.org/en/14/core/engines.html#mysql) - each of them supports different parameters, so you need to check which ones apply to your library. However, I think one of { "timeout": N } or { "connect_timeout": N } should work with all of them (see the PS below for a rough sketch).
>
> There is also one other parameter that might help: https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-pool-pre-ping - it defaults to "True", so maybe you have it disabled and that is the root cause. With this parameter, a test database operation is performed on every connection before it is used, to make sure the server is responding. It will help in case your proxy accepts the connection but - for whatever reason - it gets stuck. Maybe that is the problem you have.
>
> Just to summarize: I think that looking at how your proxy behaves and some simple fine-tuning of the Airflow SQLAlchemy configuration should help (especially if you do not see any obvious errors while you observe the "hangs"). However, if you do see errors in the Airflow logs that do not result in Airflow crashing - please let us know via an issue and we will take a look at it.
>
> J.
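>
> PS. Just to illustrate - below is roughly what those two settings translate to at the SQLAlchemy level. It is an untested sketch: I am assuming the default mysqlclient / PyMySQL drivers (where the client-side timeout parameter is called "connect_timeout"), and the connection string is only a placeholder - please check the docs of the driver you actually use.
>
> from sqlalchemy import create_engine, text
>
> # connect_args is handed straight to the DB driver; pool_pre_ping makes the
> # pool test each connection before handing it out, so dead connections left
> # over after a failover are replaced instead of hanging.
> engine = create_engine(
>     "mysql://airflow:PASSWORD@haproxy-host:3306/airflow",  # placeholder DSN
>     connect_args={"connect_timeout": 10},  # fail after ~10s instead of hanging on connect
>     pool_pre_ping=True,
> )
>
> # A quick way to double-check the server-side timeouts from b):
> with engine.connect() as conn:
>     for name, value in conn.execute(text("SHOW VARIABLES LIKE '%timeout%'")):
>         print(name, value)
>
> With a client-side timeout like this, a connection that gets stuck in the proxy fails with an error after a few seconds instead of blocking forever, so Airflow can fail loudly (and be restarted) rather than hang silently.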
>
> On Tue, Jul 20, 2021 at 11:19 PM Rolf Fokkens <[email protected]> wrote:
>
>> Hi!
>>
>> Not sure if this is the proper place for this question; if not, please let me know.
>>
>> We're running Airflow on a MariaDB/Galera cluster, and we're using haproxy to provide HA connections. Sometimes (mostly due to maintenance) one node is temporarily unavailable, which forces haproxy to drop connections to that node, after which new connections are passed to another (still running) node. This is quite common, and we use it for other software too. See https://galeracluster.com/library/documentation/ha-proxy.html for more info.
>>
>> The issue we're running into, however, is that Airflow gets lost in this situation, which to Airflow looks something like a dropped connection. The Airflow services seem to be running (they themselves think they are running), but they're just stuck. So they don't do anything, and we don't know it.
>>
>> Aiming at an HA setup, this is definitely not what we want. Colleagues are actually now at the point of disqualifying Airflow.
>>
>> Of course I can provide more details if needed, but I'd like to know first if this is the right place to bring this up.
>>
>> Best,
>>
>> Rolf
>
> --
> +48 660 796 129

--
+48 660 796 129
