Thanks for the info.

As mentioned, we aim for high availability, and MariaDB/Galera provide an HA
installation of a relational database. Do the supported alternatives provide HA?

I'll look into the matrix; will get back on that.

On Wed, 2021-07-21 at 08:59 +0200, Jarek Potiuk wrote:
BTW. Just to add on top of that - officially we do not have MariaDB support. We
only support Postgres, MySQL and, experimentally, MSSQL in main. So while MariaDB
is mostly compatible with MySQL, using MariaDB is your own decision and you are
quite a bit on your own when it comes to it. In fact we even discourage people
from using MariaDB (https://github.com/apache/airflow#requirements) because we do
not run tests against any version of MariaDB in our CI.

However - if you want to stick with MariaDB and are committed to it - maybe you
could contribute and add support for MariaDB in our CI?
Very recently one of our users (now a committer), Aneesh, added support for MSSQL
(https://github.com/apache/airflow/pull/9973), and if you would like to follow in
his footsteps, adding MariaDB to the matrix of tests should not be too difficult.

J.


On Wed, Jul 21, 2021 at 8:48 AM Jarek Potiuk 
<[email protected]> wrote:
I think dev@ is a better place (added). I understand the frustration; these kinds
of errors are the worst - it's like Schrödinger's cat: neither dead nor alive
until you look at it.

My personal view is that whenever situations like this happen, the software
should crash hard immediately. You spend a lot of debugging, frustration and
engineering effort trying to work around such situations and recover, but there
will always be edge cases you won't think of. Crashing hard in such a case is
much better, because your deployment needs to handle restarts anyway, and
starting "clean" is much better than trying to clean up while running.
Especially since most "serious" deployments have some redundancy - in our case
you can already run multiple schedulers, multiple workers and multiple
webservers, so restarting any of them is not a problem. Recovery can then (and
usually will) be handled at the higher "deployment" level - Docker Compose, K8S
or custom scripts should restart such a failed component.

Could you please share with us the errors that are printed in such cases in the
Airflow logs - ideally from the webserver, the scheduler and, if you happen to
run Celery, the workers? I think if we see what's going on we can investigate why
you have this "hanging" case and implement "crash hard" there. If you could open
a GitHub issue with all the details (https://github.com/apache/airflow - cc me,
@potiuk, when you do), I am happy to take a look at it. However, I am a bit
surprised it happens; my belief is that Airflow WILL crash hard on metadata DB
access problems. The problem might be that Airflow is also unaware that the
connection to the DB is not working.

There might be another case - and it might result from the way the Galera cluster
proxy works. This might actually be a matter of timeout configuration in MySQL.
If you cannot see any logs in Airflow indicating errors, you might have the case
that either the connection from Airflow is simply stuck in an "opening" state for
a long time, or an already established connection is simply not killed by the
proxy. In this case it is really a question of misconfiguration of:

a) the proxy configuration - the proxy, when doing failover, should either
transparently move open/being-established connections or kill them. If they are
kept open, the client will "think" the connection is still alive, send queries
there and possibly wait for answers for quite some time. I do not know Galera
well, but I am sure it has some flexibility there and maybe there are some
options you can change.
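As a sketch only (node addresses and timeout values are illustrative, not a
recommendation), the haproxy backend described in the Galera docs linked below can
be given explicit timeouts so that dead nodes are detected quickly and hung
established connections get cut instead of lingering:

```
backend galera
    mode tcp
    balance leastconn
    option httpchk                  # pair with Galera's "clustercheck" script on port 9200
    timeout connect 5s              # give up quickly on a node that does not answer
    timeout server 30m              # cut established connections that sit idle too long
    server node1 192.168.1.1:3306 check port 9200
    server node2 192.168.1.2:3306 check port 9200 backup
```

With short `timeout connect`/`timeout server` values, clients hanging on a dead
node get an error quickly instead of waiting indefinitely, which is exactly the
"crash hard" behavior discussed above.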
b) the MySQL server configuration - the client can use various techniques to
determine whether the server is up, and there are various timeouts you can
configure
(https://stackoverflow.com/questions/14726789/how-can-i-change-the-default-mysql-connection-timeout-when-connecting-through-py);
for example, `SHOW VARIABLES LIKE '%timeout%'` typically shows defaults like:
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| connect_timeout            | 10       |
| delayed_insert_timeout     | 300      |
| innodb_lock_wait_timeout   | 50       |
| innodb_rollback_on_timeout | OFF      |
| interactive_timeout        | 28800    |
| lock_wait_timeout          | 31536000 |
| net_read_timeout           | 30       |
| net_write_timeout          | 60       |
| slave_net_timeout          | 3600     |
| wait_timeout               | 28800    |
+----------------------------+----------+
However, I think this configuration has limited impact - it might speed up the
actual failover done by the proxy, but I do not think it will help with the hangs
by itself.

c) Finally (and THIS is probably what can help you immediately) - you can
fine-tune the client configuration. In Airflow you can configure various
SQLAlchemy parameters to better handle your deployment:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#core
(just look for all parameters starting with sqlalchemy). We use SQLAlchemy to
connect to the metadata DB, and it has everything you need to fine-tune your
configuration - for example, to set up timeouts for different situations. In your
case you should probably configure `sql_alchemy_connect_args`:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-connect-args
- you will find links in our docs to the SQLAlchemy documentation with examples.
This is a simple dictionary of extra parameters passed to SQLAlchemy engine
initialization. Most likely you simply need to provide a client-controlled
timeout on establishing connections, on running queries, or both. Those
parameters depend on the dialect used (MySQL/Postgres), and the capabilities also
differ depending on which library you use to connect to MySQL (the available
libraries are listed here: https://docs.sqlalchemy.org/en/14/core/engines.html#mysql),
and each of them has different parameters, so you need to check which ones apply
to your library. However, I think one of `{"timeout": N}` or
`{"connect_timeout": N}` should work in all of them.
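For example (a sketch - the module path and the 5-second value are hypothetical,
and the exact parameter names depend on your MySQL driver), since
`sql_alchemy_connect_args` takes the import path of a dictionary, you could
define one in a module on Airflow's PYTHONPATH:

```python
# my_company/airflow_db_config.py (hypothetical module)
#
# Extra keyword arguments that Airflow passes through to SQLAlchemy's
# create_engine(connect_args=...). "connect_timeout" is understood by the
# common MySQL drivers (mysqlclient, PyMySQL) and makes the client fail
# fast instead of hanging while the proxy is failing over.
CONNECT_ARGS = {
    "connect_timeout": 5,  # seconds to wait when establishing a connection
}
```

and then point the config option at it, e.g.
`sql_alchemy_connect_args = my_company.airflow_db_config.CONNECT_ARGS` under
`[core]` in airflow.cfg.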

There is also one other parameter that might help:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-pool-pre-ping
- it defaults to "True", so maybe you have it disabled and that's the root cause.
This parameter makes every connection checked out of the pool perform a full
database round trip first, to make sure the server is responding. It will help in
case your proxy accepts connections but, for whatever reason, they get stuck.
Maybe that's the problem you have.
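What pre-ping does can be illustrated with plain SQLAlchemy (a sketch - an
in-memory SQLite URL stands in for your real MySQL/proxy DSN):

```python
from sqlalchemy import create_engine, text

# With pool_pre_ping=True, SQLAlchemy issues a cheap "ping" (roughly a
# SELECT 1) every time a pooled connection is checked out. If the ping
# fails, the stale connection is discarded and a fresh one is opened
# transparently, instead of handing the application a dead connection.
engine = create_engine(
    "sqlite://",        # placeholder; Airflow would use your MySQL DSN here
    pool_pre_ping=True,
)

with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()
```

Airflow's `sql_alchemy_pool_pre_ping` option simply toggles this flag on the
engine it creates for the metadata DB.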

Just to summarize: I think that looking at how your proxy behaves, plus some
simple fine-tuning of the Airflow SQLAlchemy configuration, might help
(especially if you do not see any obvious errors while you observe the "hangs").
However, if you see errors in the Airflow logs that do not result in Airflow
crashing - please let us know via an issue and we will take a look at it.

J.


On Tue, Jul 20, 2021 at 11:19 PM Rolf Fokkens 
<[email protected]> wrote:
Hi!

Not sure if this is the proper place for this question; if not, please let me
know.

We're running Airflow on a MariaDB/Galera cluster, and we're using haproxy to
provide HA connections. Sometimes (mostly due to maintenance) one node is
temporarily unavailable, which forces haproxy to drop connections to this node,
after which new connections are passed to another (still running) node. This is
quite common, and we use it for other software too. See
https://galeracluster.com/library/documentation/ha-proxy.html for more info.

The issue we're running into, however, is that Airflow gets lost in this
situation, which to Airflow looks like a dropped connection. The Airflow services
seem to be running (they themselves think they are running) but they're just
stuck. So they don't do anything, and we don't know.

Aiming at an HA setup, this is definitely not what we want. Colleagues are
actually now at the point of disqualifying Airflow.

Of course I can provide more details if needed, but I'd like to know first if 
this is the right place to bring this up.

Best,

Rolf

