I think Dev@ is a better place (added). I understand the frustration - those
kinds of errors are the worst. It's like Schrödinger's cat - neither dead
nor alive until you look at it.

My personal view is that whenever situations like this happen, the software
should crash hard immediately. You save a lot of debugging, frustration and
engineering effort that would otherwise go into working around such
situations and trying to recover - and there will always be edge cases you
did not think about. Crashing the software hard in such a case is much
better, because your deployment needs to handle restarts anyway, and
starting "clean" is much better than trying to clean up while running.
Especially since most "serious" deployments have a certain redundancy - in
our case we can already have multiple schedulers, multiple workers and
multiple webservers, so restarting any of them is not a problem. Recovery
can then (and usually will) be handled at the higher "deployment" level -
Docker Compose, K8S or custom scripts should restart such a failed
component.
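To illustrate the idea - a minimal Docker Compose fragment (a sketch only;
the image tag and service layout are placeholders for whatever you actually
run) that delegates restarts to the deployment layer:

```yaml
# docker-compose.yml fragment (illustrative only)
services:
  scheduler:
    image: apache/airflow:2.1.2   # placeholder tag
    command: scheduler
    # if the scheduler crashes hard, the deployment layer
    # brings it back "clean" instead of it limping along
    restart: unless-stopped
```

In K8S the equivalent is the pod restart policy plus a liveness probe.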

Could you please share with us the errors that are printed in such cases in
the Airflow logs - ideally "webserver", "scheduler" and "worker" if you
happen to run Celery? I think if we see what's going on, we can investigate
why you have this "hanging" case and implement "crash hard" there. If you
could open a GitHub issue with all the details (cc me - @potiuk - when you
do) at https://github.com/apache/airflow, I am happy to take a look.
However, I am a bit surprised it happens - my belief is that Airflow WILL
crash hard on metadata DB access problems. The problem might arise if
Airflow is also unaware that the connection to the DB is not working.

There might be another case - and it might result from the way the Galera
cluster proxy works. This may actually be a matter of timeout configuration
in MySQL. If you cannot see any logs in Airflow indicating errors, you
might have a case where either the connection from Airflow simply stays in
the "opening" state for a long time, or an already established connection
is simply not killed by the proxy. In that case it is really a question of
bad configuration of:

a) the proxy configuration - the proxy, when doing failover, should either
transparently move the open/being-established connections or kill them. If
they are kept running, the client will "think" that the connection is still
alive, send queries there and possibly wait for answers for quite some
time. I do not know Galera well, but I am sure it has some flexibility
there and there may be options you can change.
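Since you mentioned HAProxy in your setup - if I remember correctly, the
relevant knobs there are the `timeout` settings (a sketch; the values are
purely illustrative and need tuning for your environment):

```
# haproxy.cfg fragment (illustrative only)
defaults
    mode tcp
    timeout connect 5s    # give up quickly on a dead backend
    timeout client  30m   # long-lived MySQL connections from Airflow
    timeout server  30m
```

A short `timeout connect` in particular makes the proxy fail fast instead
of leaving the client waiting on a half-open connection.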
b) the MySQL server configuration - the client can use various techniques
to determine whether the server is up, and there are various timeouts you
can configure (
https://stackoverflow.com/questions/14726789/how-can-i-change-the-default-mysql-connection-timeout-when-connecting-through-py
):
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| connect_timeout            | 10       |
| delayed_insert_timeout     | 300      |
| innodb_lock_wait_timeout   | 50       |
| innodb_rollback_on_timeout | OFF      |
| interactive_timeout        | 28800    |
| lock_wait_timeout          | 31536000 |
| net_read_timeout           | 30       |
| net_write_timeout          | 60       |
| slave_net_timeout          | 3600     |
| wait_timeout               | 28800    |
+----------------------------+----------+
However, I think this configuration should have limited impact - it might
speed up the actual failover done by the proxy, but I think it will not
help with the client-side hangs you observe.
c) Finally (and THIS is probably what can help you immediately) - you can
fine-tune the client configuration. In Airflow you can configure various
SQLAlchemy parameters to better handle your deployment:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#core
(just look for all parameters starting with sqlalchemy). We are using
SQLAlchemy to connect to the metadata DB, and it has everything you need to
fine-tune your configuration - for example, to set up timeouts for
different situations. In your case you should probably configure
`sql_alchemy_connect_args`:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-connect-args
- you will find links in our docs to the SQLAlchemy documentation with some
examples. This is a simple dictionary of extra parameters that is passed to
SQLAlchemy engine initialization. Most likely you simply need to provide a
client-controlled timeout for establishing the connection, for running a
query, or both. These parameters depend on the dialect used
(MySQL/Postgres), and the available capabilities also differ depending on
which library you use to connect to MySQL (the available libraries are
listed here: https://docs.sqlalchemy.org/en/14/core/engines.html#mysql),
and each of them takes different parameters, so you need to check which
ones are right for your library. However, I think one of { "timeout": N }
or { "connect_timeout": N } should work in all the libraries.
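As a sketch (the module name is hypothetical, and the exact key names
depend on your driver - I believe both mysqlclient and PyMySQL accept
`connect_timeout` and `read_timeout`, but do verify against your driver's
docs):

```python
# my_company/airflow_connect_args.py (hypothetical module on PYTHONPATH)
# Client-side timeouts so a dead proxy endpoint fails fast instead of hanging.
CONNECT_ARGS = {
    "connect_timeout": 5,   # seconds to wait while establishing a connection
    "read_timeout": 30,     # seconds to wait for a query result
}
```

and then, if I read the docs right, `sql_alchemy_connect_args` takes the
import path of such a dictionary, so in airflow.cfg:

    [core]
    sql_alchemy_connect_args = my_company.airflow_connect_args.CONNECT_ARGS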

There is also one other parameter that might help:
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-pool-pre-ping
- it defaults to "True", so maybe you have it disabled and that's the root
cause. This parameter makes SQLAlchemy perform a full database operation
for every connection checked out, to make sure that the server is
responding. It will help in case your proxy accepts the connection but -
for whatever reason - it is stuck. Maybe that's the problem you have.
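A minimal sketch of the idea behind pre-ping (this is my own illustration,
not SQLAlchemy's actual implementation): before handing out a pooled
connection, run a trivial "ping" query, and if it fails, throw the stale
connection away and open a fresh one:

```python
def checkout(pool, connect, ping):
    """Hand out a connection, replacing it if the ping detects it is stale."""
    conn = pool.pop() if pool else connect()
    try:
        ping(conn)          # e.g. the equivalent of "SELECT 1"
        return conn
    except Exception:
        return connect()    # stale connection detected: reconnect
```

This is exactly the failure mode a misbehaving proxy creates: the pool
holds connections that look open but are dead on the other side.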

Just to summarize: I think that looking at how your proxy behaves, plus
some simple fine-tuning of the Airflow SQLAlchemy configuration, might help
(especially if you do not see any obvious errors while you observe the
"hangs"). However, if you see errors in the Airflow logs that do not result
in Airflow crashing - please let us know via an issue and we will take a
look at that.

J.


On Tue, Jul 20, 2021 at 11:19 PM Rolf Fokkens <[email protected]>
wrote:

> Hi!
>
> Not sure if this is the proper place for this question; if not please let
> me know.
>
> We're running airflow on a mariadb/galera cluster, and we're using
> haproxy to provide HA connections. Sometimes (mostly due to maintenance)
> one node is temporarily unavailable, which forces haproxy to drop
> connections to this node after which new connections are passed to another
> (still running) node. This is quite common, and we use it for other
> software too. See
> https://galeracluster.com/library/documentation/ha-proxy.html for more
> info.
>
> The issue we're running into however is the fact that airflow gets lost in
> this situation which to airflow is something like a dropped connection.
> airflow services seem to be running (they themselves think they are
> running) but they're just stuck. So they don't do anything, but we don't
> know.
>
> Aiming at an HA setup, this is definitely not what we want. Colleagues
> actually are now at the point that they disqualify airflow.
>
> Of course I can provide more details if needed, but I'd like to know first
> if this is the right place to bring this up.
>
> Best,
>
> Rolf
>


-- 
+48 660 796 129
