Yeah. There are various HA solutions (Active/Passive) for both
Postgres and MySQL. I think they are mostly commercial offerings (but hey, if
you want HA you can afford it  ;) ).
BTW. We do not support an Active/Active HA scenario (yet). It is mostly not
needed (the DB is rarely a bottleneck even in huge installations), so
Active/Active is not really necessary, and we rely heavily on modern
locking solutions for Scheduler HA, so for now we can just confirm
that an Active/Passive HA setup for the DB is the way to go.

J


On Wed, Jul 21, 2021 at 9:09 AM Rolf Fokkens <[email protected]>
wrote:

> Thanks for the info.
>
> As mentioned we aim for high availability, and mariadb/galera provide an HA
> installation of a relational database. Do the supported alternatives
> provide HA?
>
> I'll look into the matrix; will get back on that.
>
> On Wed, 2021-07-21 at 08:59 +0200, Jarek Potiuk wrote:
>
> BTW. Just to add on top of that - officially we do not have MariaDB
> support. We only support Postgres, MySQL and, experimentally, MSSQL in main.
> So while MariaDB is mostly compatible with MySQL, using MariaDB is your own
> decision, and you are quite a bit on your own when it comes to
> it. In fact we even discourage people from using MariaDB (
> https://github.com/apache/airflow#requirements ) because we do not run
> tests with any version of MariaDB in our CI.
>
> However - if you want to stick to MariaDB and are committed to it - maybe
> you could contribute and add support for MariaDB in our CI ?
> Very recently one of our users (now committer) Aneesh added support for
> MSSQL https://github.com/apache/airflow/pull/9973 and if you would like
> to follow his footsteps - adding MariaDB to the matrix of tests should not
> be too difficult.
>
> J.
>
>
> On Wed, Jul 21, 2021 at 8:48 AM Jarek Potiuk <[email protected]> wrote:
>
> I think dev@ is a better place (added). I understand the frustration;
> those kinds of errors are the worst - it's like Schrödinger's cat,
> neither dead nor alive until you look at it.
>
> My personal view is that whenever situations like this happen, the
> software should crash hard immediately. Trying to work around this kind
> of situation and recover costs a lot of debugging, frustration and
> engineering effort, and there will always be edge cases that
> you won't think about - crashing the software hard in such a case is much
> better, because your deployment needs to handle restarts anyway, and
> starting "clean" is much better than trying to clean up while you are
> running. Especially since most "serious" deployments have a certain
> redundancy - in our case we can already have multiple schedulers, multiple
> workers and multiple webservers, so restarting either is not a problem.
> Recovery can (and usually will) be handled at the higher "deployment"
> level - docker compose, K8S or custom scripts should restart such
> a failed component.
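As a concrete illustration of "restarts handled at the deployment level" - a docker-compose fragment like this (the service name and image tag are just examples, not a reference deployment) makes a hard crash a non-event:

```yaml
# Hypothetical docker-compose fragment: the restart policy lets the
# orchestrator bring a hard-crashed component back up "clean".
services:
  airflow-scheduler:
    image: apache/airflow:2.1.2
    command: scheduler
    restart: always   # restart the container whenever the process exits
```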
>
> Could you please share with us the errors that are printed in such cases in
> the Airflow logs - ideally "webserver", "scheduler" and "worker" if you
> happen to run Celery? I think if we see what's going on, we can investigate
> why you have this "hanging" case and implement "crash hard" there. If you
> could open a GitHub issue with all the details (cc me - @potiuk when
> you do) at https://github.com/apache/airflow - I am happy to take a look at
> it. However, I am a bit surprised this happens; my belief is that Airflow
> WILL crash hard on a metadata DB access problem. The problem might be that
> Airflow is also unaware that the connection to the DB is not working.
>
> There might be another case - and it might result from the way the galera
> cluster proxy works. This might actually be a matter of timeout
> configuration in MySQL. If you cannot see any logs in Airflow indicating
> errors, I think you might have a case where either the connection from
> Airflow is simply in the "opening" state for a long time, or an already
> established connection is simply not killed by the proxy. In this case it
> is really a question of bad configuration of
>
> a) the proxy configuration - the proxy, when doing failover, should either
> transparently move open/being-established connections or kill them. If
> they are kept running, the client will "think" that the connection is still
> alive, send queries there and possibly wait for answers for quite some
> time. I do not know Galera well, but I am sure it has some flexibility there
> and maybe there are some options you can change
> b) the mysql server configuration - the client can use various techniques
> to determine if the server is up - there are various timeouts you can
> configure (
> https://stackoverflow.com/questions/14726789/how-can-i-change-the-default-mysql-connection-timeout-when-connecting-through-py
> )
> +----------------------------+----------+
> | Variable_name              | Value    |
> +----------------------------+----------+
> | connect_timeout            | 10       |
> | delayed_insert_timeout     | 300      |
> | innodb_lock_wait_timeout   | 50       |
> | innodb_rollback_on_timeout | OFF      |
> | interactive_timeout        | 28800    |
> | lock_wait_timeout          | 31536000 |
> | net_read_timeout           | 30       |
> | net_write_timeout          | 60       |
> | slave_net_timeout          | 3600     |
> | wait_timeout               | 28800    |
> +----------------------------+----------+
> However, I think this configuration has limited impact - it might
> speed up the actual failover done by the proxy, but I think it will not
> fix the "hanging" by itself.
>
> c) Finally (and THIS is probably what can help you immediately) - you can
> fine-tune the client configuration. In Airflow you can configure various
> SQLAlchemy parameters to better handle your deployment:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#core
> (just look for all parameters starting with sqlalchemy). We use
> SQLAlchemy to connect to the metadata DB, and it has everything that you need
> to fine-tune your configuration and - for example - set up timeouts for
> different situations. In your case you should probably configure
> `sql_alchemy_connect_args`:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-connect-args
> - you will see some links in our docs to SQLAlchemy that you can follow to
> find some examples. This is a simple dictionary of extra parameters
> that is passed to the SQLAlchemy engine initialization. Most likely you
> simply need to provide a client-controlled timeout on establishing the
> connection, on running a query, or both. Those parameters depend on the
> dialect used (MySQL/Postgres), and the capabilities also differ
> depending on which library you use to connect to MySQL (the available
> libraries are listed here:
> https://docs.sqlalchemy.org/en/14/core/engines.html#mysql ) - each of
> them has different parameters, so you need to check which ones work for each
> library. However, I think one of { "timeout": N } or { "connect_timeout":
> N } should work in all the libraries.
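For illustration, a minimal sketch of what c) could look like. `sql_alchemy_connect_args` in `airflow.cfg` takes an import path to a dict; the module name and the timeout value below are made-up examples - check your DB driver's documentation for the exact parameter names it accepts:

```python
# airflow_local_settings.py -- an example module on PYTHONPATH.
# This dict is handed to SQLAlchemy's create_engine(connect_args=...),
# so its keys must be parameters your DB driver understands.
# "connect_timeout" is an illustrative choice; tune the value to be a bit
# longer than your proxy's failover window.
sql_alchemy_connect_args = {
    "connect_timeout": 10,  # seconds to wait when establishing a connection
}
```

You would then point Airflow at it with `sql_alchemy_connect_args = airflow_local_settings.sql_alchemy_connect_args` in the `[core]` section of `airflow.cfg`.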
>
> There is also one other parameter that might help:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-pool-pre-ping
> - it defaults to "True", so maybe you have it disabled and that's the root
> cause. This parameter performs a full database round-trip for every
> connection checked out of the pool, to make sure that the server is
> responding. It will help in case your proxy accepts the connection but -
> for whatever reason - it gets stuck. Maybe that's the problem you have.
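To make the pre-ping behaviour concrete, here is a toy sketch (not SQLAlchemy's actual implementation; all names are made up) of what the pool does before handing a connection to Airflow:

```python
# Toy illustration of pool_pre_ping: before a pooled connection is handed
# out, the pool issues a cheap liveness check (a "SELECT 1" in reality);
# if the check fails, the stale connection is discarded and a fresh one
# is opened transparently, instead of letting the caller hang on it.

class FakeConnection:
    def __init__(self, alive=True):
        self.alive = alive

    def ping(self):
        # stands in for "SELECT 1" against the real server
        if not self.alive:
            raise ConnectionError("server went away")


def checkout(pool, connect):
    """Return a healthy connection, replacing stale ones (pre-ping)."""
    conn = pool.pop() if pool else connect()
    try:
        conn.ping()
    except ConnectionError:
        conn = connect()  # stale connection -> reconnect transparently
    return conn


pool = [FakeConnection(alive=False)]        # connection killed by a failover
fresh = checkout(pool, lambda: FakeConnection())
print(fresh.alive)  # True - the dead connection was replaced
```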
>
> Just to summarize: I think that looking at how your proxy behaves and
> simple fine-tuning of the Airflow SQLAlchemy configuration might help
> (especially if you do not see any obvious errors while you observe the
> "hangs"). However, if you see errors in the Airflow logs that
> do not result in Airflow crashing - please let us know via an issue and we
> will take a look at it.
>
> J.
>
>
> On Tue, Jul 20, 2021 at 11:19 PM Rolf Fokkens <[email protected]>
> wrote:
>
> Hi!
>
> Not sure if this is the proper place for this question; if not, please let
> me know.
>
> We're running airflow on a mariadb/galera cluster, and we're using
> haproxy to provide HA connections. Sometimes (mostly due to maintenance)
> one node is temporarily unavailable, which forces haproxy to drop
> connections to this node, after which new connections are passed to another
> (still running) node. This is quite common, and we use it for other
> software too. See
> https://galeracluster.com/library/documentation/ha-proxy.html for more
> info.
>
> The issue we're running into, however, is that airflow gets lost in
> this situation, which to airflow is something like a dropped connection.
> The airflow services seem to be running (they themselves think they are
> running) but they're just stuck. So they don't do anything, and we don't
> know it.
>
> Aiming at an HA setup, this is definitely not what we want. Colleagues
> are actually now at the point of disqualifying airflow.
>
> Of course I can provide more details if needed, but I'd like to know first
> if this is the right place to bring this up.
>
> Best,
>
> Rolf
>
>
>
> --
> +48 660 796 129
>
>
>
>
>
