BTW. Just to add on top of that - officially we do not have MariaDB
support. We only support Postgres, MySQL and, experimentally, MSSQL in main.
So while MariaDB is mostly compatible with MySQL, using MariaDB is your own
decision and you are quite a bit on your own when it comes to
MariaDB. In fact we even discourage people from using MariaDB
https://github.com/apache/airflow#requirements  because we do not run tests
against any version of MariaDB in our CI.

However - if you want to stick to MariaDB and are committed to it - maybe
you could contribute and add support for MariaDB in our CI ?
Very recently one of our users (now committer) Aneesh added support for
MSSQL https://github.com/apache/airflow/pull/9973 and if you would like to
follow his footsteps - adding MariaDB to the matrix of tests should not be
too difficult.

J.


On Wed, Jul 21, 2021 at 8:48 AM Jarek Potiuk <[email protected]> wrote:

> I think dev@ is a better place (added). I understand the frustration -
> those kinds of errors are the worst. It's like Schrödinger's cat -
> neither dead nor alive until you look at it.
>
> My personal view is that whenever situations like this happen, the
> software should crash hard immediately. You can spend a lot of debugging,
> frustration and engineering power trying to work around this kind of
> situation and recover, but there will always be edge cases that you won't
> think about - crashing the software hard in such a case is much better,
> because in your deployment you need to handle restarts anyway, and
> starting "clean" is much better than trying to clean up while you are
> running. Especially since with most "serious" deployments you have a
> certain redundancy - in our case we can already have multiple schedulers,
> multiple workers and multiple webservers, so restarting any of them is not
> a problem. Then recovery can (and usually will) be handled at the higher
> "deployment" level - either Docker Compose, K8S or custom scripts should
> restart such a failed component.
>
> Could you please share with us the errors that are printed in such cases
> in the logs of Airflow - ideally "webserver", "scheduler" and "worker" if
> you happen to run Celery? I think if we see what's going on we can
> investigate why you have this "hanging" case and implement "crash hard"
> there. If you could open a GitHub issue with all the details (cc me -
> @potiuk - when you do) https://github.com/apache/airflow - I am happy to
> take a look at it. However, I am a bit surprised it happens; my belief is
> that Airflow WILL crash hard on a metadata DB access problem. The problem
> might be that Airflow is simply unaware that the connection to the DB is
> not working.
>
> There might be another cause - and it might result from the way the
> Galera cluster proxy works. This might actually be a matter of timeout
> configuration in MySQL. In case you cannot see any logs in Airflow
> indicating errors, I think you might have the case that either the
> connection from Airflow is simply in the "opening" state for a long time,
> or an already established connection is simply not killed by the proxy.
> In this case it is really a question of bad configuration of:
>
> a) the proxy configuration - when doing failover, the proxy should either
> transparently move the open/being-established connections or kill them.
> If they are kept running, the client will "think" that the connection is
> still alive and send queries there and possibly wait for answers for
> quite some time. I do not know Galera well, but I am sure they have some
> flexibility there and maybe there are some options you can change
> b) the MySQL server configuration - the client can use various techniques
> to determine if the server is up - there are various timeouts you can
> configure (
> https://stackoverflow.com/questions/14726789/how-can-i-change-the-default-mysql-connection-timeout-when-connecting-through-py
> ):
> +----------------------------+----------+
> | Variable_name              | Value    |
> +----------------------------+----------+
> | connect_timeout            | 10       |
> | delayed_insert_timeout     | 300      |
> | innodb_lock_wait_timeout   | 50       |
> | innodb_rollback_on_timeout | OFF      |
> | interactive_timeout        | 28800    |
> | lock_wait_timeout          | 31536000 |
> | net_read_timeout           | 30       |
> | net_write_timeout          | 60       |
> | slave_net_timeout          | 3600     |
> | wait_timeout               | 28800    |
> +----------------------------+----------+
> However, I think this configuration should have limited impact - it might
> speed up the actual failover done by the proxy, but I think it will not
> help with the hanging itself (the first sketch after this list shows how
> you could inspect and adjust these timeouts).
> c) Finally (and THIS is probably what can help you immediately) - you can
> fine-tune the client configuration. In Airflow you can configure various
> SQLAlchemy parameters to better handle your deployment:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#core
> (just look for all the parameters starting with sqlalchemy). We use
> SQLAlchemy to connect to the metadata DB and it has everything that you
> need to fine-tune your configuration - for example - to set up timeouts
> for different situations. In your case you should probably configure
> `sql_alchemy_connect_args`:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-connect-args
> - you will see some links in our docs to SQLAlchemy that you can follow
> to see some examples. This is a simple dictionary of extra parameters
> that should be passed to SQLAlchemy engine initialization (the second
> sketch after this list shows one way it could look). Most likely you
> simply need to provide a client-controlled timeout on establishing the
> connection, on running a query, or both. Those parameters depend on the
> dialect used (MySQL/Postgres), and the available capabilities also differ
> depending on which library you use to connect to MySQL (the available
> libraries are listed here:
> https://docs.sqlalchemy.org/en/14/core/engines.html#mysql) - each of
> them has different parameters, so you need to check which ones are good
> for each library. However, I think one of { "timeout": N } or
> { "connect_timeout": N } should work in all the libraries.
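>
> As an illustration for b) above - a minimal Python sketch of how you
> could inspect and tighten those server-side timeouts. It assumes the
> PyMySQL driver; the host, user, password and database names are
> placeholders you would replace with your own haproxy endpoint:
>
>     import pymysql  # assumed driver - any MySQL client works similarly
>
>     # Placeholder credentials - point this at your haproxy endpoint.
>     conn = pymysql.connect(host="haproxy-host", user="airflow",
>                            password="secret", database="airflow")
>     try:
>         with conn.cursor() as cur:
>             # List the server-side timeouts (the table above).
>             cur.execute("SHOW VARIABLES LIKE '%timeout%'")
>             for name, value in cur.fetchall():
>                 print(name, value)
>             # Example: lower the idle-connection timeout for this
>             # session so a dead server is noticed sooner.
>             cur.execute("SET SESSION wait_timeout = 300")
>     finally:
>         conn.close()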
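>
> And for c) - a minimal sketch of what `sql_alchemy_connect_args` could
> look like. As the linked docs describe, the option takes an import path
> to a dictionary, so you put the dict in any module on the PYTHONPATH
> (the module name below is just an example) and point the config at it.
> The key names depend on your driver; `connect_timeout` (in seconds) is
> understood by e.g. mysqlclient and PyMySQL:
>
>     # airflow_local_connect_args.py - any importable module works;
>     # this file name is just an example.
>
>     # Fail fast instead of hanging when the proxy/server does not
>     # respond while establishing a connection.
>     CONNECT_ARGS = {"connect_timeout": 10}
>
>     # Then in airflow.cfg:
>     #   [core]
>     #   sql_alchemy_connect_args = airflow_local_connect_args.CONNECT_ARGS
>     # or via environment variable:
>     #   AIRFLOW__CORE__SQL_ALCHEMY_CONNECT_ARGS=airflow_local_connect_args.CONNECT_ARGS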
>
> There is also one other parameter that might help:
> https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#sql-alchemy-pool-pre-ping
> - it defaults to "True", so maybe you have it disabled and that's the root
> cause. This parameter makes SQLAlchemy run a quick test query on every
> connection taken from the pool, to make sure that the server is
> responding. It will help in case your proxy accepts the connection but -
> for whatever reason - it is stuck. Maybe that's the problem you have.
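>
> To illustrate what pre-ping does under the hood - it is roughly
> equivalent to the plain SQLAlchemy sketch below (the connection URL is a
> placeholder; in Airflow it comes from sql_alchemy_conn and you control
> the flag via the config option above, not by calling create_engine
> yourself):
>
>     from sqlalchemy import create_engine
>
>     # Placeholder URL - Airflow builds its own from sql_alchemy_conn.
>     engine = create_engine(
>         "mysql+pymysql://airflow:secret@haproxy-host/airflow",
>         # Test each pooled connection with a lightweight "ping" before
>         # handing it out; stale connections are discarded and
>         # transparently replaced with fresh ones.
>         pool_pre_ping=True,
>     )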
>
> Just to summarize: I think that looking at how your proxy behaves and
> some simple fine-tuning of the Airflow SQLAlchemy configuration might
> help (especially if you do not see any obvious errors while you observe
> the "hangs"). However, if you see that there are errors in the Airflow
> logs that do not result in Airflow crashing - please let us know via an
> issue and we will take a look at it.
>
> J.
>
>
> On Tue, Jul 20, 2021 at 11:19 PM Rolf Fokkens <[email protected]>
> wrote:
>
>> Hi!
>>
>> Not sure if this is the proper place for this question; if not, please
>> let me know.
>>
>> We're running Airflow on a MariaDB/Galera cluster, and we're using
>> haproxy to provide HA connections. Sometimes (mostly due to maintenance)
>> one node is temporarily unavailable, which forces haproxy to drop
>> connections to this node, after which new connections are passed to
>> another (still running) node. This is quite common, and we use it for
>> other software too. See
>> https://galeracluster.com/library/documentation/ha-proxy.html for more
>> info.
>>
>> The issue we're running into, however, is the fact that Airflow gets
>> lost in this situation, which to Airflow is something like a dropped
>> connection. Airflow services seem to be running (they themselves think
>> they are running) but they're just stuck. So they don't do anything, and
>> we don't know it.
>>
>> Aiming at an HA setup, this is definitely not what we want. Colleagues
>> are actually now at the point that they disqualify Airflow.
>>
>> Of course I can provide more details if needed, but I'd like to know
>> first if this is the right place to bring this up.
>>
>> Best,
>>
>> Rolf
>>
>
>
> --
> +48 660 796 129
>


-- 
+48 660 796 129
