potiuk commented on issue #18011:
URL: https://github.com/apache/airflow/issues/18011#issuecomment-923419536
> IMHO Airflow should not be falling over in the heartbeats b/c of a
> first-time missed connection. There should be some intelligent retry logic in
> the heartbeats...
Actually, I do not agree with that statement.
Airflow should rely on the metadata database being available at all times,
and losing connectivity in the middle of a transaction should not be handled by
Airflow. That adds terrible complexity to the code and IMHO is not needed to
deal with this kind of (apparent) connectivity instability, especially
since this is a timeout on trying to connect to the database. With
SQLAlchemy and the ORM we often do not control when a
session and connection are going to be established, and trying to handle all such
failures at the application level is complex.
It is also not needed at the application level, especially in the case of
Postgres. For a long time (including in our Helm Chart) we have
recommended that everyone using Postgres put PGBouncer in front of it as a
proxy. It also deals nicely with a large number of open connections (Postgres is
not good at handling many parallel connections: its connection model is
process-based, so it is resource-hungry when many connections
are open).
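
To make "use PGBouncer as a proxy" concrete on the Airflow side, here is a minimal sketch, not an official recipe. It assumes a PGBouncer on the same machine listening on port 6432, and that `sql_alchemy_conn` lives in the `[core]` section (newer Airflow releases may use a different section). The helper name `pgbouncer_conn_uri`, the credentials, and the database name are all hypothetical.

```python
import os
from urllib.parse import quote_plus


def pgbouncer_conn_uri(user, password, database, host="127.0.0.1", port=6432):
    """Build a SQLAlchemy URI that targets PGBouncer rather than Postgres itself.

    host/port are assumptions: a PGBouncer on the same machine, listening on
    6432 instead of the usual Postgres port 5432.
    """
    # quote_plus escapes characters such as "/" or "@" in the credentials.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials and database name, for illustration only.
uri = pgbouncer_conn_uri("airflow", "s3cr3t/pw", "airflow")

# Assumption: the setting is read from the [core] section of the config.
os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = uri
print(uri)
```

The only thing that changes for Airflow is the host and port it connects to; PGBouncer owns the real connections to Postgres.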
PGBouncer not only manages connection pools shared between
components, it also lets you react to network conditions like this one.
First of all, it reuses existing connections, so there are far fewer
connection open/close events between PGBouncer and the database. All the
connections opened by Airflow go to a locally available PGBouncer, which
makes them totally resilient to networking issues. PGBouncer then handles
errors on its own connections to the database, and you can fine-tune that
handling if you have connectivity problems with your database.
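
As a rough illustration of the knobs involved, the sketch below renders a `pgbouncer.ini` fragment with Python's `configparser`. The setting names are real PGBouncer options, but every value (host name, pool sizes, timeouts, and the choice of `transaction` pool mode) is an assumption for illustration, and authentication settings are left out, so treat it as a starting point rather than a recommended configuration.

```python
import configparser
import io

# Hypothetical values throughout; tune them for your own network and database.
config = configparser.ConfigParser()

# Logical database name that clients connect to -> the real Postgres server.
config["databases"] = {
    "airflow": "host=postgres.example.internal port=5432 dbname=airflow",
}

config["pgbouncer"] = {
    # Where Airflow components connect (local, so no flaky network hop).
    "listen_addr": "127.0.0.1",
    "listen_port": "6432",
    # Assumption: transaction pooling, so server connections are shared.
    "pool_mode": "transaction",
    "max_client_conn": "200",
    "default_pool_size": "20",
    # Seconds to wait for connect + login to Postgres before giving up.
    "server_connect_timeout": "15",
    # Seconds to wait before retrying after a failed connect or login.
    "server_login_retry": "5",
}

buffer = io.StringIO()
config.write(buffer)
pgbouncer_ini = buffer.getvalue()
print(pgbouncer_ini)
```

`server_connect_timeout` and `server_login_retry` are the two settings most directly related to the connect timeouts described in this issue.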
@WattsInABox - can you please add PGBouncer(s) to your deployment and let
us know whether that improves the situation? I think this is not even a
workaround - it's actually a good solution (which we generally recommend for
any deployment with Postgres).
I will convert this into a discussion until we hear back from you about your
experience with PGBouncer and, if those problems are still occurring after you
get PGBouncer running, with a reproducible case.