On 04/11/2020 12.30, Thomas De Schampheleire wrote:
El mié., 4 nov. 2020 a las 11:36, Valentin Kleibel
(<[email protected]>) escribió:
Hi Thomas,
I don't have answers but several questions.
Can you say more about how you set up the database connection? Do you do
anything special here? Why is it different with SSL than without?
About the setup:
the pg_hba.conf for the database contains the line:
hostssl all all 1.2.3.0/24 md5
which enforces the automatic and transparent use of ssl connections to
our database server (that is in the subnet 1.2.3.0/16). To reproduce
this on a local database you could set:
hostssl all all 127.0.0.1/32 md5
to use ssl with it.
I'm sure of one difference: with SSL a connection reused by a forked
process will for sure lead to an error because the connection is
stateful and will, for security reasons, fail if reused.
As far as I can see you would just need to give some extra parameters in
the sqlalchemy.url setting in your ini file.
We were looking through all documented sqlalchemy.url settings and
couldn't find anything of use here.
See for example the description here:
https://github.com/sqlalchemy/sqlalchemy/issues/4146
and a reference here:
https://www.postgresql.org/docs/10/libpq-connect.html#LIBPQ-CONNSTRING
As far as I understand it, you should be able to add ?sslmode=verify-full
to the database connection string, possible with references to
certificate or key files.
We took a look there.
The setting you refer to is needed if the certificate chain of the
database server can't be validate. It can be used when e.g. testing with
a local db and a self-signed cert. Our certificate chain validates and
we don't need this, the ssl connection also works in calls coming from
uwsgi.
But your mail seems to suggest that you have a connection created upfront
and would like this one to be reused?
We do not want to reuse a previously created connection but this is how
the celery workers seem to behave.
Which are the uwsgi-related settings in your ini file? The 'lazy=true'
setting is not a choice for Kallithea and should be enabled in any case.
Our uwsgi template for ini files gives this comment as clarification for
'lazy=true': "App *must* be loaded in workers - db connections can't be
shared"
The only related setting is lazy=true. As you are saying, the database
connections can't be shared. this can be achieved in multiple ways and
one of them (although maybe not the most efficient one) is to use the
lazy setting in uwsgi.
access through uwsgi does work without issues in our setup.
My understanding was that Kallithea creates the db connection in each
worker anyway, and that it shouldn't matter whether SSL is used or
not.
But if I understand things correctly, celery may have multiple
workers, and each worker may create additional forks. If the db is set
up for each worker, it would still be shared between the forks. Is
that matching your understanding and situation?
My understanding is, that the Database connection is only established
once and each celery worker sets up its own Session.
As the underlying SSL connection is not newly established for the worker
it is the shared between forked processes. This leads to the SSL error
which will be triggered if both processes do anything with on this
connection, even a keepalive.
I still wonder where the impact of SSL is. This sounds like a generic
problem, but perhaps it works most of the time if not using SSL, and
so we just don't see it?
The database connection itself might even be safe to use this way (I
honestly don't know) because of the use of sessions, but if SSL is used
on a lower level the SSL connection will for sure die.
Did you use Kallithea in this mode successfully before, perhaps on an older
version? Your mail seems to suggest that but I'm not sure.
Yes, we did use Kallithea before successfully with this setup, with
version 0.4.1. We think the problem arises with the upgrade due to the
way celery workers seem to behave with the new prefork worker model.
This change came with the upgrade of celery to version 4.
As far as I can see, prefork was also used in celery 3.
Could it be that other changes make the difference? For example, I see:
https://docs.celeryproject.org/en/stable/history/whatsnew-4.0.html?highlight=prefork#ofair-is-now-the-default-scheduling-strategy
"-Ofair is now the default scheduling strategy
To re-enable the default behavior in 3.1 use the -Ofast command-line option.
..."
It might be that this change brought the problem to the surface but it
is not a fix.
My understanding is that the problem existed before but the fast
scheduling strategy almost always reused the same process and thus the
ssl context was almost always fine.
The fair scheduling strategy might not show the problem as fast but in
the end no 2 workers will be able to run concurrently in this case either.
Note: I found this related question:
https://stackoverflow.com/questions/51466007/how-to-use-psycopg2-properly-in-a-prefork-celery-environment
but it does not involve sqlalchemy and does not really give immediate
solutions.
While looking around for solutions we found this blogpost:
https://virtualandy.wordpress.com/2019/09/04/a-fix-for-operationalerror-psycopg2-operationalerror-ssl-error-decryption-failed-or-bad-record-mac/
Which is about uwsgi but also describes an alternative solution to
setting lazy=true which is to call engine.dispose in order to make sure
a new connection is used.
This is the same solution we also found in the sqalchemy documentation
in the section about connection pools use with mutiprocessing or forked
processes:
https://docs.sqlalchemy.org/en/13/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork
We implemented this in kallithea/lib/celerylib/__init__.py for the
celery workers.
Attached is a diff for the current stable release which works well in
our setup, although we are not sure if it has any unwanted side effects
we just didn't encounter by now.
Thanks for the patch and links.
So with that patch, you no longer see issues and you can have
'celery.worker_concurrency' set higher than 1 ?
Yes, we have set:
celery.worker_concurrency = 2
celery.worker_max_tasks_per_child = 1
without any further changes and did no longer see those issues.
I would like to await the analysis of Mads Kiilerich on this.
I'm looking forward to it,
Thanks,
Valentin
_______________________________________________
kallithea-general mailing list
[email protected]
https://lists.sfconservancy.org/mailman/listinfo/kallithea-general