Giving the shared catalogues a defined encoding

Thomas Munro Tue, 03 Dec 2024 04:36:56 -0800

Hello hackers,

Here's a WIP patch that started on a bugs thread[1].

Problem #1: You can have two databases with different encodings, and
they both pretend that pg_database, pg_authid, pg_db_role_setting etc
are in the local database encoding. That doesn't work too well:
non-ASCII text can be reinterpreted in the wrong encoding.

There's no problem if you only use one encoding everywhere (probably
UTF8). There's also no problem if you use multiple database
encodings, but put only ASCII in the shared catalogues (because ASCII
is a subset of every supported server encoding). This patch is about
formalising and enforcing those two working arrangements, hopefully
invisibly to most users. There's still an escape hatch mode if you
need it, e.g. for a non-conforming pg_upgrade'd system.

The patch invents a new setting CLUSTER CATALOG ENCODING, which can be
inspected with SHOW and changed with ALTER SYSTEM. It has three
possible values:

DATABASE: The shared catalogs use the same encoding as this database,
and all databases in this cluster, and all databases have to use the
default encoding configured at initdb time. Database names and roles
names are free to use any characters you like in that one single
encoding. This is the default.

ASCII: The shared catalogs are restricted to 7-bit ASCII, but in
exchange, databases with different encodings are allowed to co-exist.

UNDEFINED: The old behavior, no restrictions.

There's some documentation in the patch to explain that again in more
words, and a regression transcript showing the behaviour, ie things
you can and can't do in each mode, and how the transitions between
modes can be blocked until you make certain changes.

Problem #2: When dealing with new connections, we currently have
trouble with non-ASCII database and role names because the encoding is
undefined for both the catalogue and the network message. With this
patch, at least the catalogue encoding is defined (unless UNDEFINED),
so there's a pathway to straighten that out.

I am open to better terminology, models, etc. The command seems
verbose, but I hope you'd almost never need to run it, so being clear
seemed better than being brief. I had just CATALOG ENCODING in the
previous version, but then it's not clear that it only affects a few
special catalogues (pg_class et al are always in database encoding, as
they're not shared). I tried SHARED CATALOG ENCODING, but that's
not really a SQL word or concept. CLUSTER is, so here I'm trying
that. On the other hand CLUSTER is a bit overloaded. I had explicit
encoding names eg SET ... TO UTF8 in the previous version, but it
seems easier to call it DATABASE encoding given it had to match the
database anyway if not ASCII/UNDEFINED... There could be other ways
to express all this, though. It does still store the real encoding in
the control file, UTF8 -> 6 or whatever, in case that is useful. When
I was using real encoding names in the syntax, I also had SQL_ASCII
for ASCII mode, but that was quite confusing because SQL_ASCII is well
documented as accepting anything at all, whereas here we need 7-bit
ASCII.

Feedback welcome.

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGKKNAc599Vp7kFAnLE1%3DV%3DceYujz_YQoSNrvNFGaJ6i7w%40mail.gmail.com#dfb55ce46ace161d81d60c44229e7c80

v2-0001-Formalize-the-encoding-of-the-shared-catalogs.patch
Description: Binary data

Giving the shared catalogues a defined encoding

Reply via email to