Hello hackers, Here's a WIP patch that started on a bugs thread[1].
Problem #1: You can have two databases with different encodings, and they both pretend that pg_database, pg_authid, pg_db_role_setting etc are in the local database encoding. That doesn't work too well: non-ASCII text can be reinterpreted in the wrong encoding. There's no problem if you only use one encoding everywhere (probably UTF8). There's also no problem if you use multiple database encodings, but put only ASCII in the shared catalogues (because ASCII is a subset of every supported server encoding). This patch is about formalising and enforcing those two working arrangements, hopefully invisibly to most users. There's still an escape hatch mode if you need it, e.g. for a non-conforming pg_upgrade'd system. The patch invents a new setting CLUSTER CATALOG ENCODING, which can be inspected with SHOW and changed with ALTER SYSTEM. It has three possible values: DATABASE: The shared catalogs use the same encoding as this database, and all databases in this cluster, and all databases have to use the default encoding configured at initdb time. Database names and roles names are free to use any characters you like in that one single encoding. This is the default. ASCII: The shared catalogs are restricted to 7-bit ASCII, but in exchange, databases with different encodings are allowed to co-exist. UNDEFINED: The old behavior, no restrictions. There's some documentation in the patch to explain that again in more words, and a regression transcript showing the behaviour, ie things you can and can't do in each mode, and how the transitions between modes can be blocked until you make certain changes. Problem #2: When dealing with new connections, we currently have trouble with non-ASCII database and role names because the encoding is undefined for both the catalogue and the network message. With this patch, at least the catalogue encoding is defined (unless UNDEFINED), so there's a pathway to straighten that out. I am open to better terminology, models, etc. The command seems verbose, but I hope you'd almost never need to run it, so being clear seemed better than being brief. I had just CATALOG ENCODING in the previous version, but then it's not clear that it only affects a few special catalogues (pg_class et al are always in database encoding, as they're not shared). I tried SHARED CATALOG ENCODING, but that's not really a SQL word or concept. CLUSTER is, so here I'm trying that. On the other hand CLUSTER is a bit overloaded. I had explicit encoding names eg SET ... TO UTF8 in the previous version, but it seems easier to call it DATABASE encoding given it had to match the database anyway if not ASCII/UNDEFINED... There could be other ways to express all this, though. It does still store the real encoding in the control file, UTF8 -> 6 or whatever, in case that is useful. When I was using real encoding names in the syntax, I also had SQL_ASCII for ASCII mode, but that was quite confusing because SQL_ASCII is well documented as accepting anything at all, whereas here we need 7-bit ASCII. Feedback welcome. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKKNAc599Vp7kFAnLE1%3DV%3DceYujz_YQoSNrvNFGaJ6i7w%40mail.gmail.com#dfb55ce46ace161d81d60c44229e7c80
v2-0001-Formalize-the-encoding-of-the-shared-catalogs.patch
Description: Binary data