Hi all,

$subject is something that has been on my mind for a few weeks now, following the recent events with CVE-2025-4207 (627acc3caa74) and CVE-2025-1094 (5dc1e42b4fa6).

All the supported encodings are documented here:
https://www.postgresql.org/docs/devel/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED

One pain point in the code is the GB18030 encoding, which has the particularity of requiring a look at the first two bytes of an input to know the actual length of a multi-byte character sequence. This is not documented, and it can be a trap in disguise, particularly in the frontend code (see jsonapi.c).

With all that in mind, I have wanted to kick off a discussion about potentially removing one or more encodings from the core code, including the backend part, the frontend part and the conversion routines, coupled with checks in pg_upgrade to complain when databases or collations use the said encoding (the collation part needs to be checked when not using ICU). Just being able to remove GB18030 would do us a favor in the long term, at least, but there's more.

I have discussed the matter internally, with a few things pointed out:
- One thing that I was considering first would be the possibility to add support for pluggable encodings in the backend code, giving an option for retired encodings to be loaded back into the server, with a concept close to what we do for WAL RMGRs, with IDs stuck in time once defined and catalogs using pg_enc. Encouraging users to have their own encodings, particularly ones that we'd consider to be unsafe by design like the GB one, may not be a good idea. But there is always the argument that users may not want to pay the cost of a set of ALTER DATABASE commands. Nobody really liked this idea of putting the encoding responsibility into an extension :D
- Another idea, mentioned by Jeff Davis, is around the Unicode code point U+FFFD (didn't know about this one), which can be used to replace an incoming character whose value is unknown. One strategy would then be to map encodings whose internals are dropped to use UTF-8 under the hood, with this character as an exit path when finding characters that cannot be understood, meaning partial and silent data loss.

Another set of things (also mentioned by Jeff, as he's been diving into this area a lot for the last few years with Jeremy Schneider) that could also help $subject in the long run would be to try removing some code used for non-UTF8 cases. Some examples:
- downcase_identifier() and pgstrcasecmp.c mention the specific case of Turkish with 'i' and 'I'.
- Simplify regc_pg_locale.c, which is unable to support non-UTF8 encodings with characters of more than 2 bytes.
- pg_wchar's uint type could be removed, switched to a codepoint value (?) (pointed out by Jeff).
- Varlena cases with non-UTF8, like text_position_setup().

In theory, what we could aim for here is to move forward with non-UTF8 encodings in the server, potentially moving away from libc. That's a larger project, so it may be better to try something with some of the low-hanging fruits like the non-UTF8 cases.

This last paragraph does not really change my opinion about GB18030: I'd like to propose its removal for v19, because looking at the first two bytes of a character sequence to know how long the full sequence is stands as an exception compared to all the other encodings supported by Postgres.

Anyway, in the end, all of this is about removing code. A large majority of users use UTF-8, and we could improve things. Feel free to use this thread if you have different ideas or if you have any comments.

Thanks,
--
Michael
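
P.S. To make the GB18030 point concrete, here is a minimal sketch of why the first byte alone is not enough to know the sequence length. This is not the code from the tree: the function name gb18030_seq_len() and the "remaining" parameter are hypothetical, made up for illustration, and the sketch only classifies the length without validating the trail bytes. It assumes the usual GB18030 layout, where a four-byte sequence has a second byte in the range 0x30..0x39.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (illustration only, not PostgreSQL's code).
 *
 * Return the byte length of the GB18030 sequence starting at s, or -1 if
 * the input is truncated.  "remaining" is the number of bytes available
 * at s.  No validation of the trail bytes is attempted here.
 */
int
gb18030_seq_len(const unsigned char *s, size_t remaining)
{
	assert(s != NULL);

	if (remaining == 0)
		return -1;

	/* ASCII: the first byte alone decides. */
	if (s[0] < 0x80)
		return 1;

	/*
	 * High bit set: the lead byte cannot distinguish a 2-byte sequence
	 * from a 4-byte one, so a second byte must be available.
	 */
	if (remaining < 2)
		return -1;

	/* A second byte in 0x30..0x39 marks a 4-byte sequence. */
	if (s[1] >= 0x30 && s[1] <= 0x39)
		return (remaining >= 4) ? 4 : -1;

	return 2;
}
```

The awkward part is the middle check: a caller that holds only one byte of a multi-byte sequence cannot tell how many more bytes to expect, which is the kind of situation frontend code reading input incrementally has to be careful about.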