types reliant on encodings [was Re: Dubious usage of TYPCATEGORY_STRING]

Chapman Flack Fri, 03 Dec 2021 10:42:52 -0800

On 12/02/21 16:22, Tom Lane wrote:
> ... types belonging to the STRING category, which are predicated
> on the assumption that such types are reasonably general-purpose
> string types.


This prods me to submit a question I've been incubating for a while.

Is there any way to find out, from the catalogs or in any automatable way,
which types are implemented with a dependence on the database encoding
(or on some encoding)?

You might think S category types, for a start: name, text, character,
varchar, all dependent on the server encoding, as you'd expect. The ones
Tom moves here to category Z were most of the ones I wondered about.

Then there's "char". It's category S, but does not apply the server
encoding. You could call it an 8-bit int type, but it's typically used
as a character, making it well-defined for ASCII values and not so
for others, just like SQL_ASCII encoding. You could as well say that
the "char" type has a defined encoding of SQL_ASCII at all times,
regardless of the database encoding.

U types are a mixed bag. Category U includes bytea (no character encoding)
and xml/json/jsonb (server encoding). Also tied to the server encoding
are cstring and unknown.

As an aside, I think it's unfortunate that the xml type has this implicit
dependency on the server encoding, when XML is by definition Unicode.
It means there are valid XML documents that PostgreSQL may not be able
to store, and which documents those are depends on what the database
encoding is. I think json and jsonb suffer in the same way.

Changing that would be disruptive at this point and I'm not suggesting it,
but there might be value in the thought experiment to see what the
alternate universe would look like.

In the alternate world, you would know that certain datatypes were
inherently encoding-oblivious (numbers, polygons, times, ...), certain
others are bound to the server encoding (text, varchar, name, ...), and
still others are bound to a known encoding other than the server encoding:
the ISO SQL NCHAR type (bound to an alternate configurable database
encoding), "char" (always SQL_ASCII), xml/json/jsonb (always with the full
Unicode repertoire, however they choose to represent it internally).

That last parenthetical reminded me that I'm really talking
about 'repertoire' here, which ISO SQL treats as a separate topic from
'encoding'. Exactly how an xml or jsonb type is represented internally
might be none of my business (unless I am developing a binary-capable
driver), but it's fair to ask what its repertoire is, and whether that's
full Unicode or not, and if not, whether the repertoire changes when some
server setting does.

I also think in that ideal world, or even this one, you could want
some way to query the catalogs to answer that basic question
about some given type.

Am I right that we simply don't have that? I currently answer such questions
by querying the catalog for the type's _send or _recv function name, then
going off to read the code, but that's hard to automate.

Regards,
-Chap

types reliant on encodings [was Re: Dubious usage of TYPCATEGORY_STRING]

Reply via email to