On 12/03/21 14:12, Tom Lane wrote:
> This reminds me of something I've been intending to bring up, which
> is that the "char" type is not very encoding-safe.  charout() for
> example just regurgitates the single byte as-is.
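(To make the hazard concrete, here is a toy model of that behavior in Python, not the actual C code: `charout` below just stands for "emit the stored byte unchanged", and the check shows why a lone high-bit byte is a problem the moment anything downstream assumes UTF-8.)

```python
# Toy model: emit the stored byte as-is, with no regard for any encoding.
# "charout" here is only a stand-in name for the real C function's effect.
def charout(byte_value: int) -> bytes:
    return bytes([byte_value])

# An ASCII byte survives in any common server encoding:
assert charout(0x41) == b"A"

# A high-bit byte (0xE9 is 'é' in LATIN1) is NOT valid UTF-8 on its own:
lone_byte = charout(0xE9)
try:
    lone_byte.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert not valid_utf8  # garbage if the surrounding stream claims UTF-8
```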
I wonder if what to do about that lies downstream of some other thinking about encoding-related type properties. ISTM we don't, at present, have a clear story for types that have an encoding (or repertoire) property that isn't one of (inapplicable, server_encoding). And yet such things exist, and more such things could or should exist (NCHAR, healthier versions of xml or json, ...). "char" is an existing example, because its current behavior is exactly as if it declared "I am one byte of SQL_ASCII regardless of the server setting". That is no trouble at all when the server setting is also SQL_ASCII. But what does it mean when the server setting and the inherent repertoire property of a type differ? The present answer isn't pretty.

When can charout() be called? typoutput functions don't take any 'internal' parameters, so nothing stops user code from calling them; I don't know how often that's done, and that's a complication. The canonical place for it to be called is inside printtup(), when the client driver has requested format 0 for that attribute. Up to that point, we could have known the type had SQL_ASCII wired in, but after charout() we have a cstring, and printtup treats that type as having the server encoding: the value goes through conversion from the server encoding to the client encoding in pq_sendcountedtext. Indeed, cstring behaves entirely as a type with the server encoding. If you send a cstring with format 1 rather than format 0, it is no longer subject to the encoding conversion done in pq_sendcountedtext, but it dutifully performs the same conversion in its own cstring_send. unknownsend is the same way.

But of course a "char" column in format 1 would never go through cstring; char_send would be called, and would just plop the byte into the buffer unchanged (which is the same operation as an encoding conversion from SQL_ASCII to anything).
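(A toy model of that format-0 funnel, with names borrowed from the functions above but none of it real server code: the conversion step blindly assumes its input really is in the server encoding, which charout's output need not be.)

```python
# Toy model of the format-0 path's final step: a blind
# server_encoding -> client_encoding conversion, as in pq_sendcountedtext.
# By this point the value is just a cstring; its original type is gone.
def blind_transcode(cstring: bytes, server_enc: str, client_enc: str) -> bytes:
    return cstring.decode(server_enc).encode(client_enc)

# Fine when the cstring genuinely has the server encoding:
assert blind_transcode("é".encode("latin-1"), "latin-1", "utf-8") == b"\xc3\xa9"

# But charout's output is "one byte of SQL_ASCII"; if the server encoding
# is UTF-8, the blind conversion chokes on a lone high-bit byte:
try:
    blind_transcode(b"\xe9", "utf-8", "latin-1")
    converted = True
except UnicodeDecodeError:
    converted = False
assert not converted
```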
Ever since I figured out that you have to look at a type's send/recv functions to learn whether it is encoding-dependent, I have had to walk myself through those steps again every time I forget why that is. Having a type's character-encoding details show up in its send/recv functions, and not in its in/out functions, never stops being counterintuitive to me. But for server-encoding-dependent types, that's how it is: you don't see it in the typoutput function, because on the format-0 path the transcoding happens in pq_sendcountedtext. On the format-1 path, the same transcoding happens, this time under the type's own control in its typsend function.

That was the second thing that surprised me: we have what we call a text path and a binary path, but for an encoding-dependent type, neither one is a path where transcoding doesn't happen! The difference is that the format-0 transcoding is applied blindly, in pq_sendcountedtext, with no surviving information about the data type (which has become cstring by that point). On the format-1 path, in contrast, the type's typsend is in control. In theory, that would allow type-aware conversion: a smarter xml_send could use &#n; form for characters that won't go in the client encoding, where the blind pq transcoding on format 0 would just botch the data.

XML, in an ideal world, might live on disk in a form that cares nothing for the server encoding, be sent directly over the wire to a client (it declares what encoding it's in), and be presented to the application over an XML-aware API that isn't hamstrung by the client's default text encoding either. But in the present world, we have somehow arrived at a setup where there are only two paths it can take, and either one is a funnel that can only be passed by data that survives both the client and the server encoding. The FE/BE docs have said "Text has format code zero, binary has format code one, and all other format codes are reserved for future definition" ever since 7.4.
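(Such a type-aware conversion is easy to sketch. The helper below is invented for illustration, not anything in the tree: it does what a smarter xml_send could do, emitting &#n; character references for codepoints the client encoding can't represent, instead of failing the way a blind transcode would.)

```python
# Hypothetical type-aware conversion for XML: characters that won't fit
# in the client encoding become &#n; numeric character references.
def xml_escape_for_encoding(text: str, client_enc: str) -> bytes:
    out = []
    for ch in text:
        try:
            ch.encode(client_enc)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("&#%d;" % ord(ch))
    return "".join(out).encode(client_enc)

# 'é' doesn't fit in ASCII, so it becomes &#233; instead of an error:
assert xml_escape_for_encoding("héllo", "ascii") == b"h&#233;llo"
# When the client encoding can hold it, the character passes through:
assert xml_escape_for_encoding("héllo", "latin-1") == "héllo".encode("latin-1")
```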
Maybe the time will come for a format 2, where you say "here's an encoding ID and some bytes"?

This has rambled a bit far afield from "what should charout do with non-ASCII values?". But honestly, either nobody is storing non-ASCII values in "char", in which case we could make any choice there and nothing would break, or somebody is doing that, and their stuff would be broken by any change we choose. So, is the current "char" situation so urgent that it demands some one-off solution be chosen for it, or could it be neglected with minimal risk until someday we've defined what "this datatype has encoding X that's different from the server encoding" means, and that takes care of it?

Regards,
-Chap
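P.S. A toy sketch of what such a "format 2" frame might look like. Everything here is invented for illustration (the ID table, the 4-byte field, the function names); the point is only that the payload bytes never have to survive either the server or the client encoding.

```python
import struct

# Hypothetical "format 2" framing: a 4-byte encoding ID, then raw bytes.
ENC_IDS = {"sql_ascii": 0, "utf-8": 6, "latin-1": 8}  # made-up ID table

def send_format2(data: bytes, encoding: str) -> bytes:
    return struct.pack("!i", ENC_IDS[encoding]) + data

def recv_format2(frame: bytes) -> tuple:
    (enc_id,) = struct.unpack("!i", frame[:4])
    name = next(k for k, v in ENC_IDS.items() if v == enc_id)
    return name, frame[4:]

# The LATIN1 byte 0xE9 rides along untouched, tagged with its encoding:
frame = send_format2("é".encode("latin-1"), "latin-1")
assert recv_format2(frame) == ("latin-1", b"\xe9")
```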