[ breaking off a different new thread ]

Chapman Flack <c...@anastigmatix.net> writes:
> Then there's "char". It's category S, but does not apply the server
> encoding. You could call it an 8-bit int type, but it's typically used
> as a character, making it well-defined for ASCII values and not so
> for others, just like SQL_ASCII encoding. You could as well say that
> the "char" type has a defined encoding of SQL_ASCII at all times,
> regardless of the database encoding.

This reminds me of something I've been intending to bring up, which
is that the "char" type is not very encoding-safe. charout() for
example just regurgitates the single byte as-is. I think we deemed
that okay the last time anyone thought about it, but that was when
single-byte encodings were the mainstream usage for non-ASCII data.
If you're using UTF8 or another multi-byte server encoding, it's
quite easy to get an invalidly-encoded string this way, which at
minimum is going to break dump/restore scenarios.

I can think of at least three ways we might address this:

* Forbid all non-ASCII values for type "char". This results in
simple and portable semantics, but it might break usages that
work okay today.

* Allow such values only in single-byte server encodings. This
is a bit messy, but it wouldn't break any cases that are not
problematic already.

* Continue to allow non-ASCII values, but change charin/charout,
char_text, etc so that the external representation is encoding-safe
(perhaps make it an octal or decimal number).

Either of the first two ways would have to contemplate what to do
with disallowed values that snuck into the DB via pg_upgrade.
That leads me to think that the third way might be the most
preferable, even though it's not terribly backward-compatible.

There's a nearby issue that we might do something about at the
same time, which is that chartoi4() and i4tochar() think that
the byte value of a "char" is signed, while all the other
operations treat it as unsigned. I wouldn't be too surprised if
this behavior is the direct cause of the bug fixed in a6bd28beb.
The issue vanishes if we forbid non-ASCII values, but otherwise
I'd be inclined to change these functions to treat the byte
values as unsigned.

Thoughts?

			regards, tom lane
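
PS: to make the two problems concrete, here is a minimal Python sketch
that models them outside the server. None of this is PostgreSQL code.
The helper names (raw_output_is_valid_utf8, char_as_signed,
char_as_unsigned, octal_output) are hypothetical, and the backslash-octal
format is only one possible reading of "an octal or decimal number",
not a proposal for the exact syntax.

```python
# Illustrative model only; these helpers are hypothetical and do not
# correspond to any actual PostgreSQL function.

def raw_output_is_valid_utf8(byte_value: int) -> bool:
    """Model an output function that emits the single byte as-is,
    then check whether that one-byte string is valid UTF-8."""
    try:
        bytes([byte_value]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def char_as_signed(byte_value: int) -> int:
    """Read the byte as a signed 8-bit integer (the signed view)."""
    return int.from_bytes(bytes([byte_value]), "big", signed=True)


def char_as_unsigned(byte_value: int) -> int:
    """Read the byte as an unsigned 8-bit integer (the unsigned view)."""
    return int.from_bytes(bytes([byte_value]), "big", signed=False)


def octal_output(byte_value: int) -> str:
    """Assumed encoding-safe external form: ASCII bytes are printed
    as themselves, anything else as a backslash plus three octal digits."""
    if byte_value < 0x80:
        return chr(byte_value)
    return "\\%03o" % byte_value


if __name__ == "__main__":
    b = 0xE9  # a non-ASCII byte, e.g. e-acute in LATIN1
    print(raw_output_is_valid_utf8(b))   # False: a lone high byte is not UTF-8
    print(char_as_signed(b))             # -23
    print(char_as_unsigned(b))           # 233
    print(octal_output(b))               # \351
```

Every byte value from 0x80 through 0xFF fails the UTF-8 check when it
stands alone, which is why the as-is output is a problem in a UTF8
database, and the -23 versus 233 split is the signed/unsigned
disagreement described above.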