> From: Gavin Smith <[email protected]>
> Date: Fri, 10 Nov 2023 19:48:04 +0000
> Cc: Bruno Haible <[email protected]>, [email protected]
>
> On Fri, Nov 10, 2023 at 08:47:10AM +0200, Eli Zaretskii wrote:
> > > Does anybody know if we could just write 'a' instead of U'a' and rely
> > > on it being converted?
> > >
> > > E.g. if you do
> > >
> > > char32_t c = 'a';
> > >
> > > then afterwards, c should be equal to 97 (ASCII value of 'a').
> >
> > Why not? What could be the problems with using this?
>
> I think what was confusing me was the statement that char32_t held a
> UTF-32 encoded Unicode character. I then thought it would have a certain
> byte order, so if the UTF-32 was big endian, the bytes would have the
> order 00 00 00 61, whereas the value 97 on a little endian machine would
> have the order 61 00 00 00. However, it seems that UTF-32 just means the
> codepoint is encoded as a 32-bit integer, and the endianness of the
> UTF-32 sequence can be assumed to match the endianness of the machine.
> The standard C integer conversions can be assumed to work when assigning
> to/from char32_t because it is just an integer type, I assume.
AFAIU, since a codepoint in UTF-32 always fits in a single 32-bit code unit, the issue of endianness doesn't arise here. Byte order matters only when UTF-16 or UTF-32 text is serialized to a byte stream (that is what the BE and LE variants of those encodings specify); for a value held in a char32_t variable, the byte order simply follows the machine, because char32_t is just another integer type.
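
FWIW, a minimal test program along these lines (just a sketch, assuming a C11 compiler and an ASCII-based execution character set) shows that nothing more than the ordinary integer conversions is involved:

  #include <assert.h>
  #include <stdio.h>
  #include <uchar.h>   /* for char32_t (C11) */

  int
  main (void)
  {
    char32_t c = 'a';   /* plain character literal, converted on assignment */

    /* c now holds the codepoint value; the usual integer conversions and
       comparisons apply, regardless of the machine's byte order.
       (The value 97 assumes an ASCII-based execution character set.)  */
    assert (c == 97);
    assert (c == U'a');

    printf ("c = %lu\n", (unsigned long) c);
    return 0;
  }

Running this on a big-endian or a little-endian machine gives the same result, since the value never leaves the program as a byte sequence.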
