> From: Gavin Smith <[email protected]>
> Date: Fri, 10 Nov 2023 19:48:04 +0000
> Cc: Bruno Haible <[email protected]>, [email protected]
>
> On Fri, Nov 10, 2023 at 08:47:10AM +0200, Eli Zaretskii wrote:
> > > Does anybody know if we could just write 'a' instead of U'a' and rely
> > > on it being converted?
> > >
> > > E.g. if you do
> > >
> > > char32_t c = 'a';
> > >
> > > then afterwards, c should be equal to 97 (ASCII value of 'a').
> >
> > Why not? What could be the problems with using this?
>
> I think what was confusing me was the statement that char32_t held a
> UTF-32 encoded Unicode character. I then thought it would have a certain
> byte order, so if the UTF-32 was big endian, the bytes would have the
> order 00 00 00 61, whereas the value 97 on a little endian machine would
> have the order 61 00 00 00. However, it seems that UTF-32 just means the
> codepoint is encoded as a 32-bit integer, and the endianness of the
> UTF-32 sequence can be assumed to match the endianness of the machine.
> The standard C integer conversions can be assumed to work when assigning
> to/from char32_t because it is just an integer type, I assume.
AFAIU, since a codepoint in UTF-32 always fits in a single 32-bit code unit, the issue of endianness doesn't arise here. Byte order matters only when UTF-16 or UTF-32 text is serialized to a byte stream (that is what the BE and LE variants of those encodings specify); for a value held in a char32_t variable, the byte order simply follows the machine, because char32_t is just another integer type.
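
FWIW, a minimal test program along these lines (just a sketch, assuming a C11 compiler and an ASCII-based execution character set) shows that nothing more than the ordinary integer conversions is involved:

  #include <assert.h>
  #include <stdio.h>
  #include <uchar.h>   /* for char32_t (C11) */

  int
  main (void)
  {
    char32_t c = 'a';   /* plain character literal, converted on assignment */

    /* c now holds the codepoint value; the usual integer conversions and
       comparisons apply, regardless of the machine's byte order.
       (The value 97 assumes an ASCII-based execution character set.)  */
    assert (c == 97);
    assert (c == U'a');

    printf ("c = %lu\n", (unsigned long) c);
    return 0;
  }

Running this on a big-endian or a little-endian machine gives the same result, since the value never leaves the program as a byte sequence.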
