Locale-independent paragraph formatting [was Re: Texinfo 7.0.93 pretest available]

Gavin Smith Thu, 09 Nov 2023 13:26:44 -0800

(Reply accidentally not sent to list.)

----- Forwarded message from Gavin Smith <[email protected]> -----

Date: Thu, 9 Nov 2023 20:11:59 +0000
From: Gavin Smith <[email protected]>
To: Bruno Haible <[email protected]>
Subject: Locale-independent paragraph formatting [was Re: Texinfo 7.0.93
        pretest available]

On Tue, Oct 10, 2023 at 07:29:15PM +0200, Bruno Haible wrote:
> Given that the only encoding you want to deal with is UTF-8, Eli's suggestion
> to use GNU libunistring is better than my iconv() suggestion. It has functions
> for width determination:
> https://www.gnu.org/software/libunistring/manual/html_node/uniwidth_002eh.html
> 
> > but I doubt it is urgent to do before the release, as the current approach,
> > however flawed, has been in place and worked fairly well for a long time
> > (since the XS paragraph module was written).
> 
> Well, it does not work on Windows.
> 
> I agree with you that it's not urgent to do before the 7.1 release, since
> the Windows port is work-in-progress.

I have just pushed a commit (e3a28cc9bf) to use gnulib/libunistring
functions instead of the locale-dependent functions mbrtowc and wcwidth.
This allows for a significant simplification as we do not have to try
to switch to a UTF-8 encoded locale.
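
As a rough illustration of why no locale switch is needed once the input
is known to be UTF-8: decoding depends only on the bytes themselves.  The
sketch below is a hand-written stand-in, not the gnulib/libunistring code
used in the commit.  The name decode_utf8 is hypothetical.  It covers
decoding only and does not attempt width determination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from Texinfo, gnulib or libunistring):
   decode one UTF-8 sequence from s, where n bytes are available,
   into *out.  Returns the number of bytes consumed, or 0 if the
   input is empty, truncated or invalid.  No locale is consulted. */
size_t decode_utf8(const unsigned char *s, size_t n, uint32_t *out)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                 /* single-byte (ASCII) */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07;
    } else {
        return 0;                      /* stray continuation or bad lead */
    }

    if (n < len)
        return 0;                      /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                  /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates and out-of-range values. */
    if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    *out = cp;
    return len;
}
```

A caller would advance through a buffer by the returned byte count and
pass each decoded code point to a width function.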

I was not sure how to write a char32_t literal in the source code.
For example, where we previously had L'a' as a literal wchar_t letter 'a',
I changed this to U'a'.  I could not find much information online about
this syntax, or about whether it is widely supported by C compilers.  The
U prefix for char32_t is mentioned in a copy of the C11 standard I found
online and also in a C23 draft.

Section 6.4.4.4 "Character constants" (page 67):

    A wide character constant prefixed by the letter L has type wchar_t,
    an integer type defined in the <stddef.h> header; a wide character
    constant prefixed by the letter u or U has type char16_t or char32_t,
    respectively, unsigned integer types defined in the <uchar.h> header.

https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf

The "uchar" or "uchar-c23" modules are supposed to provide uchar.h on
platforms where it doesn't exist, but I highly doubt there is any
mechanism for providing a new character literal syntax if not supported
by the compiler.

Does anybody know if we could just write 'a' instead of U'a' and rely
on it being converted?

E.g. if you do

    char32_t c = 'a';

then afterwards, c should be equal to 97 (ASCII value of 'a').
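
The question can be checked with a small program.  This is a minimal
sketch, assuming a compiler that accepts the C11 U prefix, a usable
<uchar.h>, and an ASCII-compatible execution character set.  The function
names are hypothetical.  A plain character constant has type int and is
converted on assignment, so the comparison only shows that the two
spellings agree for basic characters such as 'a'.

```c
#include <assert.h>
#include <uchar.h>

/* Hypothetical check: returns 1 when a plain character constant,
   converted to char32_t, equals the U-prefixed constant. */
int plain_matches_prefixed(void)
{
    char32_t plain = 'a';       /* int constant, converted on assignment */
    char32_t prefixed = U'a';   /* char32_t constant (C11 and later) */
    return plain == prefixed;
}

/* Returns the value stored by the assignment from the question. */
unsigned long value_of_plain_a(void)
{
    char32_t c = 'a';
    return (unsigned long) c;
}
```

On a platform whose execution character set is not ASCII-based (EBCDIC,
for instance), the plain constant would carry that character set's value
for 'a', so the two spellings need not agree there.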

----- End forwarded message -----
