Hello Ingo,

On Tue, Aug 19, 2025 at 05:39:13PM +0200, Ingo Schwarze wrote:
> Hi Walter,
> 
> Walter Alejandro Iglesias wrote on Mon, Aug 18, 2025 at 06:40:04PM +0200:
> 
> > Question for the experts.  Let's take the following example:
> > 
> > ----->8------------->8--------------------
> > #include <stdio.h>
> > #include <string.h>
> > #include <wchar.h>
> > 
> > #define period              0x2e
> > #define question    0x3f
> > #define exclam              0x21
> > #define ellipsis    L'\u2026'
> > 
> > const wchar_t p[] = { period, question, exclam, ellipsis };
> 
> In addition to what otto@ said, this is bad style for more than one
> reason.
> 
> First of all, the data type of the constant "0x2e" is "int",
> see for example C11 6.4.4.1 (Integer constants).  Casting "int"
> to "wchar_t" doesn't really make sense.  On OpenBSD, it only
> works because UTF-8 is the only supported character encoding *and*
> wchar_t stores Unicode codepoints.  But neither of these choices
> is portable.  What you want is (C11 6.4.4.4 Character constants):
> 
>   #define period      L'.'
>   #define question    L'?'
>   #define exclam      L'!'

As I explain below, I did that in a program I wrote to work with UTF-8
only.  But I'll follow your advice and adopt this practice from now on.

> 
> > int
> > main()
> > {
> >     const wchar_t s[] = L". Hello.";
> > 
> >     printf("%ls\n", s);
> >     printf("%lu\n", wcsspn(s, p));
> 
> The return value of wcsspn(3) is size_t, so this should use %zu.

Yeah, the compiler warned me about this.  I wrote the example
carelessly.

> 
> Besides, given that the second argument of wcsspn(3)
> takes "const wchar_t *", why not simply:
> 
>   const wchar_t *p = L".?!\u2026";

I'd tried this:

  const wchar_t p[] = L".?!\u2026";

and saw that it solved the problem, *but I didn't understand why*.  My
mistake was assuming that, since this syntax didn't require specifying
the length in the brackets, neither did the one I used.
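
To illustrate the difference, here is a minimal sketch (the names
NELEMS, p_braces, p_fixed, p_literal and leading_punct are made up for
illustration and are not taken from fmtroff.c).  A brace-enclosed
initializer yields exactly the listed elements, while a string literal
initializer also gets a terminating L'\0', which wcsspn(3) relies on to
find the end of the set:

```c
#include <stddef.h>
#include <wchar.h>

/* Number of elements in a true array (not a pointer). */
#define NELEMS(a) (sizeof(a) / sizeof((a)[0]))

/* Brace list: exactly the four listed elements, no terminating L'\0'. */
const wchar_t p_braces[] = { L'.', L'?', L'!', L'\u2026' };

/* Same list with the terminator written out: five elements. */
const wchar_t p_fixed[] = { L'.', L'?', L'!', L'\u2026', L'\0' };

/* String literal: the compiler appends L'\0', so five elements too. */
const wchar_t p_literal[] = L".?!\u2026";

/*
 * Length of the leading run of punctuation in s.  Passing p_braces
 * instead of p_fixed would let wcsspn(3) read past the end of the array.
 */
size_t
leading_punct(const wchar_t *s)
{
	return wcsspn(s, p_fixed);
}
```

Comparing NELEMS(p_braces) with NELEMS(p_literal) shows the missing
element; either p_fixed or p_literal is safe to hand to wcsspn(3).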

By the way, the program where I experienced the failures is this:

  https://en.roquesor.com/Downloads/fmtroff.c

As you can see in the code, my intention was to define all the
characters in a legible, clear, and practical way but, after
encountering this problem, I seriously wondered if I'd made my life
complicated by writing it like this.

> 
> And finally, if you want wchar_t to store UTF-8 strings, you need
> something like
> 
>   #include <err.h>
>   #include <locale.h>
> 
>   if (setlocale(LC_CTYPE, "C.UTF-8") == NULL)
>       errx(1, "setlocale failed");
> 
> Otherwise, the C library functions operating on wide strings
> assume that wchar_t only stores ASCII character numbers.
> Even printf(3) %ls won't work for UTF-8 characters without
> setting the locale properly.

Yes, it was an oversight on my part not to include setlocale() in the
example.  By the way, if you take a look at fmtroff.c you'll see this
line:

   setlocale(LC_CTYPE, "");

My intention with fmtroff was to have it work only with UTF-8, so at
first I used the UTF-8 specification in setlocale() as in your example.
Later I decided to leave that field empty because, after testing under
Linux, I found that the program also does its job with other locales,
except that it doesn't take advantage of the hardcoded UTF-8
punctuation.
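
One way a program can tell whether the locale it ended up with uses
UTF-8, and so whether hardcoded UTF-8 punctuation makes sense, is to ask
nl_langinfo(3) for the codeset.  This is only a sketch:
select_ctype_locale() is a hypothetical helper, not something taken from
fmtroff.c, and it assumes the system spells the codeset name "UTF-8".

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Select the LC_CTYPE locale `name` ("" means: take it from the
 * environment) and report whether its character encoding is UTF-8.
 * Returns 1 for UTF-8, 0 for any other encoding, -1 if setlocale(3)
 * fails.
 */
int
select_ctype_locale(const char *name)
{
	if (setlocale(LC_CTYPE, name) == NULL)
		return -1;
	return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Calling select_ctype_locale("") once at startup would give a flag that
decides whether the non-ASCII punctuation is worth looking for.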

As usual with wide character functions, the problem comes when, under a
UTF-8 locale, you edit a file containing invalid UTF-8 sequences.  My
previous version of the program was written without wide-char functions
and, like fmt(1) from base, it doesn't have this problem.  Each version
has its pros and cons.  I use it as a more suitable version of fmt(1)
to edit my novels in Spanish with Groff.
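
A rough sketch of where the trouble shows up: mbrtowc(3) returns
(size_t)-1 when the next bytes are not a valid character in the current
locale, and (size_t)-2 when the buffer ends in the middle of one.  The
helpers count_invalid_bytes() and use_utf8_locale() are hypothetical
names, not code from fmtroff.c, and the locale name "C.UTF-8" is assumed
to exist as in Ingo's example.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Switch LC_CTYPE to C.UTF-8; returns nonzero on success. */
int
use_utf8_locale(void)
{
	return setlocale(LC_CTYPE, "C.UTF-8") != NULL;
}

/*
 * Count the bytes in buf[0..len) that mbrtowc(3) cannot decode in the
 * current LC_CTYPE locale.  After a failure the conversion state is
 * reset and decoding resumes at the next byte.
 */
size_t
count_invalid_bytes(const char *buf, size_t len)
{
	mbstate_t st;
	wchar_t wc;
	size_t bad = 0, n;

	memset(&st, 0, sizeof(st));
	while (len > 0) {
		n = mbrtowc(&wc, buf, len, &st);
		if (n == (size_t)-1 || n == (size_t)-2) {
			/* Invalid or truncated sequence: skip one byte. */
			memset(&st, 0, sizeof(st));
			bad++;
			n = 1;
		} else if (n == 0) {
			n = 1;	/* embedded NUL byte */
		}
		buf += n;
		len -= n;
	}
	return bad;
}
```

A formatter could use the same test to copy undecodable bytes through
unchanged instead of giving up on the whole line.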


> 
> Yours,
>   Ingo
> 

-- 
Walter
