Re: Bug in clang?

Walter Alejandro Iglesias Wed, 20 Aug 2025 09:15:42 -0700

On Wed, Aug 20, 2025 at 04:33:47PM +0200, Ingo Schwarze wrote:
> Hello Walter,
> 
> Walter Alejandro Iglesias wrote on Wed, Aug 20, 2025 at 09:18:52AM +0200:
> > On Tue, Aug 19, 2025 at 05:39:13PM +0200, Ingo Schwarze wrote:
> >> Walter Alejandro Iglesias wrote on Mon, Aug 18, 2025 at 06:40:04PM +0200:
> 
> >>> #define period    0x2e
> >>> #define question  0x3f
> >>> #define exclam    0x21
> >>> #define ellipsis  L'\u2026'
> >>> const wchar_t p[] = { period, question, exclam, ellipsis };
> 
> >> In addition to what otto@ said, this is bad style for more than one
> >> reason.
> >> 
> >> First of all, the data type of the constant "0x2e" is "int",
> >> see for example C11 6.4.4.1 (Integer constants).  Casting "int"
> >> to "wchar_t" doesn't really make sense.  On OpenBSD, it only
> >> works because UTF-8 is the only supported character encoding *and*
> >> wchar_t stores Unicode codepoints.  But neither of these choices
> >> is portable.  What you want is (C11 6.4.4.4 Character constants):
> >> 
> >>   #define period   L'.'
> >>   #define question L'?'
> >>   #define exclam   L'!'
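
A minimal sketch of the array from the original message rewritten with
wide character constants, assuming a C99 or later compiler.  The names
sentence_end and is_sentence_end() are made up for illustration and are
not taken from fmtroff.c.

```c
#include <stdbool.h>
#include <stddef.h>	/* wchar_t, size_t */

/* Sentence-ending characters, written as wide character constants. */
static const wchar_t sentence_end[] = { L'.', L'?', L'!', L'\u2026' };

/* Hypothetical helper: true if c is one of the characters above. */
bool
is_sentence_end(wchar_t c)
{
	size_t i;

	for (i = 0; i < sizeof(sentence_end) / sizeof(sentence_end[0]); i++)
		if (sentence_end[i] == c)
			return true;
	return false;
}
```

Written this way, the array does not depend on wchar_t values
happening to match ASCII numbers for the first three entries.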
> 
> > As I made this change to my code (https://en.roquesor.com/fmtroff.html)
> > the following reminded me why, at some point, I decided to switch to
> > hexadecimal notation.
> > 
> >   #define backslash  L'\\'
> >   #define apostrophe L'\''
> > 
> > It isn't very confusing there, but among the arguments of a function or
> > a conditional...
> 
> Making code look nice is nice to have and can even make code more
> readable and hence reduce the likelihood of bugs.  But even if you
> are coding with narrow strings for ASCII only, whether
> 
>   char mychar = 0x5c;
>   char mychar = 92;
>   char mychar = 0134;
> 
> is more readable than 
> 
>   char mychar = '\\';
> 
> is debatable; at least i would find reading the latter easier than
> the former, even in a conditional or function call argument.
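
For comparison, a small sketch of both spellings inside a conditional.
The function names are hypothetical, and the numeric variant is only
equivalent on implementations where these wide character values match
ASCII (as they do on OpenBSD).

```c
#include <stdbool.h>
#include <stddef.h>	/* wchar_t */

/* Character constants: the intent is visible where the test is made. */
bool
is_quote_or_escape(wchar_t c)
{
	return c == L'\\' || c == L'\'';
}

/*
 * Same test with numeric constants.  The reader has to know that
 * 0x5c is a backslash and 0x27 an apostrophe, and the result is
 * only correct if wchar_t uses ASCII-compatible values.
 */
bool
is_quote_or_escape_numeric(wchar_t c)
{
	return c == 0x5c || c == 0x27;
}
```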


If I didn't dislike having UTF-8 characters in the code (I use vi(1)
from base to code), I would write the characters themselves directly,
both narrow and wide.  That's undoubtedly the most human-readable
option. :-)


> 
> For narrow characters, the portability argument is weak; writing
> code that is portable to EBCDIC machines is the kind of excessive
> portability that provokes bugs rather than prevents them.  But still,
> i'd recommend against specifying narrow characters numerically.
> Even mandoc_char(7) says:
> 
>   NUMBERED CHARACTERS
>      For backward compatibility with existing manuals, mandoc(1)
>      also supports the
>            \N'number' and \[charnumber]
>      escape sequences, inserting the character number from the
>      current character set into the output.  Of course, this is
>      inherently non-portable and is already marked as deprecated
>      in the Heirloom roff manual; on top of that, the second form
>      is a GNU extension.  For example, do not use \N'34' or
>      \[char34], use \(dq, or even the plain `"' character where
>      possible.

In my Groff files, for Spanish, except for a definition I added to my
macros for the UTF-8 ellipsis (out of the reach of preconv(1)), I write
all UTF-8 characters as is.

> 
> A similar recommendation makes sense for C code.
> 
> What *is* portable is specifying wide characters by Unicode
> codepoint numbers, for example:
> 
>   wchar_t mywide = L'\u2026';  /* horizontal ellipsis */
> 
> But note that the C standard (C11 6.4.3 paragraph 2, Universal
> character names) explicitly requires the argument to \u to be at
> least 00A0, with only three exceptions:
> 
>   L'\u0024' == L'$'
>   L'\u0040' == L'@'
>   L'\u0060' == L'`'
> 
> Being so specific is a weird quirk of the standard, but it means
> you had better not abuse \u to obfuscate ASCII codepoints -
> apart from being very ugly, it may not even work.  For example,
> current base clang dies like this:
> 
>   error: character 'A' cannot be specified by a universal character name
>     13 |         wchar_t mywide = L'\u0041';
>   1 error generated.
> 
> So there is no real alternative to L'\\'.  While L'\x5c' and L'\134'
> work for UTF-8 (and hence on OpenBSD), even that is not guaranteed
> to be portable, and what those two produce may depend both on the
> implementation and on the locale.
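
A short sketch of the rule described above, assuming a C11 compiler.
The variable names are illustrative only.  The static_assert lines are
guarded by __STDC_ISO_10646__ because the equalities are only
guaranteed where wchar_t holds Unicode code points.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* wchar_t */

/* U+2026 is above U+00A0, so a universal character name is allowed. */
const wchar_t wc_ellipsis = L'\u2026';

/* The three ASCII code points that C11 permits after \u. */
const wchar_t wc_dollar   = L'\u0024';
const wchar_t wc_at       = L'\u0040';
const wchar_t wc_backtick = L'\u0060';

/*
 * Any other ASCII character needs a plain or escaped constant;
 * L'\u005c' would be rejected like L'\u0041' in the clang error
 * quoted above.
 */
const wchar_t wc_backslash = L'\\';

#ifdef __STDC_ISO_10646__
/* Checked only where wchar_t values are Unicode code points. */
static_assert(L'\u0024' == L'$', "U+0024 is the dollar sign");
static_assert(L'\u2026' == 0x2026, "U+2026 is the horizontal ellipsis");
#endif
```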

I already changed all my ASCII character definitions to the notation you
advise and left the UTF-8 ones with the L'\u????' code:

  https://en.roquesor.com/Downloads/fmtroff.c

Here I mention your help:

  https://en.roquesor.com/fmtroff.html


Live and learn. :-)


> 
> Yours,
>   Ingo
> 

-- 
Walter
