Re: Unicode string literals

2020-04-30 Thread Paul Eggert
On 4/30/20 2:05 PM, Marc Nieper-Wißkirchen wrote:
>> Could we have a macro to be used only in source code encoded via UTF-8?
>> Presumably the older compilers would process the bytes of the string as if they
>> were individual 8-bit characters and would pass them through unchanged, so the
>> run-time string would be UTF-8 too.

> This would allow writing a macro that prefixes "u8" to strings in
> compilers supporting enough of C11, skipping the prefix in compilers
> that pass UTF-8 encoded bytes in strings unchanged

Yes, that was the idea.

> and signal an error
> in all other cases (hopefully only very exotic platforms), right?

I wasn't thinking of requiring a diagnostic of that case, at least not reliably.
Not sure it's worth worrying about.
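
A minimal sketch of the macro under discussion, assuming the source file is UTF-8 encoded. The names UTF8_LITERAL and utf8_byte_length are hypothetical and are not existing Gnulib identifiers. The fallback branch simply trusts the compiler to pass the bytes through and issues no diagnostic if it does not:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro; only meaningful in UTF-8 encoded source files.  */
#if defined __STDC_VERSION__ && 201112L <= __STDC_VERSION__
/* C11 or later: let the compiler produce UTF-8 via the u8 prefix.  */
# define UTF8_LITERAL(s) ((uint8_t const *) u8 ## s)
#else
/* Older compilers: assume the bytes pass through unchanged.  */
# define UTF8_LITERAL(s) ((uint8_t const *) (s))
#endif

/* Hypothetical helper: return the number of bytes before the
   terminating NUL.  */
static size_t
utf8_byte_length (uint8_t const *s)
{
  size_t n = 0;
  while (s[n])
    n++;
  return n;
}
```

In a UTF-8 source file, UTF8_LITERAL ("Wißkirchen") would then point at 11 bytes, since the character ß occupies two bytes in UTF-8.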



Re: xsize and flexmember

2020-04-30 Thread Paul Eggert
On 4/30/20 2:01 PM, Marc Nieper-Wißkirchen wrote:

>> #define XFLEXSIZEOF_XSIZE(type, member, n) \
>>   (((n) <= FLEXSIZEOF (type, member, n) \
>>     && FLEXSIZEOF (type, member, n) <= (size_t) -1) \
>>    ? (size_t) FLEXSIZEOF (type, member, n) : (size_t) -1)
> 
> Why do you write "(n) <= FLEXSIZEOF (type, member, n)" and not "n <
> FLEXSIZEOF (type, member, n)"? In case MEMBER is the first element of
> TYPE, this would not indicate an overflow, would it?

If n == FLEXSIZEOF (type, member, n) then overflow has not occurred, yes. And in
that case the macro should yield n. (Admittedly this case would be rare.)

> My idea was:
> 
> #define XFLEXSIZEOF_XSIZE(type, member, n) \
>   xflexsizeof_xsize_bound (FLEXSIZEOF (type, member, n), n)
> static _GL_INLINE size_t xflexsizeof_xsize_bound (uintmax_t m, size_t n)
> {
>   if (n < m && m <= (size_t) -1)
>     return m;
>   else
>     return (size_t) -1;
> }

This would require including stdint.h to get uintmax_t, which adds a dependency.
Also, xflexsizeof_xsize_bound shouldn't be a static function since extern inline
functions can't call static functions, though that should be easy to fix.
There's also the theoretical problem that INTMAX_MAX might be greater than
UINTMAX_MAX, but perhaps we needn't worry about that.

I can see going either way on this. As a macro, FLEXSIZEOF_XSIZE could insist
that its last argument be free of side effects, and that would be simpler on the
implementation. It's an annoying restriction, though.
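
For illustration, a sketch of the function-based variant that evaluates N only once. It is not the code proposed in this thread. It assumes that N already fits in size_t, so it does not address the case where n exceeds SIZE_MAX, and it ignores the alignment rounding that the real FLEXSIZEOF performs. All names beginning with sketch_ or SKETCH_ are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Return OFFSET + N, or SIZE_MAX if the sum does not fit in size_t.
   Simplification: no alignment rounding, unlike Gnulib's FLEXSIZEOF.  */
static inline size_t
sketch_flexsizeof_saturated (size_t offset, size_t n)
{
  return n <= SIZE_MAX - offset ? offset + n : SIZE_MAX;
}

/* Hypothetical macro: N is evaluated exactly once because it is
   passed as a function argument.  */
#define SKETCH_FLEXSIZEOF_XSIZE(type, member, n) \
  sketch_flexsizeof_saturated (offsetof (type, member), n)

/* Example type with a flexible array member.  */
struct sketch_buf
{
  size_t len;
  char data[];
};
```

Because the argument is converted to size_t at the call, a caller that passes a wider value would see it truncated silently. That is the limitation the macro-only version avoids.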

> maybe FLEXSIZEOF_XSIZE, which would at least drop the
> leading "x" as no error is signaled. :)

Yes, good point.




Re: Unicode string literals

2020-04-30 Thread Marc Nieper-Wißkirchen
On Thu, 30 Apr 2020 at 22:54, Paul Eggert wrote:
>
> On 4/30/20 6:08 AM, Bruno Haible wrote:
> > These not-so-new compilers don't perform
> > character set conversion; you have to provide the numeric value of each
> > byte yourself (either as escapes, or by enumerating the bytes of the
> > string one by one).
>
> Could we have a macro to be used only in source code encoded via UTF-8?
> Presumably the older compilers would process the bytes of the string as if they
> were individual 8-bit characters and would pass them through unchanged, so the
> run-time string would be UTF-8 too.

This would allow writing a macro that prefixes "u8" to strings in
compilers supporting enough of C11, skipping the prefix in compilers
that pass UTF-8 encoded bytes in strings unchanged and signal an error
in all other cases (hopefully only very exotic platforms), right?



Re: Unicode string literals

2020-04-30 Thread Paul Eggert
On 4/30/20 6:08 AM, Bruno Haible wrote:
> These not-so-new compilers don't perform
> character set conversion; you have to provide the numeric value of each
> byte yourself (either as escapes, or by enumerating the bytes of the
> string one by one).

Could we have a macro to be used only in source code encoded via UTF-8?
Presumably the older compilers would process the bytes of the string as if they
were individual 8-bit characters and would pass them through unchanged, so the
run-time string would be UTF-8 too.



Re: xsize and flexmember

2020-04-30 Thread Paul Eggert
On 4/29/20 11:39 PM, Marc Nieper-Wißkirchen wrote:

>> #define XFLEXSIZEOF_XSIZE(type, member, n) \
>>   (((n) <= FLEXSIZEOF (type, member, n) \
>>     && FLEXSIZEOF (type, member, n) <= (size_t) -1) \
>>    ? (size_t) FLEXSIZEOF (type, member, n) : (size_t) -1)
>>
>> A couple of problems with this approach:
>>
>>   * It evaluates N more than once.
> 
> Couldn't this be solved by calling a static function that would be
> subject to be inlined?

I don't offhand see how to get that to work if n exceeds SIZE_MAX.

> Why would you prefer the (longer) name XFLEXSIZEOF_XSIZE vs XFLEXSIZEOF?

It's specialized for size_t computations, and is not in general suitable for
ptrdiff_t or other types. Also, elsewhere in Gnulib a leading "x" means the
function signals an error if overflow occurs, and that's not what's happening
here. I realize we have dueling conventions here, but would prefer that
saturated size_t arithmetic have a longer prefix or suffix than just "x".



Re: pure and const function attributes

2020-04-30 Thread Marc Nieper-Wißkirchen
On Wed, 29 Apr 2020 at 18:05, Marc Nieper-Wißkirchen wrote:
>
> On Wed, 29 Apr 2020 at 18:01, Paul Eggert wrote:
>>
>> On 4/29/20 7:28 AM, Marc Nieper-Wißkirchen wrote:
>> > I am wondering whether it makes sense to add two new modules, named
>> > pure and const that define macros GL_PURE and GL_CONST, respectively
>>
>> There's already _GL_ATTRIBUTE_PURE and _GL_ATTRIBUTE_CONST. Presumably you just
>> want them exposed? (I confess that Emacs already uses the latter)
>
>
> That would be perfect!

P.S.: It would also be helpful so that warnings coming from
"-Wsuggest-attribute=pure" can be handled for GCC without
affecting other compilers.
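
A sketch of how such exposed macros are typically used. The SKETCH_ names and the two example functions are hypothetical and are not the Gnulib definitions. The macros expand to the GCC attributes when the compiler defines __GNUC__ and to nothing otherwise:

```c
#include <stddef.h>

#if defined __GNUC__
# define SKETCH_ATTRIBUTE_PURE __attribute__ ((__pure__))
# define SKETCH_ATTRIBUTE_CONST __attribute__ ((__const__))
#else
# define SKETCH_ATTRIBUTE_PURE
# define SKETCH_ATTRIBUTE_CONST
#endif

/* Pure: reads memory through S but has no side effects.  */
static size_t sketch_count_spaces (char const *s) SKETCH_ATTRIBUTE_PURE;

/* Const: the result depends only on the argument value.  */
static int sketch_square (int x) SKETCH_ATTRIBUTE_CONST;

static size_t
sketch_count_spaces (char const *s)
{
  size_t count = 0;
  for (; *s; s++)
    if (*s == ' ')
      count++;
  return count;
}

static int
sketch_square (int x)
{
  return x * x;
}
```

Annotating a function this way is what silences the corresponding "-Wsuggest-attribute=pure" or "-Wsuggest-attribute=const" suggestion, while other compilers see an ordinary declaration.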



Re: Unicode string literals

2020-04-30 Thread Bruno Haible
Hi Marc,

> I was hoping that compilers not supporting enough of C11
> would have some other way to translate from the source file encoding
> to UTF-8, which could be exploited by Gnulib.

No, that's not the case. These not-so-new compilers don't perform
character set conversion; you have to provide the numeric value of each
byte yourself (either as escapes, or by enumerating the bytes of the
string one by one).
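
To illustrate the two options, here is a small sketch that spells out the UTF-8 bytes of "Wißkirchen" both ways. The identifiers are made up for this example, and the first array assumes an execution character set compatible with ASCII:

```c
#include <stdint.h>
#include <string.h>

/* "Wißkirchen" in UTF-8, with the two bytes of U+00DF written as
   hex escapes.  Assumes an ASCII-compatible execution character set
   for the remaining letters.  */
static uint8_t const name_escaped[] = "Wi\xC3\x9Fkirchen";

/* The same string with every byte enumerated numerically.  */
static uint8_t const name_bytes[] =
  { 0x57, 0x69, 0xC3, 0x9F, 0x6B, 0x69, 0x72, 0x63, 0x68, 0x65, 0x6E, 0x00 };

/* Return nonzero if both spellings produce identical bytes.  */
static int
name_spellings_agree (void)
{
  return (sizeof name_escaped == sizeof name_bytes
          && memcmp (name_escaped, name_bytes, sizeof name_bytes) == 0);
}
```

One caveat with hex escapes: they absorb every following hex digit, so a string in which the escaped byte is followed by a letter such as 'e' or 'a' has to be split into adjacent literals.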

> > Your best bet is
> >   1) For UTF-8 encoded strings, ensure that your source code is UTF-8
> >      encoded, or use escapes, like in gnulib/tests/uniwidth/test-u8-width.c.
> 
> Using escapes for non-ASCII characters, it will work whenever the
> execution character set of the compiler is compatible with ASCII,
> right?

The only system where the execution character set is not compatible with
ASCII is z/OS. Daniel Richard G. is our expert regarding this platform.
My understanding is that
  - there are some facilities in the compiler, but we cannot make use of
    them in gnulib,
  - there are some facilities in the run-time library, and Daniel knows
    how to make use of them with gnulib,
  - overall it's case-by-case coding; there's no simple magic wand for it.

> > > for pre-C2x systems would be nice so that ASCII("c") expands into the
> > > ASCII code point of the character `c'.
> >
> > What's the point of this one? Why not just write 'c'?
> 
> I was thinking of a system whose execution character set is not
> compatible with ASCII.

You can have a statically allocated translation table from EBCDIC to ASCII
and write a macro that expands to ebcdic_to_ascii['c']. But that will not
be a constant expression. So, e.g. you cannot use this in a 'switch' statement.
And you cannot build a getopt option string from it either. And so on.
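
A sketch of the limitation, using a three-entry excerpt of such a table; a real table would fill all 256 entries. The function name is hypothetical. So that the example can be exercised on an ASCII host, the EBCDIC codes are written numerically rather than as character constants:

```c
/* Excerpt of an EBCDIC-to-ASCII translation table (illustration only;
   unlisted entries are zero).  */
static unsigned char const ebcdic_to_ascii[256] =
  {
    [0x81] = 0x61, /* a */
    [0x82] = 0x62, /* b */
    [0x83] = 0x63, /* c */
  };

/* Return nonzero if the EBCDIC code E denotes the letter c.
   The lookup happens at run time.  Writing
     case ebcdic_to_ascii[0x83]:
   in a switch is not valid C, because an array element is not an
   integer constant expression, so an if comparison is used instead.  */
static int
sketch_is_letter_c (unsigned char e)
{
  return ebcdic_to_ascii[e] == 0x63;
}
```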

Bruno




Re: Unicode string literals

2020-04-30 Thread Marc Nieper-Wißkirchen
Hi Bruno,

thank you very much for your reply.

On Thu, 30 Apr 2020 at 12:06, Bruno Haible wrote:

[...]

> Unfortunately, we cannot provide such macros. The reason is that the
> translation from the source file's encoding to UTF-8/UTF-16/UTF-32 must
> be done by the compiler, if you want to be able to write
>   static uint8_t my_string[] = u8"Wißkirchen";

For a compiler that supports the "u8" prefix, which is defined by C11,
the compiler should do the translation from the source file encoding
to UTF-8.  I was hoping that compilers not supporting enough of C11
would have some other way to translate from the source file encoding
to UTF-8, which could be exploited by Gnulib.

> Your best bet is
>   1) For UTF-8 encoded strings, ensure that your source code is UTF-8
>      encoded, or use escapes, like in gnulib/tests/uniwidth/test-u8-width.c.

Using escapes for non-ASCII characters, it will work whenever the
execution character set of the compiler is compatible with ASCII,
right?

>   2) For UTF-16 encoded strings, which you'll need only on Windows,
>      write L"Wißkirchen". Or use hex codes, like in
>      gnulib/tests/uniwidth/test-u16-width.c.
>   3) Don't use UTF-32 encoded strings. Or use hex codes, like in
>      gnulib/tests/uniwidth/test-u32-width.c.

These two are less important for me; I mentioned them to have a full
set of macros.

>
> > Similarly, something like
> >
> > #define ASCII(s) (u8 ## s [0])
> >
> > for pre-C2x systems would be nice so that ASCII("c") expands into the
> > ASCII code point of the character `c'.
>
> What's the point of this one? Why not just write 'c'?

I was thinking of a system whose execution character set is not
compatible with ASCII. Or are those excluded in general by Gnulib?

Thanks again,

Marc



Re: Unicode string literals

2020-04-30 Thread Bruno Haible
Hi Marc,

Marc Nieper-Wißkirchen wrote:
> On a system that supports at least C11, I can create a UTF-8-encoded
> literal string through:
> 
> (uint8_t const *) u8"..."
> 
> Could Gnulib abstract this into a macro so that substitutes for
> systems that do not have u8 string literals can be provided?
> 
> On a C11 system, we would have
> 
> #define UTF8(s) ((uint8_t const *) u8 ## s)
> 
> and similar definitions for UTF16 and UTF32.

Unfortunately, we cannot provide such macros. The reason is that the
translation from the source file's encoding to UTF-8/UTF-16/UTF-32 must
be done by the compiler, if you want to be able to write
  static uint8_t my_string[] = u8"Wißkirchen";

Your best bet is
  1) For UTF-8 encoded strings, ensure that your source code is UTF-8
     encoded, or use escapes, like in gnulib/tests/uniwidth/test-u8-width.c.
  2) For UTF-16 encoded strings, which you'll need only on Windows,
     write L"Wißkirchen". Or use hex codes, like in
     gnulib/tests/uniwidth/test-u16-width.c.
  3) Don't use UTF-32 encoded strings. Or use hex codes, like in
     gnulib/tests/uniwidth/test-u32-width.c.
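
As a rough illustration of the "hex codes" option in 2) and 3), the short string "Wiß" can be written as arrays of code units. This is a sketch only; the identifiers are invented for the example and are not taken from the test files named above:

```c
#include <stddef.h>
#include <stdint.h>

/* "Wiß" as NUL-terminated UTF-16 code units.  */
static uint16_t const wiss_u16[] = { 0x0057, 0x0069, 0x00DF, 0 };

/* "Wiß" as NUL-terminated UTF-32 code units.  */
static uint32_t const wiss_u32[] = { 0x00000057, 0x00000069, 0x000000DF, 0 };

/* Hypothetical helper: count UTF-16 units before the terminator.  */
static size_t
sketch_u16_units (uint16_t const *s)
{
  size_t n = 0;
  while (s[n])
    n++;
  return n;
}

/* Hypothetical helper: count UTF-32 units before the terminator.  */
static size_t
sketch_u32_units (uint32_t const *s)
{
  size_t n = 0;
  while (s[n])
    n++;
  return n;
}
```

Since every unit is given numerically, the result does not depend on the source file encoding or on the compiler's execution character set.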

> Similarly, something like
> 
> #define ASCII(s) (u8 ## s [0])
> 
> for pre-C2x systems would be nice so that ASCII("c") expands into the
> ASCII code point of the character `c'.

What's the point of this one? Why not just write 'c'?

Bruno




Unicode string literals

2020-04-30 Thread Marc Nieper-Wißkirchen
On a system that supports at least C11, I can create a UTF-8-encoded
literal string through:

(uint8_t const *) u8"..."

Could Gnulib abstract this into a macro so that substitutes for
systems that do not have u8 string literals can be provided?

On a C11 system, we would have

#define UTF8(s) ((uint8_t const *) u8 ## s)

and similar definitions for UTF16 and UTF32.

Similarly, something like

#define ASCII(s) (u8 ## s [0])

for pre-C2x systems would be nice so that ASCII("c") expands into the
ASCII code point of the character `c'.



Re: xsize and flexmember

2020-04-30 Thread Marc Nieper-Wißkirchen
Thank you very much for your quick response!

On Thu, 30 Apr 2020 at 00:39, Paul Eggert wrote:
>
> On 4/29/20 12:29 PM, Marc Nieper-Wißkirchen wrote:
> > It would be great if the flexmember exported another macro, say
> > XFLEXSIZEOF, which returned SIZE_MAX in case of arithmetic overflow.
>
> Something like this?
>
> /* Like FLEXSIZEOF, except yield SIZE_MAX on arithmetic overflow,
>    and N might be evaluated more than once.  */
>
> #define XFLEXSIZEOF_XSIZE(type, member, n) \
>   (((n) <= FLEXSIZEOF (type, member, n) \
>     && FLEXSIZEOF (type, member, n) <= (size_t) -1) \
>    ? (size_t) FLEXSIZEOF (type, member, n) : (size_t) -1)
>
> A couple of problems with this approach:
>
>   * It evaluates N more than once.

Couldn't this be solved by calling a static function that would be
subject to be inlined?

>
>   * If the FLEXSIZEOF call appears in a ptrdiff_t context it might not
>     return the right value. ptrdiff_t is also a popular way
>     to compute sizes.

Maybe a warning in the comment above the macro's definition would be enough.

>
> But perhaps it's good enough.

Why would you prefer the (longer) name XFLEXSIZEOF_XSIZE vs XFLEXSIZEOF?