Re: Unicode string literals

2020-05-01 Thread Daniel Richard G.
Hi everyone, I've been watching this discussion.

On Fri, 2020 May  1 18:52-04:00, Bruno Haible wrote:
> 
> Yes, this is unlikely. In a world where people routinely do a "git pull" from
> upstream repositories and send patches or pull requests upstream, every
> automated downstream manipulation of the source code - even as small as
> transforming CR/LF to LF - becomes a PITA.

Much agreed.

For what it's worth, I'll mention the following points:

* XLC on z/OS does not appear to support u8"..." strings, either in my
  tests or in the documentation I've searched. The most I can confirm is
  support for u"..." (UTF-16) and U"..." (UTF-32) literals.

* When source code is brought into a z/OS system for compilation, it is
  typically blanket-converted to e.g. IBM-1047 (which maps one-to-one to
  Latin-1) as the first step. Same for scripts and other files
  (binary blobs become a headache, yes). It is possible to coerce XLC to
  compile C source in ASCII encoding, but this never happens in
  practice, because the shell/make interpreters will choke on ASCII
  input well before that point.

* UTF-8 characters in a source file make for an awkward situation anyway,
  because the z/OS user environment itself does not support multibyte
  encodings. The typical (EBCDIC) encodings used are all single-byte.
  UTF-EBCDIC exists but it is not a thing on z/OS.

* The general assumption is that programs running on z/OS may process
  UTF-8 data (multibyte functions are provided, iconv knows about UTF-8,
  etc.), but their interaction with the user environment is entirely
  through a single-byte encoding.

* Obviously, the set of users who interact with a mainframe directly
  through a Unix shell is very small, which is why encoding support in
  the z/OS user environment feels like a throwback to 1999.

* I'm not aware of many cases where string-literal encodings have been
  an issue on z/OS; the immediate example that comes to mind is
  gnulib/tests/test-iconv-utf.c, which requires its test strings to be
  ASCII-encoded. You can see the use of XLC's "#pragma convert()" there
  (a rough sketch of that pragma follows at the end of this list). But
  routine scenarios, like getopt() option letters, don't need anything
  special to work as intended.

* If there are any tricky encoding-related issues you are trying to
  solve, I'm of course happy to try out proposed solutions :-)
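
Regarding the #pragma convert() point above: its use looks roughly like the
sketch below. This is only illustrative - the identifiers and the exact
code-page name are mine, not copied from the gnulib test file.

#ifdef __MVS__                     /* z/OS XLC */
# pragma convert("ISO8859-1")      /* keep the literals below ASCII-encoded */
#endif
static const char ascii_text[] = "ABC";   /* bytes 0x41 0x42 0x43, not EBCDIC */
#ifdef __MVS__
# pragma convert(pop)              /* back to the default (EBCDIC) code page */
#endif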


--Daniel


-- 
Daniel Richard G. || sk...@iskunk.org
My ASCII-art .sig got a bad case of Times New Roman.



Re: Unicode string literals

2020-05-01 Thread Bruno Haible
Paul Eggert wrote:
> I was thinking about the case where one develops and normally builds on
> systems that assume UTF-8 source code (perhaps because a build system is old
> and just compiles the bytes unchecked), but that on occasion a builder might
> translate all the source code to (say) EUC-JP for whatever reason, and then
> compile on a newer platform that supports the u8 prefix.
> 
> Admittedly the scenario is unlikely.

Yes, this is unlikely. In a world where people routinely do a "git pull" from
upstream repositories and send patches or pull requests upstream, every
automated downstream manipulation of the source code - even as small as
transforming CR/LF to LF - becomes a PITA.

Bruno




Re: Unicode string literals

2020-05-01 Thread Paul Eggert
On 5/1/20 2:01 AM, Bruno Haible wrote:

> Did you mean (1) that the programmer shall define a macro, that indicates that
> their source code is UTF-8 encoded?
> 
> Or did you mean (2) that gnulib shall define a macro, that shall _assume_ that
> the source code is UTF-8 encoded, and then expand to u8"xyz" instead of "xyz"?

Yes, I meant (2).

> For (2): what's the point? Once you assume that the source code is UTF-8
> encoded, ISO C11 section 6.4.5 says that u8"xyz" and "xyz" are the same:
> literals of type 'char *'.

I was thinking about the case where one develops and normally builds on systems
that assume UTF-8 source code (perhaps because a build system is old and just
compiles the bytes unchecked), but that on occasion a builder might translate
all the source code to (say) EUC-JP for whatever reason, and then compile on a
newer platform that supports the u8 prefix.

Admittedly the scenario is unlikely. I suppose we should wait until a real need
arises before worrying about it.

This all reminds me of trigraphs somehow. What a pain that was, and still
is.



Re: Unicode string literals

2020-05-01 Thread Bruno Haible
Hi Paul,

> >> Could we have a macro to be used only in source code encoded via UTF-8?
> >> Presumably the older compilers would process the bytes of the string as
> >> if they were individual 8-bit characters and would pass them through
> >> unchanged, so the run-time string would be UTF-8 too.
> 
> > This would allow writing a macro that prefixes "u8" to strings in
> > compilers supporting enough of C11, skipping the prefix in compilers
> > that pass UTF-8 encoded bytes in strings unchanged
> 
> Yes, that was the idea.

Did you mean (1) that the programmer shall define a macro, that indicates that
their source code is UTF-8 encoded?

Or did you mean (2) that gnulib shall define a macro, that shall _assume_ that
the source code is UTF-8 encoded, and then expand to u8"xyz" instead of "xyz"?

Recall that the programmer does not usually tell GCC through command-line
options what the source encoding is. GCC has the options -finput-charset and
-fexec-charset [1], but I have never seen them being used.

Also, UTF-8 is the de-facto standard now: 99% of web pages are in UTF-8,
and likely more than 95% of source code as well.

And on z/OS, users are not using GCC but the vendor compiler, which - as I
said - does not have support that could reasonably be used.

For (1) to work, this macro would need to be defined in each source file,
after the #include statements - since the included header files, possibly
from other packages, can be in a different source encoding. Few programmers
will want to do this.

For (2): what's the point? Once you assume that the source code is UTF-8
encoded, ISO C11 section 6.4.5 says that u8"xyz" and "xyz" are the same:
literals of type 'char *'.
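
Spelled out, (2) would be something like this (hypothetical macro name;
gnulib does not provide it):

#if defined __STDC_VERSION__ && __STDC_VERSION__ >= 201112L
# define UTF8_LIT(s) u8 ## s
#else
# define UTF8_LIT(s) s     /* the raw source bytes, assumed to be UTF-8 */
#endif

If the source really is UTF-8, both branches yield byte-identical string
contents - hence no benefit.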

Bruno

[1] https://gcc.gnu.org/onlinedocs/gcc-9.3.0/gcc/Preprocessor-Options.html




Re: Unicode string literals

2020-04-30 Thread Paul Eggert
On 4/30/20 2:05 PM, Marc Nieper-Wißkirchen wrote:
>> Could we have a macro to be used only in source code encoded via UTF-8?
>> Presumably the older compilers would process the bytes of the string as
>> if they were individual 8-bit characters and would pass them through
>> unchanged, so the run-time string would be UTF-8 too.

> This would allow writing a macro that prefixes "u8" to strings in
> compilers supporting enough of C11, skipping the prefix in compilers
> that pass UTF-8 encoded bytes in strings unchanged

Yes, that was the idea.

> and signalling an error
> in all other cases (hopefully only very exotic platforms), right?

I wasn't thinking of requiring a diagnostic of that case, at least not reliably.
Not sure it's worth worrying about.



Re: Unicode string literals

2020-04-30 Thread Marc Nieper-Wißkirchen
On Thu, 30 Apr 2020 at 22:54, Paul Eggert wrote:
>
> On 4/30/20 6:08 AM, Bruno Haible wrote:
> > These not-so-new compilers don't perform
> > character set conversion; you have to provide the numeric value of each
> > byte yourself (either as escapes, or by enumerating the bytes of the
> > string one by one).
>
> Could we have a macro to be used only in source code encoded via UTF-8?
> Presumably the older compilers would process the bytes of the string as if
> they were individual 8-bit characters and would pass them through unchanged,
> so the run-time string would be UTF-8 too.

This would allow writing a macro that prefixes "u8" to strings in
compilers supporting enough of C11, skipping the prefix in compilers
that pass UTF-8 encoded bytes in strings unchanged, and signalling an error
in all other cases (hopefully only very exotic platforms), right?



Re: Unicode string literals

2020-04-30 Thread Paul Eggert
On 4/30/20 6:08 AM, Bruno Haible wrote:
> These not-so-new compilers don't perform
> character set conversion; you have to provide the numeric value of each
> byte yourself (either as escapes, or by enumerating the bytes of the
> string one by one).

Could we have a macro to be used only in source code encoded via UTF-8?
Presumably the older compilers would process the bytes of the string as if they
were individual 8-bit characters and would pass them through unchanged, so the
run-time string would be UTF-8 too.



Re: Unicode string literals

2020-04-30 Thread Bruno Haible
Hi Marc,

> I was hoping that compilers not supporting enough of C11
> would have some other way to translate from the source file encoding
> to UTF-8, which could be exploited by Gnulib.

No, that's not the case. These not-so-new compilers don't perform
character set conversion; you have to provide the numeric value of each
byte yourself (either as escapes, or by enumerating the bytes of the
string one by one).
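
Concretely, both forms look like this (illustrative names; ß is the UTF-8
byte pair 0xC3 0x9F, and the ASCII characters are assumed to pass through
unchanged):

/* Escapes for the non-ASCII bytes: */
static const char name1[] = "Wi\xC3\x9Fkirchen";
/* Or enumerating every byte: */
static const unsigned char name2[] =
  { 'W', 'i', 0xC3, 0x9F, 'k', 'i', 'r', 'c', 'h', 'e', 'n', 0 };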

> > Your best bet is
> >   1) For UTF-8 encoded strings, ensure that your source code is UTF-8
> >  encoded, or use escapes, like in gnulib/tests/uniwidth/test-u8-width.c.
> 
> Using escapes for non-ASCII characters will work whenever the
> execution character set of the compiler is compatible with ASCII,
> right?

The only system where the execution character set is not compatible with
ASCII is z/OS. Daniel Richard G. is our expert regarding this platform.
My understanding is that
  - there are some facilities in the compiler, but we cannot make use of
    them in gnulib,
  - there are some facilities in the run-time library, and Daniel knows
    how to make use of them with gnulib,
  - overall it's case-by-case coding; there's no simple magic wand for it.

> > > for pre-C2x systems would be nice so that ASCII("c") expands into the
> > > ASCII code point of the character `c'.
> >
> > What's the point of this one? Why not just write 'c'?
> 
> I was thinking of a system whose execution character set is not
> compatible with ASCII.

You can have a statically allocated translation table from EBCDIC to ASCII
and write a macro that expands to ebcdic_to_ascii['c']. But that will not
be a constant expression. So, e.g. you cannot use this in a 'switch' statement.
And you cannot build a getopt option string from it either. And so on.
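
A minimal sketch of that idea (only one table entry shown; a real table
needs all 256 entries, and the macro name is made up):

/* EBCDIC 1047 'c' is 0x83; ASCII 'c' is 0x63. */
static const unsigned char ebcdic_to_ascii[256] = {
  [0x83] = 0x63,
  /* ... the other entries ... */
};
#define TO_ASCII(c) (ebcdic_to_ascii[(unsigned char) (c)])

/* OK at run time:   if (opt == TO_ASCII('c')) ...
   Not OK, because it is not a constant expression:
                     switch (opt) { case TO_ASCII('c'): ... }  */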

Bruno




Re: Unicode string literals

2020-04-30 Thread Marc Nieper-Wißkirchen
Hi Bruno,

thank you very much for your reply.

On Thu, 30 Apr 2020 at 12:06, Bruno Haible wrote:

[...]

> Unfortunately, we cannot provide such macros. The reason is that the
> translation from the source file's encoding to UTF-8/UTF-16/UTF-32 must
> be done by the compiler, if you want to be able to write
>   static uint8_t my_string[] = u8"Wißkirchen";

A compiler that supports the "u8" prefix, which is defined by C11,
should itself do the translation from the source file encoding to
UTF-8.  I was hoping that compilers not supporting enough of C11
would have some other way to translate from the source file encoding
to UTF-8, which could be exploited by Gnulib.
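
As a minimal illustration (not from this thread): with a C11 compiler that
knows the source encoding, the u8 prefix yields UTF-8 bytes no matter what
that encoding is:

static const char s[] = u8"ß";   /* array contents: 0xC3, 0x9F, 0x00 */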

> Your best bet is
>   1) For UTF-8 encoded strings, ensure that your source code is UTF-8
>  encoded, or use escapes, like in gnulib/tests/uniwidth/test-u8-width.c.

Using escapes for non-ASCII characters will work whenever the
execution character set of the compiler is compatible with ASCII,
right?

>   2) For UTF-16 encoded strings, which you'll need only on Windows,
>  write L"Wißkirchen". Or use hex codes, like in
>  gnulib/tests/uniwidth/test-u16-width.c.
>   3) Don't use UTF-32 encoded strings. Or use hex codes, like in
>  gnulib/tests/uniwidth/test-u32-width.c.

These two are less important for me; I mentioned them to have a full
set of macros.

>
> > Similarly, something like
> >
> > #define ASCII(s) (u8 ## s [0])
> >
> > for pre-C2x systems would be nice so that ASCII("c") expands into the
> > ASCII code point of the character `c'.
>
> What's the point of this one? Why not just write 'c'?

I was thinking of a system whose execution character set is not
compatible with ASCII. Or are those excluded in general by Gnulib?

Thanks again,

Marc



Re: Unicode string literals

2020-04-30 Thread Bruno Haible
Hi Marc,

Marc Nieper-Wißkirchen wrote:
> On a system that supports at least C11, I can create a UTF-8-encoded
> literal string through:
> 
> (uint8_t const *) u8"..."
> 
> Could Gnulib abstract this into a macro so that substitutes for
> systems that do not have u8 string literals can be provided?
> 
> On a C11 system, we would have
> 
> #define UTF8(s) ((uint8_t const *) u8 ## s)
> 
> and similar definitions for UTF16 and UTF32.

Unfortunately, we cannot provide such macros. The reason is that the
translation from the source file's encoding to UTF-8/UTF-16/UTF-32 must
be done by the compiler, if you want to be able to write
  static uint8_t my_string[] = u8"Wißkirchen";

Your best bet is
  1) For UTF-8 encoded strings, ensure that your source code is UTF-8
     encoded, or use escapes, like in gnulib/tests/uniwidth/test-u8-width.c.
  2) For UTF-16 encoded strings, which you'll need only on Windows,
     write L"Wißkirchen". Or use hex codes, like in
     gnulib/tests/uniwidth/test-u16-width.c.
  3) Don't use UTF-32 encoded strings. Or use hex codes, like in
     gnulib/tests/uniwidth/test-u32-width.c.
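
To make 2) concrete, a sketch (identifiers are mine; on Windows, wchar_t is
a 16-bit UTF-16 code unit, and ß is U+00DF):

#include <wchar.h>
static const wchar_t name_a[] = L"Wißkirchen";
static const wchar_t name_b[] = L"Wi\x00DFkirchen";   /* same contents */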

> Similarly, something like
> 
> #define ASCII(s) (u8 ## s [0])
> 
> for pre-C2x systems would be nice so that ASCII("c") expands into the
> ASCII code point of the character `c'.

What's the point of this one? Why not just write 'c'?

Bruno




Unicode string literals

2020-04-30 Thread Marc Nieper-Wißkirchen
On a system that supports at least C11, I can create a UTF-8-encoded
literal string through:

(uint8_t const *) u8"..."

Could Gnulib abstract this into a macro so that substitutes for
systems that do not have u8 string literals can be provided?

On a C11 system, we would have

#define UTF8(s) ((uint8_t const *) u8 ## s)

and similar definitions for UTF16 and UTF32.
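
i.e. something like (using the u and U prefixes that C11 defines for
char16_t and char32_t literals):

#define UTF16(s) ((uint16_t const *) u ## s)
#define UTF32(s) ((uint32_t const *) U ## s)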

Similarly, something like

#define ASCII(s) (u8 ## s [0])

for pre-C2x systems would be nice so that ASCII("c") expands into the
ASCII code point of the character `c'.