Re: UTF16 and GCC

Bruno Haible Wed, 11 Jul 2001 09:30:02 -0700

Joseph S. Myers writes:

> since there is no documentation I can't tell what the patch is
> supposed to do.

Their documentation is in the same directory but in a proprietary
format; you will find it appended below.

> Systems for string literals in specified character sets have been
> discussed on the WG14 reflector, but AFAICT without any working papers yet
> even in the WG14 document register

At least their patch is a step toward multilingual strings that are
written portably and remain legible.

The L"wide string literal" syntax suffers from non-portability across
systems and across locales, because ISO C fails to mandate that
wchar_t is 32-bit ISO 10646.
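The width problem is easy to observe on any given system. A minimal
sketch, with helper names made up for illustration, that reports how wide
wchar_t is and whether one wchar_t can hold every code point up to
0x10FFFF:

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: number of bits in wchar_t on this platform. */
unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: returns 1 if a single wchar_t can hold every
   code point up to 0x10FFFF, 0 otherwise. */
int wchar_holds_all_code_points(void)
{
    return WCHAR_MAX >= 0x10FFFF;
}
```

The results differ between platforms, which is exactly why the same
L"..." literal does not mean the same thing everywhere.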

But the 'u' prefix would be better used for UTF-8 string literals, not
UTF-16 string literals. So I'm proposing the following syntax:

                    u"UTF-8 string literal"

This way no extra 16-bit string functions are needed - the 8-bit str*
functions in libc will do.
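To illustrate the point about the str* functions, here is a small sketch.
The string is spelled as explicit UTF-8 bytes so that it does not depend
on any literal prefix; the helper utf8_code_points is hypothetical and
assumes well-formed input.

```c
#include <stddef.h>
#include <string.h>

/* "Grüße" as explicit UTF-8 bytes: ü is C3 BC, ß is C3 9F.
   The literal is split before "e" so that 'e' is not read as part
   of the preceding hexadecimal escape. */
const char greeting[] = "Gr\xC3\xBC\xC3\x9F" "e";

/* Hypothetical helper: count code points by skipping continuation
   bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

Here strlen(greeting) yields 7 (bytes) while utf8_code_points(greeting)
yields 5, and strstr(greeting, "\xC3\x9F") finds the ß at offset 4. The
byte-oriented functions keep working; only their unit is bytes rather
than characters.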

Bruno

Title: Modification of GNU-Compiler to support UTF16-String literals

Purpose

Support u'c' and u"UTF16 string literal", analogous to L'c' and L"wide string literal".

Specification

(see chapter 6.1.4 “String literals” of the C89 standard)

u-string-literal:
        u"s-char-sequence_opt"

s-char-sequence:
        s-char
        s-char-sequence s-char

s-char:
        any member of the source character set except
                the double-quote ", backslash \, or new-line character
        escape-sequence

Implementation

Our approach was to search for all places where L-literals were handled explicitly and to add analogous code for UTF16 string literals.

Step 1: Scanning u-literals

We identified the place in the compiler where L-literals were handled by the scanner and added analogous handling for the u-literals.
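A rough sketch of what such a scanner check might look like. The enum and
function names are hypothetical and are not taken from the GCC sources;
the sketch only shows the prefix test, not the scanning of the literal
body.

```c
/* Hypothetical literal kinds; these names are not taken from GCC. */
enum literal_kind { LIT_NONE, LIT_NARROW, LIT_WIDE, LIT_UTF16 };

/* Classify the token starting at p by its optional prefix and opening
   quote: "..." or '...' is narrow, L"..." or L'...' is wide, and
   u"..." or u'...' is a UTF16 literal. Anything else is not a literal. */
enum literal_kind classify_literal(const char *p)
{
    if (p[0] == '"' || p[0] == '\'')
        return LIT_NARROW;
    if ((p[0] == 'L' || p[0] == 'u') && (p[1] == '"' || p[1] == '\''))
        return p[0] == 'L' ? LIT_WIDE : LIT_UTF16;
    return LIT_NONE;
}
```

An identifier that merely starts with u, such as unsigned, is not
followed by a quote and therefore is not classified as a literal.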

As the type of a scanned u-literal we used an alias of unsigned short for u'c' and an array of that alias for u"...".
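In plain C terms, the resulting types could be pictured as follows. The
alias name utf16_t is an assumption (the text only says that it aliases
unsigned short), and utf16_len is a hypothetical 16-bit counterpart of
strlen.

```c
#include <stddef.h>

/* Assumed alias name; the text only says it aliases unsigned short. */
typedef unsigned short utf16_t;

/* What u"Hi" would denote: an array of utf16_t with a terminating 0. */
const utf16_t hi[] = { 'H', 'i', 0 };

/* Hypothetical 16-bit counterpart of strlen: the number of code units
   before the terminating 0. */
size_t utf16_len(const utf16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}
```

A helper like utf16_len is needed because the str* functions in libc
operate on char and stop at the first zero byte.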

Step 2: Parser

In the parser there was only one location where special handling of L-literals occurred:

the routine that concatenated several strictly adjacent literals into one literal.
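The data operation behind that routine can be sketched as below. The
names are hypothetical, and the sketch deliberately says nothing about
how a mix of narrow, wide and u-literals is treated, since the text does
not state it.

```c
#include <stddef.h>

typedef unsigned short utf16_t;  /* assumed alias name */

/* Concatenate two adjacent 0-terminated u-literals a and b into out,
   which must have room for len(a) + len(b) + 1 code units. The
   terminating 0 of a is dropped. Returns the length of the result in
   code units, excluding the final 0. */
size_t utf16_concat(utf16_t *out, const utf16_t *a, const utf16_t *b)
{
    size_t n = 0;
    while (*a != 0)
        out[n++] = *a++;
    while (*b != 0)
        out[n++] = *b++;
    out[n] = 0;
    return n;
}
```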

Step 3: Semantic analysis

In the semantic analysis we found three places where special handling of L-literals occurred:

- the check whether char/wide char pointers are initialized with a string literal of the proper type
- the permission to convert string literals implicitly to non-const char/wide char pointers
- the initialization of string arrays without the terminating 0 of the string literal (only allowed in C)
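These three rules are shown below with L-literals, since that form
compiles with any C compiler; presumably a patched compiler applies the
same rules to u-literals and the unsigned short alias. The variable and
function names are made up for illustration.

```c
#include <stddef.h>

/* 1. Pointer initialization: the element type must match the literal. */
const wchar_t *wp = L"abc";      /* OK: wide pointer, wide literal */
/* const char *cp = L"abc"; */   /* rejected: narrow pointer, wide literal */

/* 2. Implicit conversion of a string literal to a non-const pointer. */
wchar_t *legacy = L"abc";

/* 3. Array initialization that drops the terminating 0 (C only):
   the array has exactly 3 elements and stores no terminator. */
wchar_t exact[3] = L"abc";

/* Hypothetical helper: number of elements in the array above. */
size_t exact_len(void)
{
    return sizeof exact / sizeof exact[0];
}
```

Since exact carries no terminator, it must not be passed to functions
that expect a 0-terminated string.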

Step 4: Code generation

In the code generation and optimization we found no place where special handling of L-literals occurred.