Re: UTF16 and GCC

Christoph Rohland Thu, 12 Jul 2001 07:32:25 -0700

Hi Joseph,

On Thu, 12 Jul 2001, Joseph S. Myers wrote:
> On Thu, 12 Jul 2001, Nuesser, Wilhelm wrote:
> 
>> There is documentation:
>> ftp://ftp.sap.com/pub/i18N/utf16/ugcc-2.95.2/U_literal_in_GCC.doc
>> Please have a look at it, although it is MS Word ....
> 
> MS Word is not a reasonable format for free software documentation.

I agree that it is not an appropriate format, and we will put a plain
text document on the ftp server as a replacement tomorrow. I have
attached the plain text version to this mail.

> In this case, documentation should be in the form of patches to
> GCC's Texinfo manual, included in the patch itself and distributable
> (as GCC's manual is) under the GNU Free Documentation License.

Yes, we will do that once the discussion has shown whether the
_feature_ is welcome.

> Read through all the discussions in the past couple of months on
> gcc-patches about Apple's attempt to contribute support for Pascal
> string literals to GCC.  This should give some idea of the issues
> you need to address in the documentation and testcases.
> 
> As far as I am concerned, all GCC patches should come with thorough
> documentation, testcases that cover every line of code added or
> changed as far as reasonably practicable, and should fully follow
> the GNU Coding Standards, the GCC coding conventions and the other
> instructions for contributing; and if they don't, I will
> preferentially comment on the lack of these rather than on
> substantive issues of design and implementation, since these
> guidelines are designed to make code easier to read, understand and
> comment on the substance of.

Hey, could we please first discuss the content and later the form? 

We did not ask anybody to include our code and documentation in their
current form in the main gcc sources.

I am a little bit amazed that the tone of the discussion is, IMHO,
somewhat hostile. We simply put a patch for gcc on our ftp
server. Publishing patches is welcome in most of the free
software community.

Actually we simply provided our internal version as a basis for
discussion. I would be more interested in discussing whether this is
an accepted feature than in getting bashed over formal requirements.

If we agree to include this feature in gcc we will happily work
together with you guys to make this a good improvement with proper
documentation and test cases.

Back to content:

>> Oops, no, we _don't_ want to write arbitrary char literals in our
>> code. We do not write NON-Ascii chars in our code, we will stick to
>> pure ascii! But we need another _internal_ presentation of strings
>> in memory during runtime, for example for comparing a user given
>> string with other information inside our application.
> 
> You would still, at the very least, need to ensure that UCNs (\u and
> \U) in your UTF-16 strings end up appropriately encoded in the
> binary (as single UTF-16 characters or as surrogate pairs, depending
> on the value specified).

Yes, that's probably what a full solution would need. In our special
case we are mainly interested in encoding ASCII string literals in
UTF16. But you are probably right: the UCNs should work as well.
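
As a minimal sketch of what that encoding step involves (this is not
taken from our patch; the function name utf16_encode is made up for
illustration): a value below 0x10000 becomes a single 16-bit unit, and
anything above it is split into a surrogate pair.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-16 code units.
   Hypothetical helper, not part of the patch.
   Returns the number of units written (1 or 2), or 0 if the value
   cannot be encoded (a surrogate code point or a value above 0x10FFFF). */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;                       /* reserved for surrogates */
    if (cp > 0x10FFFFUL)
        return 0;                       /* outside the Unicode range */
    if (cp < 0x10000UL) {
        out[0] = (unsigned short) cp;   /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000UL;                    /* 20 bits remain */
    out[0] = (unsigned short) (0xD800UL + (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short) (0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

A \u or \U escape inside a UTF16 literal would pass its value through
a step of this kind before the units are stored in the binary.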

A question here about standard string literals and UTF8 (I am really
no specialist in Unicode et al): what prevents anybody from using UCNs
in standard C string literals with UTF8 encoding?

Greetings
                Christoph


Modification of GNU-Compiler to support UTF16-String literals

Purpose

Support u'c' and u"UTF16 string literal" analogous to L'c' and L"wide
string literal".
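
To illustrate the intent, here is a rough sketch of what a literal
such as u"Hi" would stand for, written out by hand for a compiler
without the modification. The names utf16_unit, greeting, utf16_len
and utf16_cmp are assumptions made for this example only; they do not
appear in the patch.

```c
#include <stddef.h>

/* Hypothetical name for the 16-bit unit type; the modification uses
   an alias of unsigned short. */
typedef unsigned short utf16_unit;

/* What u"Hi" would denote: one 16-bit unit per character plus a
   terminating 0, here spelled out by hand. */
const utf16_unit greeting[] = { 'H', 'i', 0 };

/* Count the units before the terminating 0 (analogous to wcslen). */
size_t utf16_len(const utf16_unit *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Compare two 0-terminated unit sequences (analogous to wcscmp).
   Returns a negative value, 0, or a positive value. */
int utf16_cmp(const utf16_unit *a, const utf16_unit *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}
```

With u-literals the array initializer above could be written directly
as a string, which is the whole point of the extension.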

Specification
(see chapter 6.1.4 "String literals" of the C89 standard)

u-string-literal:
        u" s-char-sequence(opt) "

s-char-sequence:
        s-char
        s-char-sequence s-char

s-char:
        any member of the source character set except the double-quote ",
        backslash \, or new-line character
        escape-sequence


Implementation

Our approach was to find all places where L-literals were handled
explicitly and to add analogous code for UTF16 string literals.

Step 1: Scanning u-literals

We identified the place in the compiler where L-literals were handled
by the scanner and added analogous handling for the u-literals.  As
the type of the scanned u-literals we used an alias of type unsigned
short (for character literals) or array of unsigned short (for string
literals).
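
The effect of this step on a pure ASCII literal can be sketched as
follows. This is an illustration only, not the scanner code from the
patch: the function name widen_ascii is hypothetical, and escape
sequences are deliberately left out.

```c
#include <stddef.h>

/* Widen the body of a pure ASCII literal into 16-bit units and append
   the terminating 0.  Hypothetical helper for illustration.
   Returns the number of units written including the terminator, or 0
   if a non-ASCII byte is found or the buffer is too small. */
size_t widen_ascii(const char *src, unsigned short *dst, size_t cap)
{
    size_t n = 0;

    for (; src[n] != '\0'; n++) {
        unsigned char c = (unsigned char) src[n];
        if (c > 0x7F || n + 1 >= cap)
            return 0;           /* not ASCII, or no room for the 0 */
        dst[n] = c;             /* each character becomes one unit */
    }
    if (n >= cap)
        return 0;               /* no room even for the terminator */
    dst[n] = 0;
    return n + 1;
}
```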

Step 2: Parser

In the parser there was only one location where special handling of
L-literals occurred: the routine that concatenated several strictly
adjacent literals into one literal.
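
As a sketch of what concatenation means for 16-bit units, assuming
both pieces are 0-terminated: the terminator of the first piece is
dropped and a single terminator ends the result, just as for
"ab" "cd" or L"ab" L"cd". The names unit_len and concat_units are
made up for this example and are not the parser routine itself.

```c
#include <stddef.h>
#include <string.h>

/* Number of units before the terminating 0. */
static size_t unit_len(const unsigned short *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Join two 0-terminated unit sequences into dst.  Hypothetical helper.
   Returns the number of units written including the final terminator,
   or 0 if dst (cap units) is too small. */
size_t concat_units(const unsigned short *a, const unsigned short *b,
                    unsigned short *dst, size_t cap)
{
    size_t la = unit_len(a);
    size_t lb = unit_len(b);

    if (la + lb + 1 > cap)
        return 0;
    memcpy(dst, a, la * sizeof *dst);       /* first piece, without its 0 */
    memcpy(dst + la, b, lb * sizeof *dst);  /* second piece follows directly */
    dst[la + lb] = 0;                       /* single terminator at the end */
    return la + lb + 1;
}
```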

Step 3: Semantic analysis

In the semantic analysis we found three places where special handling
of L-literals occurred:
- the check that char/wide char pointers are initialized with a string
  literal of the proper type
- the permission to convert string literals implicitly to a non-const
  char/wide char pointer
- the initialization of character arrays without the terminating 0 of
  the string literal (only allowed in C)
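
The three cases can be shown with ordinary wide literals in standard
C; the u-literals are meant to behave the same way, with the 16-bit
unit type in place of wchar_t. The variable names below are
illustrative only.

```c
#include <stddef.h>
#include <wchar.h>

/* 1. A wide char pointer initialized from a wide literal is accepted.
      A narrow literal ("abc") here would be a type mismatch. */
const wchar_t *wide_ptr = L"abc";

/* 2. Implicit conversion of a string literal to a non-const pointer is
      accepted in C, although writing through the pointer is undefined. */
wchar_t *legacy_ptr = L"abc";

/* 3. An array with exactly as many elements as the literal has
      characters: in C the terminating 0 is simply dropped.  C++
      rejects this initialization. */
wchar_t no_terminator[3] = L"abc";
```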

Step 4: Code generation

In code generation and optimization there was no place where
special handling of L-literals occurred.
