Re: UTF16 and GCC

Bruno Haible Thu, 12 Jul 2001 11:24:43 -0700
Christoph Rohland writes:

> >                     u"UTF-8 string literal"
> > 
> > This way no extra 16-bit string functions are needed - the 8-bit
> > str* functions in libc will do.
> 
> Why do you need a special utf8 string literal? UTF8 can be based on
> standard string literals since in the ASCII range it is the same and
> the basic entity is 8bit.

If we design a feature like u"...", it ought to be usable for
non-ASCII characters as well (such as the quote characters contained
in your .doc file). And C99 doesn't provide a way to reliably
produce UTF-8 strings other than hex or octal escapes:
"\xe2\x82\xac". Thus it is the same problem as you are having, and
it deserves to be solved the same way.
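
To make the escape-based workaround concrete, here is a minimal sketch
in C. The byte sequence is the one quoted above (U+20AC, the euro
sign). The names euro_utf8, euro_byte_length and decode_utf8_3 are
made up for this example, and the decoder assumes a well-formed
three-byte sequence and does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* U+20AC EURO SIGN spelled out as UTF-8 bytes with hex escapes. */
const char euro_utf8[] = "\xe2\x82\xac";

/* Byte count before the terminating NUL, via the ordinary 8-bit strlen. */
size_t euro_byte_length(void)
{
    return strlen(euro_utf8);
}

/* Hypothetical helper: decode one three-byte UTF-8 sequence
   (1110xxxx 10xxxxxx 10xxxxxx) into its code point. No validation. */
unsigned long decode_utf8_3(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    return ((unsigned long)(p[0] & 0x0F) << 12)
         | ((unsigned long)(p[1] & 0x3F) << 6)
         |  (unsigned long)(p[2] & 0x3F);
}
```

The str* functions work on such a string unchanged, but the literal
itself is unreadable, which is the argument for letting the compiler
do the conversion.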

AFAIK, gcc will by default assume that source files are in UTF-8 if no
"-*- coding: XXX -*-" signature is present at the top. But that
doesn't solve the problem when a "coding:" signature is given (in
that case we want the compiler to convert the u"..." strings from
the given encoding to Unicode), and it doesn't work for compilers
other than gcc.

> we will do that after the discussion if the _feature_ is welcome.

I will welcome it if

  1) There are similar facilities for UTF-8 and UCS-4 encoded strings.

  2) A library API for elementary string manipulations on such strings
     (both for UTF-16 and UCS-4) gets standardized. The ISO C99 wchar_t
     APIs are hard to use in practice because the width and encoding
     of wchar_t are implementation-defined.
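
As a rough sketch of what "elementary string manipulations" could
look like for UTF-16, assuming utf16_t is simply a typedef for
uint16_t: the type name is taken from this thread, while u16_strlen,
u16_count_code_points and the sample array are hypothetical names
invented here, not an existing library API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: a fixed-width 16-bit code unit type. */
typedef uint16_t utf16_t;

/* Number of 16-bit code units before the terminating 0 unit. */
size_t u16_strlen(const utf16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Number of code points, assuming well-formed UTF-16: low surrogates
   (0xDC00..0xDFFF) complete a pair whose high surrogate was counted. */
size_t u16_count_code_points(const utf16_t *s)
{
    size_t n = 0;
    for (; *s != 0; s++) {
        if (*s < 0xDC00 || *s > 0xDFFF)
            n++;
    }
    return n;
}

/* U+20AC, U+1D11E (as the surrogate pair D834 DD1E), U+0041, terminator. */
const utf16_t sample[] = { 0x20AC, 0xD834, 0xDD1E, 0x0041, 0 };
```

Unlike wchar_t, the element width here is the same on every platform,
so the two functions mean the same thing everywhere.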

> For C++, is utf16_t special like wchar_t, or a typedef?

Good question.

> Are the strings NUL-terminated?

Yes, just as wchar_t string literals are terminated by L'\0'.
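
For comparison, the existing wide-literal behaviour can be checked
with a few lines of C. The names wide, wide_array_elements and
wide_length are only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* A wide string literal: two characters plus the terminating L'\0'. */
const wchar_t wide[] = L"ab";

/* Array elements, including the terminator the compiler appends. */
size_t wide_array_elements(void)
{
    return sizeof wide / sizeof wide[0];
}

/* Length as seen by the wide-character library, excluding the terminator. */
size_t wide_length(void)
{
    return wcslen(wide);
}
```

A u"..." literal would presumably follow the same pattern, with a
single zero-valued code unit of the literal's own element type at the
end.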

> In C++, is there a deprecated conversion to a pointer to a
> non-const-qualified type?

No. This conversion to 'char *' exists in C++ only because some
functions like exec() take 'char *' arrays.

> What arrays can be initialised from these strings?

utf8_t[], utf16_t[] and ucs4_t[] respectively.

> Do they concatenate with each other

Yes,

> with narrow strings; with wide strings; and what sort of strings
> result?

This is weird; "a" L"b" is invalid, right?
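
A quick check, assuming a C99 compiler: C90 left concatenating narrow
and wide literals undefined, but C99 (6.4.5) treats the combined
sequence as a wide string literal. The names mixed and mixed_length
below are only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Adjacent narrow and wide literals; under C99 rules the result
   is a single wide string literal. */
const wchar_t mixed[] = "a" L"b";

/* Length in wide characters, excluding the terminating L'\0'. */
size_t mixed_length(void)
{
    return wcslen(mixed);
}
```

So the question of what u"..." next to "..." or L"..." should produce
does need an explicit answer in the specification.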

> Is the quiet change to interpretation of programs in which u is a
> macro and is immediately followed by a string literal justified, or
> should the specification use a macro defined in a header to form
> these string literals?

The effect cannot be achieved by pure macrology. But it makes sense to
activate it only if 'u' is *not* defined as a macro.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/