Christoph Rohland writes:
> > u"UTF-8 string literal"
> >
> > This way no extra 16-bit string functions are needed - the 8-bit
> > str* functions in libc will do.
>
> Why do you need a special utf8 string literal? UTF8 can be based on
> standard string literals since in the ASCII range it is the same and
> the basic entity is 8bit.
If we design a feature like u"..." it ought to be usable for
non-ASCII characters as well (such as the quote characters contained
in your .doc file). And C 99 doesn't provide a way to reliably
produce UTF-8 strings, other than hex or octal escapes:
"\xe2\x82\xac". Thus it is the same problem as the one you are having,
and it deserves to be solved the same way.
AFAIK, gcc will by default assume that source files are in UTF-8 if no
"-*- coding: XXX -*-" signature is present at the top. But that
doesn't solve the problem when such a "coding:" signature is given: in
that case we want the compiler to convert the u"..." strings from
the given encoding to Unicode. Nor does it work for compilers other
than gcc.
> we will do that after the discussion if the _feature_ is welcome.
I will welcome it if
1) There are similar facilities for UTF-8 and UCS-4 encoded strings.
2) A library API for elementary string manipulations on such strings
(both for UTF-16 and UCS-4) gets standardized. The ISO C 99 wchar_t
APIs are not very usable in practice because you don't know
what wchar_t is: its width and encoding vary between platforms.
> For C++, is utf16_t special like wchar_t, or a typedef?
Good question.
> Are the strings NUL-terminated?
Yes, just like wchar_t string literals are L'\0' terminated.
> In C++, is there a deprecated conversion to a pointer to a
> non-const-qualified type?
No. This conversion to 'char *' exists in C++ only because some
functions like exec() take 'char *' arrays.
> What arrays can be initialised from these strings?
utf8_t[], utf16_t[] and ucs4_t[] respectively.
> Do they concatenate with each other
Yes,
> with narrow strings; with wide strings; and what sort of strings
> result?
This is weird; "a" L"b" is invalid, right?
> Is the quiet change to interpretation of programs in which u is a
> macro and is immediately followed by a string literal justified, or
> should the specification use a macro defined in a header to form
> these string literals?
The effect cannot be achieved by pure macrology. But it makes sense to
activate it only if 'u' is *not* defined as a macro.
Bruno
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/