> Bruno Haible writes:
> > Wilhelm Nuesser writes:
> >
> >> 2.String and character literals
> >>
> >> For utf16_t literals, we suggest the prefix u (similar to the
> >> prefix L for the type wchar_t):
> >>
> >> utf16_t s[] = u"someText";
> >> utf16_t c = u's';
> >>
> >
> > The need for this language extension that you propose here - namely,
> > being able to view and edit source code on non-Unicode text editors -
> > is already fulfilled by the ISO C 99 / ISO C++ "\uxxxx" and L"\uxxxx"
> > feature. The problem that wchar_t is not guaranteed to represent
> > Unicode is irrelevant, because such programs will work in a given
> > locale only, anyway. For writing international software, I don't
> > recommend putting foreign strings in the code. Put them into message
> > catalogs and use gettext().
>
>
> 16-bit Unicode is being used in existing software. Java is 16-bit
> Unicode. On AIX and Windows NT, wchar_t has 16 bits. The template
> class basic_string in C++ is designed to be instantiated with various
> types. With our proposal, we leave it to the developer to decide
> which Unicode representation best fits his needs.
>
> There are libraries for platform-independent 16-bit Unicode support.
> You mentioned ICU. But there are no literals. The programmer has to
> write something like
>
> unsigned short s[] = {'H', 'e', 'l', 'l', 'o', 0 };
> myfunc( (unsigned short*)"H\000e\000l\000l\000o\000\000" );
>
> Of course, in internationalized applications the texts that are
> displayed to the users should be translated and should not be coded in
> the C source. Nevertheless literals are frequently used for various
> internal purposes.
>
> The latest gcc has an option that makes wchar_t 16 bits long. However,
> there is the danger of mixing up objects compiled with 16-bit
> wchar_t and objects compiled with 32-bit wchar_t, and as far as I know,
> there are no plans to create a glibc with 16-bit wchar_t. So we would
> prefer to work without the new option and to have a new type for 16-bit
> characters.
>
> In the glibc the char and wchar_t versions of some functions (e.g.
> strtol(), strcoll() ) are generated from the same source. It would not
> be too difficult to generate a 16-bit version as well, and the result
> would be more reliable than an independent library. In practice, one
> of the problems we have is that we must migrate old C code to Unicode.
>
> Finally let me point out that the literals are more important for us
> than the library issue.
>
> Yours,
> Ulli
>
>
> ----------
> From: Bruno Haible[SMTP:[EMAIL PROTECTED]]
> Sent: Friday, August 04, 2000 8:00:59 PM
> To: [EMAIL PROTECTED]
> Cc: '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]'; Nuesser, Wilhelm; Rohland,
> Hans-Christoph
> Subject: Re: Proposal for 2 Byte Unicode implementation in gcc and
> glibc
>
> Wilhelm Nuesser writes:
>
> > One simple example: for a typical database used in medium-sized
> > companies of about 100 GB, we find a ratio of about 70 percent strings
> > to 30 percent data. The transition to 2 byte Unicode would increase the
> > disk space to (2*70 + 30) % = 170 %. If we change to 4 byte Unicode the
> > same database would increase to (4*70 + 30) % = 310 %.
>
> Application writers distinguish between the external representation of
> a string (how it is stored on disk) and the internal representation
> (how it is stored in memory most of the time).
>
> About the external representation:
>
> * No one uses UCS-4/UTF-32. It's just too wasteful.
>
> * Many Windows applications use UCS-2 or UTF-16.
>
> * Many Unix applications use UTF-8.
>
> * The particular choice for your applications is up to you. Support
> for all of them is available in glibc-2.1.92, through iconv
> (explicit conversion) or fopen/fgetwc/fputwc (implicit conversion).
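
The explicit conversion route mentioned above can be sketched roughly as
follows. This is a minimal sketch, not code from the thread: the helper name
utf8_to_utf16le is hypothetical, and it assumes the iconv implementation
knows the encoding names "UTF-8" and "UTF-16LE" (glibc's does).

```c
#include <iconv.h>
#include <stddef.h>

/* Convert in_len bytes of UTF-8 text to UTF-16LE using iconv.
 * Returns the number of bytes written to out, or (size_t)-1 on error
 * (unknown encoding, invalid input, or output buffer too small).
 * utf8_to_utf16le is a hypothetical helper name, not a library function. */
size_t utf8_to_utf16le(const char *in, size_t in_len, char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;   /* iconv() takes char **, it does not modify the input */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return out_cap - dst_left;
}
```

Called on the two bytes "Hi" with a 16-byte output buffer, this should
produce the four bytes 'H', 0, 'i', 0.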
>
> About the internal representation:
>
> * Many applications use UTF-8 as internal representation, because it
> does not waste a lot of memory for American and European languages.
>
> * For some complicated tasks, like string pattern matching, temporary
> conversion to UCS-4 is performed, using mbsnrtowcs or equivalent.
>
> * For some simpler tasks, like determining the width of a string,
> often the conversion to UCS-4 is performed on the fly, using
> mbrtowc, without need for memory allocation.
>
> * The ISO C 99 standard and its glibc-2.2 implementation offer their
> entire printf/scanf/IO facilities in both the multibyte (possibly
> UTF-8) and wide (UCS-4 on glibc) flavours.
>
> * Again, the choice is up to you. If you absolutely want the third
> flavour (UTF-16 as in-memory representation), libraries like ICU
> give it to you.
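
As an illustration of the on-the-fly conversion described in the list above,
here is a rough sketch of a width computation built on mbrtowc and wcwidth.
The function name mb_display_width is made up for this example, and the
string is interpreted in whatever encoding the current locale uses (UTF-8
only if the program has selected a UTF-8 locale via setlocale).

```c
#define _XOPEN_SOURCE 700   /* for wcwidth() */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Display width (in columns) of a NUL-terminated multibyte string, decoded
 * one character at a time with mbrtowc, so no wide-character buffer is
 * allocated. Returns -1 on an invalid or incomplete sequence, or when a
 * character is not printable. mb_display_width is a hypothetical name. */
int mb_display_width(const char *s)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);      /* initial conversion state */
    size_t left = strlen(s);
    int width = 0;

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;              /* invalid or truncated sequence */
        if (n == 0)
            break;                  /* NUL reached; not expected given strlen */
        int w = wcwidth(wc);
        if (w < 0)
            return -1;              /* non-printable character */
        width += w;
        s += n;
        left -= n;
    }
    return width;
}
```

For plain ASCII input such as "Hello" the result is simply the number of
characters; the interesting cases (double-width or zero-width characters)
depend on the locale being set up first.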
>
> > These are reasons to use UTF-16:
> >
> > 1.Performance
> >
> > The UTF-16 representation of textual data needs only half the
> > amount of memory that a 32-bit representation would need, provided
> > that surrogate pairs occur only seldom, which will be the
> > case.
>
> Given that most of the world's textual data is ISO-8859-*/KOI8-R,
> encoding it with UTF-8 saves even more memory.
>
> > 2.Portability
> >
> > Software that uses wchar_t has restricted portability since
> > wchar_t sometimes has 32 bits, but sometimes only 16 bits. A
> > dedicated type for Unicode with platform-independent length allows
> > one to write portable software.
>
> Writing portable programs means realizing what is implementation
> dependent and what is not. Yes, sizeof(wchar_t) is implementation
> dependent.
>
> If you don't like that, you are free to use a middleware library (like
> ICU, again) which shields you from the operating system's types.
>
> > 6.Operations and representation of character strings
> >
> > Although UTF-32 makes some operations on characters easier
> > (e.g. indexing into strings) this implementation leads to a great
> > overhead in other areas (see searching, collating, displaying etc.
> > where the whole string is involved).
>
> In any of these areas (searching, collating, displaying) you can
> afford to temporarily convert from UTF-8 or UTF-16 to UCS-4, because
> the actual work involved (canonical [de]composition, treatment of
> combining characters, reordering of vowels, etc) far exceeds the
> conversion cost.
>
> > For a number of languages, the UTF-8 representation saves some
> > storage when compared with UTF-16, but for Asian characters UTF-8
> > requires 50% more storage than UTF-16.
>
> Yes, it does. And for English and German UTF-16 requires 100% more
> storage than UTF-8.
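
The percentages in this exchange follow from the per-character encoded
lengths. A small sketch, with illustrative function names that are not part
of any library, assuming the argument is a valid Unicode scalar value:

```c
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8.
 * Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate). */
unsigned utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;   /* e.g. Latin-1 letters, Greek, Cyrillic */
    if (cp < 0x10000) return 3;   /* rest of the BMP, including CJK */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to encode one code point in UTF-16. */
unsigned utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  /* one code unit, or a surrogate pair */
}
```

An ASCII letter takes 1 byte in UTF-8 and 2 in UTF-16 (100% more for
UTF-16), while a CJK ideograph such as U+4E2D takes 3 bytes in UTF-8 and 2
in UTF-16 (50% more for UTF-8).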
>
> > We do not consider UTF-8 as advantageous for text representation in
> > the memory. It may be well suited for files where access is
> > sequential but in general it is no universal solution.
>
> Whether the access is sequential or random is irrelevant here. When
> doing random access into a UTF-16 encoded string, a program must not
> process the second half of a surrogate pair before the first half, and
> likewise it normally must not process a combining character before its
> preceding base character. Therefore - whether in a UTF-32, UTF-16 or
> UTF-8 world - random access into strings is done via substrings
> (ranges of indices, not singular indices), and then it doesn't matter
> any more whether the substrings are delimited by two "uint32_t *" or
> two "uint16_t *" or two "uint8_t *".
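
To make the surrogate-pair point concrete, here is a sketch of the
adjustment a program would have to make before using an arbitrary index
into a UTF-16 buffer. The name utf16_align_start is hypothetical, the caller
is assumed to pass an index inside the buffer, and combining characters
(the other case mentioned above) are deliberately not handled.

```c
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* If s[i] is the second half of a surrogate pair, step back to the first
 * half so that processing starts on a code point boundary. Otherwise i is
 * returned unchanged. The caller must ensure that i is a valid index. */
size_t utf16_align_start(const uint16_t *s, size_t i)
{
    if (i > 0 && is_low_surrogate(s[i]) && is_high_surrogate(s[i - 1]))
        return i - 1;
    return i;
}
```

With the buffer { 0x0041, 0xD83D, 0xDE00, 0x0042 }, index 2 is moved back to
1, while indices 0, 1 and 3 are left alone.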
>
> > 2.String and character literals
> >
> > For utf16_t literals, we suggest the prefix u (similar to the
> > prefix L for the type wchar_t):
> >
> > utf16_t s[] = u"someText";
> > utf16_t c = u's';
> >
> > For utf32_t, we suggest the prefix U. This is similar to the
> > notation for universal character names in the C++ Standard: \u is
> > followed by four hexadecimal digits and \U is followed by eight
> > hexadecimal digits.
>
> The need for this language extension that you propose here - namely,
> being able to view and edit source code on non-Unicode text editors -
> is already fulfilled by the ISO C 99 / ISO C++ "\uxxxx" and L"\uxxxx"
> feature. The problem that wchar_t is not guaranteed to represent
> Unicode is irrelevant, because such programs will work in a given
> locale only, anyway. For writing international software, I don't
> recommend putting foreign strings in the code. Put them into message
> catalogs and use gettext().
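
For reference, the existing "\uxxxx" feature looks roughly like this in a
wide string literal. The string and the function name are invented for the
example; the only point is that the source file stays pure ASCII.

```c
#include <stddef.h>
#include <wchar.h>

/* "Gruesse" with u-umlaut and sharp s, written with universal character
 * names (\u00fc and \u00df) so no non-ASCII byte appears in the source. */
static const wchar_t greeting[] = L"Gr\u00fc\u00dfe";

/* Number of wide characters in the literal, excluding the terminator.
 * greeting_length is a hypothetical name used only for this sketch. */
size_t greeting_length(void)
{
    return wcslen(greeting);
}
```

The literal holds five wide characters plus the terminating zero, whether
wchar_t is 16 or 32 bits wide on the platform.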
>
> Bruno
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/