Re: Proposal for 2 Byte Unicode implementation in gcc and glibc

Brink, Ulrich Tue, 08 Aug 2000 01:27:25 -0700


> Bruno Haible writes:
> > Wilhelm Nuesser writes:
> >
> >>    2.String and character literals 
> >>  
> >>       For utf16_t literals, we suggest the prefix u (similar to the
> >>       prefix L for the type wchar_t):
> >>  
> >>          utf16_t s[] = u"someText"; 
> >>          utf16_t c = u's'; 
> >>  
> >
> > The need behind the language extension you propose here - namely,
> > being able to view and edit source code in non-Unicode text editors -
> > is already met by the ISO C 99 / ISO C++ "\uxxxx" and L"\uxxxx"
> > feature. The problem that wchar_t is not guaranteed to represent
> > Unicode is irrelevant, because such programs will work in a given
> > locale only, anyway. For writing international software, I don't
> > recommend putting foreign strings in the code. Put them into message
> > catalogs and use gettext().
> 
> 
> 16-bit Unicode is being used in existing software. Java uses 16-bit
> Unicode.  On AIX and Windows NT, wchar_t has 16 bits.  The template
> class basic_string in C++ is designed to be instantiated with various
> types.  With our proposal, we leave it to the developer to decide
> which Unicode representation best fits their needs.
> 
> There are libraries for platform-independent 16-bit Unicode support.
> You mentioned ICU. But there are no literals. The programmer has to
> write something like
> 
>   unsigned short s[] = {'H', 'e', 'l', 'l', 'o', 0 };
>   myfunc( (unsigned short*)"H\000e\000l\000l\000o\000\000" );
> 
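A run-time alternative to the two workarounds above, as a minimal sketch: a helper that widens a 7-bit ASCII literal into a caller-supplied 16-bit buffer. The name ascii_to_u16 is hypothetical and does not come from ICU or glibc. Unlike the cast of a char literal, it does not depend on byte order or on the alignment of the string constant, but it costs a copy and a buffer that a real u"..." literal would not need.

```c
#include <stddef.h>

/* Hypothetical helper (not part of ICU or glibc): copy a 7-bit ASCII
   string into 16-bit units. Returns the number of units written,
   excluding the terminator, or (size_t)-1 if the input contains a
   non-ASCII byte or does not fit into dst_len units. */
size_t ascii_to_u16(unsigned short *dst, size_t dst_len, const char *src)
{
    size_t i = 0;

    if (dst_len == 0)
        return (size_t)-1;
    while (src[i] != '\0') {
        unsigned char c = (unsigned char)src[i];
        /* need room for this unit and for the terminator */
        if (c > 0x7F || i + 1 >= dst_len) {
            dst[0] = 0;               /* leave an empty string behind */
            return (size_t)-1;
        }
        dst[i] = c;
        i++;
    }
    dst[i] = 0;
    return i;
}
```

A caller would declare something like "unsigned short buf[16];" and call ascii_to_u16(buf, 16, "Hello") once at start-up, which is exactly the kind of boilerplate the proposed literals are meant to remove.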
> Of course, in internationalized applications the texts that are
> displayed to the users should be translated and should not be coded in
> the C source. Nevertheless literals are frequently used for various
> internal purposes.
> 
> The latest gcc has an option that makes wchar_t 16 bits long.  However
> there is the danger that you mix up objects compiled with 16-bit
> wchar_t and objects with 32-bit wchar_t, and as far as I know, it is
> not planned to create a glibc with 16-bit wchar_t. So we would prefer
> to work without the new option and to have a new type for 16-bit
> characters.
> 
> In the glibc the char and wchar_t versions of some functions (e.g.
> strtol(), strcoll() ) are generated from the same source. It would not
> be too difficult to generate a 16-bit version as well, and the result
> would be more reliable than an independent library. In practice, one
> of the problems we have is that we must migrate old C code to Unicode.
> 
> Finally let me point out that the literals are more important for us
> than the library issue.
> 
> Yours,
> Ulli
> 
> 
> ----------
> From:         Bruno Haible[SMTP:[EMAIL PROTECTED]]
> Sent:         Friday, August 04, 2000 8:00:59 PM
> To:   [EMAIL PROTECTED]
> Cc:   '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]'; Nuesser, Wilhelm; Rohland,
> Hans-Christoph
> Subject:      Re: Proposal for 2 Byte Unicode implementation in gcc and
> glibc
> Auto forwarded by a Rule
> 
> Wilhelm Nuesser writes:
> 
> > One simple example: for a typical database used in medium sized
> > companies of about 100 GB, we find a ratio of about 70 percent strings
> > to 30 percent data. The transition to 2 byte Unicode would increase the
> > disk space to (2*70 + 30) % = 170 %. If we change to 4 byte Unicode the
> > same database would increase to (4*70 + 30) % = 310 %.
> 
> Application writers distinguish between the external representation of
> a string (how it is stored on disk) and its internal representation
> (how it is stored in memory most of the time).
> 
> About the external representation:
> 
> * No one uses UCS-4/UTF-32. It's just too wasteful.
> 
> * Many Windows applications use UCS-2 or UTF-16.
> 
> * Many Unix applications use UTF-8.
> 
> * The particular choice for your applications is up to you. Support
>   for all of them is available in glibc-2.1.92, through iconv
>   (explicit conversion) or fopen/fgetwc/fputwc (implicit conversion).
> 
> About the internal representation:
> 
> * Many applications use UTF-8 as internal representation, because it
>   does not waste a lot of memory for American and European languages.
> 
> * For some complicated tasks, like string pattern matching, temporary
>   conversion to UCS-4 is performed, using mbsnrtowcs or equivalent.
> 
> * For some simpler tasks, like determining the width of a string,
>   often the conversion to UCS-4 is performed on the fly, using
>   mbrtowc, without need for memory allocation.
> 
> * The ISO C 99 standard and its glibc-2.2 implementation offer their
>   entire printf/scanf/IO facilities in both the multibyte (possibly
>   UTF-8) and wide (UCS-4 on glibc) flavours.
> 
> * Again, the choice is up to you. If you absolutely want the third
>   flavour (UTF-16 as in-memory representation), libraries like ICU
>   give it to you.
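The "on the fly" conversion with mbrtowc mentioned in the list above can be sketched as follows. This is an illustration under assumptions, not code from glibc: the function name mb_char_count is made up, and it counts characters rather than display columns (a real width calculation would additionally call wcwidth() on each decoded character). The string is interpreted in the current locale, so a program would normally call setlocale(LC_ALL, "") first.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters of a multibyte string by decoding one character
   at a time with mbrtowc(). No wide-character buffer is allocated.
   Returns (size_t)-1 if the input is invalid in the current locale. */
size_t mb_char_count(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);      /* initial conversion state */
    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or truncated sequence */
        if (n == 0)
            break;                        /* decoded a null character */
        s += n;                           /* skip the bytes just consumed */
        remaining -= n;
        count++;
    }
    return count;
}
```

For conversion of a whole string at once, as in the pattern-matching case, mbsnrtowcs or mbstowcs with an allocated wchar_t buffer would take the place of the loop.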
> 
> > These are reasons to use UTF-16: 
> >  
> >     1.Performance
> >  
> >       The UTF-16 representation of textual data needs only half the
> >       amount of memory that a 32-bit representation would need, provided
> >       that surrogate pairs occur only seldom, which will be the
> >       case.
> 
> Given that most of the world's textual data is ISO-8859-*/KOI8-R,
> encoding it with UTF-8 saves even more memory.
> 
> >     2.Portability 
> >  
> >       Software that uses wchar_t has restricted portability since
> >       wchar_t sometimes has 32 bits, but sometimes only 16 bits. A
> >       dedicated type for Unicode with platform-independent length makes
> >       it possible to write portable software.
> 
> Writing portable programs means realizing what is implementation
> dependent and what is not. Yes, sizeof(wchar_t) is implementation
> dependent.
> 
> If you don't like that, you are free to use a middleware library (like
> ICU, again) which shields you from the operating system's types.
> 
> >     6.Operations and representation of character strings 
> >       
> >       Although UTF-32 makes some operations on characters easier
> >       (e.g. indexing into strings) this implementation leads to a great
> >       overhead in other areas (see searching, collating, displaying etc.
> >       where the whole string is involved).
> 
> In any of these areas (searching, collating, displaying) you can
> afford to temporarily convert from UTF-8 or UTF-16 to UCS-4, because
> the actual work involved (canonical [de]composition, treatment of
> combining characters, reordering of vowels, etc) far outweighs
> the conversion cost.
> 
> > For a number of languages, the UTF-8 representation saves some
> > storage when compared with UTF-16, but for Asian characters UTF-8
> > requires 50% more storage than UTF-16.
> 
> Yes, it does. And for English and German UTF-16 requires 100% more
> storage than UTF-8.
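The two percentages can be checked per code point. The sketch below uses illustrative function names (utf8_bytes and utf16_bytes are not library calls) and simply encodes the standard length rules of the two encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one Unicode code point in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;   /* Latin-1 supplement, Greek, Cyrillic, ... */
    if (cp < 0x10000) return 3;   /* rest of the BMP, including CJK */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to encode one Unicode code point in UTF-16. */
size_t utf16_bytes(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;  /* one 16-bit unit, or a surrogate pair */
}
```

A CJK ideograph such as U+4E2D takes 3 bytes in UTF-8 against 2 in UTF-16 (50% more), while an ASCII letter takes 1 byte against 2 (100% more for UTF-16). German umlauts take 2 bytes in both encodings, so for real German text the UTF-16 overhead is slightly under 100%.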
> 
> > We do not consider UTF-8 as advantageous for text representation in
> > the memory. It may be well suited for files where access is
> > sequential but in general it is no universal solution.
> 
> Whether the access is sequential or random is irrelevant here. When
> doing random access into a UTF-16 encoded string, a program must not
> process the second half of a surrogate pair before the first half, and
> likewise it normally must not process a combining character before its
> preceding base character. Therefore - whether in a UTF-32, UTF-16 or
> UTF-8 world - random access into strings is done via substrings
> (ranges of indices, not singular indices), and then it doesn't matter
> any more whether the substrings are delimited by two "uint32_t *" or
> two "uint16_t *" or two "uint8_t *".
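The surrogate-pair constraint described above can be made concrete with a small sketch. The function names are hypothetical, and the code covers surrogates only; combining characters would need a further step back based on Unicode character properties, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Move an arbitrary index back to the first unit of the code point that
   contains it, so that the second half of a surrogate pair is never
   processed on its own. */
size_t utf16_char_start(const uint16_t *s, size_t i)
{
    if (i > 0 && is_low_surrogate(s[i]) && is_high_surrogate(s[i - 1]))
        return i - 1;
    return i;
}

/* Decode the code point starting at index i (i < len). An unpaired
   surrogate is returned unchanged. */
uint32_t utf16_decode(const uint16_t *s, size_t len, size_t i)
{
    uint16_t hi = s[i];

    if (is_high_surrogate(hi) && i + 1 < len && is_low_surrogate(s[i + 1]))
        return 0x10000 + (((uint32_t)hi - 0xD800) << 10)
                       + ((uint32_t)s[i + 1] - 0xDC00);
    return hi;
}
```

With the units { 0x0041, 0xD83D, 0xDE00, 0x0042 }, index 2 points into the middle of a pair; utf16_char_start moves it back to index 1, where utf16_decode yields U+1F600.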
> 
> >    2.String and character literals 
> >  
> >       For utf16_t literals, we suggest the prefix u (similar to the
> >       prefix L for the type wchar_t):
> >  
> >          utf16_t s[] = u"someText"; 
> >          utf16_t c = u's'; 
> >  
> >       For utf32_t, we suggest the prefix U. This is similar to the
> >       notation for universal character names in the C++ Standard: \u is
> >       followed by four hexadecimal digits and \U is followed by eight
> >       hexadecimal digits.
> 
> The need behind the language extension you propose here - namely,
> being able to view and edit source code in non-Unicode text editors -
> is already met by the ISO C 99 / ISO C++ "\uxxxx" and L"\uxxxx"
> feature. The problem that wchar_t is not guaranteed to represent
> Unicode is irrelevant, because such programs will work in a given
> locale only, anyway. For writing international software, I don't
> recommend putting foreign strings in the code. Put them into message
> catalogs and use gettext().
> 
>                           Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/