Sorry for resending and perhaps bothering you again, but 
I'm afraid there has been a mail problem ;-)


************* original text follows
**************************************************************

Hi,

there have been some remarks on our proposal which we'd like to comment on.

1) Literals vs. library functions.
    > Jamie Lokier [[EMAIL PROTECTED]] wrote:

        >Therefore, it is good to have conversion functions between UTF-8,
        >UTF-16 and UTF-32.  It is perhaps a nice extension to have the
        >compiler able to parse UTF-16 and UTF-32 constant strings.

        >But I don't see the point in an extensive set of printfU16
        >etc. functions.  Standard unix text files use UTF-8 (or
        >unfortunately they are often ISO-8859-1).  Non-standard formats
        >like databases may use UTF-16, but databases don't use printf to
        >write to the database.


    We'd like to point out that the literals are the most interesting point.

    Reason:
        Missing functions can be implemented by anyone who needs them, but
        the literals and their structure must be defined by the compiler.

        Especially when you want to port existing code to Unicode, you need
        a way to represent your usual (English, 7-bit, ...) string literals
        in Unicode format. The format of the string literals determines the
        way all other strings are handled.

        We DO NOT want to write Hebrew, Arabic, ... glyphs in our sources!
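
    To make this concrete, a minimal sketch (the utf16_t typedef and the
    literal syntax in the comment are assumptions, not features of any
    current compiler):

        typedef unsigned short utf16_t;   /* assumed 16-bit code unit */

        /* Without compiler support, even a plain ASCII string has to be
         * spelled out element by element when porting to UTF-16: */
        static const utf16_t greeting[] = { 'H', 'e', 'l', 'l', 'o', 0 };

        /* With compiler-defined UTF-16 literals (hypothetical syntax),
         * the same port would be a one-line change:
         *
         *     static const utf16_t greeting[] = u"Hello";
         */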

2) external vs. internal representation:
   > Bruno Haible [[EMAIL PROTECTED]] wrote:
        >Application writers distinguish between external representation of
        >string (how it is stored on disk) and internal representation (how
        >it is stored in memory most of the time).

      We agree with respect to the external representation.
        
        >Given that most of the world's textual data is ISO-8859-*/KOI8-R,
        >encoding it with UTF-8 saves even more memory.
        
     This depends on the country, e.g. Japan. Moreover, the variable length
     of UTF-8 implies that buffer sizes must be adapted to the country the
     program runs in (Europe 42 bytes, Japan 84, ...), which is quite ugly.
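
     As a rough sketch of the arithmetic (the bytes-per-code-point values
     follow from the UTF-8 definition; the buffer numbers are only
     illustrative):

        #include <stdint.h>

        /* Bytes needed to encode one code point in UTF-8. */
        static int utf8_bytes(uint32_t cp)
        {
            if (cp < 0x80)    return 1;   /* ASCII                   */
            if (cp < 0x800)   return 2;   /* most European alphabets */
            if (cp < 0x10000) return 3;   /* CJK, e.g. Japanese      */
            return 4;                     /* supplementary planes    */
        }

        /* A 42-character buffer needs up to 84 UTF-8 bytes for accented
         * European text but 126 bytes for Japanese, while UTF-16 needs a
         * flat 84 bytes in both cases (ignoring rare surrogate pairs). */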


    There are good reasons why Java and IBM's ICU have chosen UTF-16 over
    any other implementation.
                
        >If you don't like that, you are free to use a middleware library
        >(like ICU, again) which shields you from the operating system's
        >types.

      Again, you're right, but that helps you with the library, not with
      the literals; see above.

     When comparing UTF-8 and UTF-16, variable-length (multi-unit)
     characters occur much more frequently in UTF-8 than surrogate pairs
     do in UTF-16. The information density per byte in UTF-16 is
     distributed more uniformly than in UTF-8.
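
     Concretely (the ranges come straight from the two encoding
     definitions):

        #include <stdint.h>

        /* Code units needed per code point in UTF-16. */
        static int utf16_units(uint32_t cp)
        {
            return (cp < 0x10000) ? 1 : 2;  /* pairs only above the BMP */
        }

        /* In UTF-8, every code point from U+0080 upwards is already a
         * multi-byte sequence; in UTF-16 a second unit is needed only
         * for the rare code points above U+FFFF. */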

3) concerning implementation details:
    > Edmund GRIMLEY EVANS [[EMAIL PROTECTED]] wrote:

        >          int isalphaU16 (utf16int_t); 

        > What does this do with a surrogate?

        - isalphaU16 should behave like the usual isalpha function in the
          case of a composite character, i.e. it should return an error if
          it gets one half of a surrogate pair.
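
        A minimal sketch of this behaviour (utf16int_t is the proposal's
        type; returning 0 for the error case mirrors isalpha and is our
        assumption):

            #include <stdint.h>

            typedef uint_least16_t utf16int_t;

            /* Non-zero for an alphabetic BMP character, 0 otherwise.  A
             * lone surrogate half (0xD800..0xDFFF) is not a character at
             * all, so it falls into the error case. */
            int isalphaU16(utf16int_t c)
            {
                if (c >= 0xD800 && c <= 0xDFFF)
                    return 0;             /* surrogate half: error */
                if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
                    return 1;             /* ASCII letters, for brevity */
                /* a full version would consult the Unicode database */
                return 0;
            }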

        >          int    mbtoU16    (utf16_t *, const char *, size_t); 

        > How does this output a surrogate pair?

        - That's really an interesting point. Clearly here we have multiple
          code units on both sides, so simply returning one int would not
          be sufficient. We would therefore propose extending the interface
          to

                void mbtoU16 (utf16_t *, const char *, size_t, int *, int *);

          where one of the int* output arguments holds the number of bytes
          read from the source and the other the number of bytes written to
          the target string. Furthermore, one has to require that the
          utf16_t buffer contains enough space to hold a composite
          character (a sketch follows below).
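
        A sketch of the extended interface, assuming the multibyte side is
        UTF-8 (in general it is locale-dependent) and reporting utf16_t
        units written rather than bytes; both counts are 0 on error:

            #include <stddef.h>
            #include <stdint.h>

            typedef uint_least16_t utf16_t;

            /* Convert the first multibyte character of src (assumed UTF-8)
             * to UTF-16.  *src_read gets the bytes consumed, *dst_written
             * the utf16_t units produced (1 or 2); both are 0 on error.
             * dst must have room for two units (a full surrogate pair). */
            void mbtoU16(utf16_t *dst, const char *src, size_t n,
                         int *src_read, int *dst_written)
            {
                const unsigned char *s = (const unsigned char *)src;
                uint32_t cp;
                int len, i;

                *src_read = *dst_written = 0;
                if (n == 0) return;

                if      (s[0] < 0x80) { cp = s[0];        len = 1; }
                else if (s[0] < 0xC0) return;  /* stray continuation byte */
                else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
                else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
                else if (s[0] < 0xF8) { cp = s[0] & 0x07; len = 4; }
                else return;                   /* invalid lead byte */

                if ((size_t)len > n) return;   /* truncated sequence */
                for (i = 1; i < len; i++) {
                    if ((s[i] & 0xC0) != 0x80) return;  /* malformed */
                    cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
                }

                if (cp < 0x10000) {            /* BMP: one unit */
                    dst[0] = (utf16_t)cp;
                    *dst_written = 1;
                } else {                       /* surrogate pair */
                    cp -= 0x10000;
                    dst[0] = (utf16_t)(0xD800 | (cp >> 10));
                    dst[1] = (utf16_t)(0xDC00 | (cp & 0x3FF));
                    *dst_written = 2;
                }
                *src_read = len;
            }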
        

4) memory consumption and performance:
   > Werner LEMBERG [[EMAIL PROTECTED]] wrote:
        > > > For a number of languages, the UTF-8 representation saves some
        > > > storage when compared with UTF-16, but for Asian characters
        > > > UTF-8 requires 50% more storage than UTF-16.
        > >
        > > Yes, it does. And for English and German UTF-16 requires 100%
        > > more storage than UTF-8.

        > You can use SCSU to compress your data.  It works with short
        > strings also (which is not true for generic compression
        > algorithms like LZW).  The Technical Report #6
        > (http://www.unicode.org/unicode/reports/tr6/) gives the following
        > examples:

        >   German:     9 chars (UTF-16:  18 bytes)  ->  SCSU    9 bytes
        >   Russian:    6 chars (UTF-16:  12 bytes)  ->  SCSU    7 bytes
        >   Japanese: 116 chars (UTF-16: 232 bytes)  ->  SCSU  178 bytes


        >    Werner

        - Compressing and de-compressing data you're currently working with
          clearly costs too much time and resources. One can't afford this.


Willi Nüßer                             Markus Eble
SAP LinuxLab                            SAP NLS group

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/
