On Sun, 21 Apr 2002, Andrew Dunbar wrote:

> --- Tomas Frydrych <[EMAIL PROTECTED]> wrote:
> > > Andrew Dunbar <[EMAIL PROTECTED]> wrote:
> > 
> > > Well pretty soon we're going to need a real
> > > replacement.  Dom and I are both in favour of the
> > > replacement being UTF-8 but some here seem to want
> > > UTF-32.
> > 
> > UTF-8 is an encoding scheme that is intended to allow Unicode
> > communication between separate processes over 8-bit channels.
> > For that it is great, but that's about the only thing it is really
> > good for. UTF-8 processing is cumbersome, and as such it is a
> > completely unsuitable format to use for the piecetable. We need a
> > fixed-width encoding for that, such as the current UCS-2, i.e.,
> > UTF-32.
> 
> Please back up these comments.  A lot of people, before they are
> familiar with Unicode and UTF-8, seem to think this.  I did too.
> Then I read reams and reams of newsgroups and mailing lists and
> FAQs.  Now I know why Qt, GTK, QNX, and others use UTF-8
> internally.
> People seem to think that because UTF-8 encodes
> characters as variable length runs of bytes that this
> is somehow computationally expensive to handle.  Not
> so.  You can use existing 8-bit string functions on
> it.
> It is backwards compatible with ASCII.  You can scan
> forwards and backwards effortlessly.  You can always
> tell which character in a sequence a given byte
> belongs to.
> People think random access to these strings using the array
> operator will cost the earth.  Guess what - very little code
> accesses strings as arrays - especially in a word processor.  Of
> the code which does, very little of that needs to.  Even when you
> do perform lots of array operations on a UTF-8 string, people have
> done extensive tests showing that the cost is negligible - look in
> the Unicode literature and you will find all this information.
> People think that UCS-2, UTF-16, or UTF-32 mean we can have
> perfect random access to strings because a character is always
> represented as a single word or longword.  Not so.  UCS-2 should,
> but this term is often used (by Microsoft) to refer to UTF-16.
> UTF-16 uses a mechanism called "surrogates" whereby a single
> character may need two words to represent it.  There goes your
> free array access.  Even UTF-32 is not safe from this, because
> Unicode requires "combining characters".  This means that "á" may
> be represented as "a" followed by a non-spacing acute accent
> (U+0301).  Some people think this is also silly.  These people
> need to go read all about Unicode before they embark on seriously
> multilingual software.  Vietnamese can be supported without
> combining characters, but you won't be able to view the results
> because no Vietnamese fonts exist that work this way - they all
> expect to use combining characters.  Thai needs them.  Hindi needs
> them.  All Indian/Indic languages need them.
> 
> So to sum up, the two arguments not to use UTF-8
> internally are:
> 
> 1) Array access is too slow.
> 
> - This is not true and it is seldom needed.
> 
> 2) UTF-8 means you have to handle a series of values
>    for a single on-screen character.
> 
> - *All* Unicode encodings need this anyway!
> 
> But look around the internet for better arguments and
> better written arguments.
> 
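
To make the quoted claims about scanning UTF-8 concrete, here is a minimal
sketch of finding character boundaries. It assumes well-formed UTF-8 held
in a std::string; the function names are made up for illustration and are
not existing AbiWord code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A UTF-8 continuation byte always has the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Given any byte offset inside well-formed UTF-8, step back to the first
// byte of the character containing it (never more than 3 steps).
inline std::size_t char_start(const std::string& s, std::size_t pos) {
    assert(pos < s.size());
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return pos;
}

// Count code points by counting the bytes that are not continuation bytes.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) {
            ++n;
        }
    }
    return n;
}
```

Note that this counts code points, not on-screen characters: a base letter
followed by a combining accent still counts as two.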


UTF-8 is great for communicating between the piecetable and the widgets. I
think we should definitely do this. What I don't want is for us to store
our text as UTF-8 in the piecetable. We have a *LOT* of code that expects
every position in the piecetable to correspond to exactly one character of
text.

What I think we should do is store our Unicode text as UT_uint32 values in
the piecetable, which can then be randomly accessed the same way we do
things now.
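
A rough sketch of why 32-bit units keep one array slot per code point where
16-bit units cannot: in UTF-16, anything above U+FFFF needs a surrogate
pair. The function name to_utf16 is hypothetical, just for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one code point as UTF-16: one 16-bit unit for U+0000..U+FFFF,
// a surrogate pair (two units) for U+10000..U+10FFFF.
inline std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Surrogate values and anything above U+10FFFF are not encodable.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        return { static_cast<std::uint16_t>(cp) };
    }
    cp -= 0x10000;  // leaves 20 bits, split 10/10 across the pair
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))
    };
}
```

So U+00E1 is a single 16-bit unit, while U+10400 takes two; as a UT_uint32
either one is exactly one slot. This only gives one slot per code point,
though, and a combining sequence would still span several slots.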

We just need to make a global change of UT_UCSChar => UT_UCSChar32 (==
UT_uint32), plus some hardwired fixes for UT_uint16 about the place and
some routines to do UT_UCSChar32 => UTF-8 conversion for transporting
Unicode to the screen.
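
The conversion routine could look roughly like the sketch below. UCS4Char
is only a stand-in for the proposed UT_UCSChar32, and is_scalar_value,
append_utf8 and to_utf8 are hypothetical names, not existing AbiWord
functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the proposed UT_UCSChar32 (== UT_uint32).
using UCS4Char = std::uint32_t;

// True for values that can be encoded: U+0000..U+10FFFF minus surrogates.
inline bool is_scalar_value(UCS4Char cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Append the UTF-8 encoding (1 to 4 bytes) of one code point to `out`.
inline void append_utf8(std::string& out, UCS4Char cp) {
    assert(is_scalar_value(cp));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert a run of 32-bit code points to a UTF-8 byte string.
inline std::string to_utf8(const std::vector<UCS4Char>& text) {
    std::string out;
    for (UCS4Char cp : text) {
        append_utf8(out, cp);
    }
    return out;
}
```

The caller would hand the resulting bytes to whatever draws the text; the
piecetable itself never sees the variable-length form.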

By the way, glib 2.0 (which we need for Pango) has a Unicode character type
(gunichar) which is just a 32-bit unsigned integer like UT_uint32, and plenty
of UCS-4 <=> UTF-8 conversion and convenience routines.

Cheers

Martin

