Re: [lazarus] UTF-8 vs UTF-16 support

Mattias Gärtner Mon, 08 Oct 2007 05:59:21 -0700

Quoting Razvan Adrian Bogdan <[EMAIL PROTECTED]>:

> On 10/8/07, Luca Olivetti <[EMAIL PROTECTED]> wrote:
> > Luca Olivetti wrote:
> >
> > >> You have to go through the string for UTF-8 and UTF-16 encodings so
> > >> the advantages are at least questionable...
> > >
> > > Yes, but my (wrong) premise was that you could assume all characters are
> > > 2 bytes wide, so the Nth character would be at byte N*2.
> >
> > BTW, using strings as arrays of char to get at individual characters is
> > risky business with utf-8.


The same is true of UTF-16, and of treating UTF-16 as UCS-2. UTF-32 is almost
there (some languages combine characters; I don't know how relevant that is).
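As a rough illustration of this point, the following Python sketch counts code units for two sample strings. The helper names and the sample characters are chosen here for illustration and are not from the thread. A character outside the Basic Multilingual Plane needs two 16-bit units in UTF-16, and a letter written with a combining accent needs two units even in UTF-32.

```python
# Illustrative strings (assumptions for this sketch, not from the thread):
clef = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the BMP
e_acute = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT


def utf16_units(s):
    """Number of 16-bit code units in the UTF-16 encoding of s."""
    return len(s.encode("utf-16-le")) // 2


def utf32_units(s):
    """Number of 32-bit code units in the UTF-32 encoding of s."""
    return len(s.encode("utf-32-le")) // 4


clef_utf16 = utf16_units(clef)        # surrogate pair: two units
clef_utf32 = utf32_units(clef)        # one unit
accent_utf32 = utf32_units(e_acute)   # base letter + accent: two units
```

So "Nth character at unit N" fails for UTF-16 as soon as a surrogate pair appears, and for UTF-32 whenever "character" means what the user sees rather than a single code point.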

For most string operations, such as computing the byte length or comparing
strings ASCII case-insensitively, UTF-8 is 100% compatible.
Because of how UTF-8 is encoded, you can even start in the middle of a string
and tell whether a byte begins a character or continues one; stepping back at
most three bytes then tells you whether it is the second, third or fourth byte.
So existing algorithms do not need to change at all to work with UTF-8. The
same is true for UCS-2 code and UTF-16.
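A minimal Python sketch of both properties, operating on raw bytes. The function names are made up for this example and are not the LCL or RTL API. It relies only on the UTF-8 rule that continuation bytes have the bit pattern 10xxxxxx, and on the fact that bytes of multi-byte characters are never in the ASCII range.

```python
def is_continuation(b):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (b & 0xC0) == 0x80


def char_start(data, i):
    """Index of the first byte of the character that covers data[i]."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


def ascii_lower(data):
    """Lowercase only A-Z; every byte >= 0x80 is left untouched."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


def same_text_ascii(a, b):
    """Compare UTF-8 byte strings, ignoring case for ASCII letters only."""
    return ascii_lower(a) == ascii_lower(b)


data = "a\u20acb".encode("utf-8")   # 'a', euro sign (3 bytes), 'b'
start = char_start(data, 2)         # index 2 is inside the euro sign
offset_in_char = 2 - start          # 0-based position within the character
```

Note that the comparison is case-insensitive for ASCII letters only: non-ASCII letters such as É and é still compare as different, which is exactly the behaviour of byte-oriented code written for ASCII.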


> > Or will they be converted to (pseudo)
> > properties and (slowly) do the (slow) right thing?
> > I also suppose that the functions in strutils are not utf-8 aware, so
> > what should we be using in its place?
>
> For single-character processing UTF32 (4 bytes) would be nice :). I
> think functions to count UTF8 chars inside a string and to get each
> char would be nice too, maybe even implemented in FPC for UTF8string,
> such as Length(utf8string), or indexing utf8string[1] to return the
> char, not the byte, as UTF32.

See lcl/lclproc.pas and search for UTF8.
Some of these functions already exist in the RTL. The others may be moved
there eventually.
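For readers who want to see what such helpers do, here is a rough Python sketch of counting characters and fetching the Nth one. The names utf8_length, utf8_char_size and utf8_char_at are hypothetical and are not the functions in lclproc.pas. The sketch assumes the input is valid UTF-8 and uses 1-based indexing, as Pascal strings do.

```python
def utf8_length(data):
    """Count characters (code points) by counting non-continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_char_size(lead):
    """Byte length of a character, derived from its first byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def utf8_char_at(data, n):
    """Return the n-th character (1-based) of valid UTF-8 data as a str."""
    if n < 1:
        raise IndexError("character index starts at 1")
    pos = 0
    for _ in range(n - 1):
        pos += utf8_char_size(data[pos])   # skip one whole character
    size = utf8_char_size(data[pos])
    return data[pos:pos + size].decode("utf-8")
```

Both functions walk the string from the start, so indexing by character is linear in the string length rather than constant time. That is the cost the thread is weighing against UTF-8's compact storage.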


> Since FPC uses ANSI strings a lot, and most text is in Latin-1 without
> any diacritics, using UTF8 in Lazarus is a good choice. If the right
> functions are provided it can be a great choice, unless apps become too
> slow.

In Lazarus most UTF-8 code is in SynEdit. The SynEdit slowdown from ASCII to
UTF-8 was hardly measurable, even ignoring the fact that 90%-98% of the time
is spent in the widgetset.


> Since the web mostly uses UTF8 to minimize transferred data, and most
> databases use it for minimal storage size, it becomes clear that UTF8 is
> a better choice if helper functions exist to assist with its
> management.


Mattias

