Quoting Razvan Adrian Bogdan <[EMAIL PROTECTED]>:

> On 10/8/07, Luca Olivetti <[EMAIL PROTECTED]> wrote:
> > Luca Olivetti wrote:
> >
> > >> You have to go through the string for UTF-8 and UTF-16 encodings so
> > >> the advantages are at least questionable...
> > >
> > > Yes, but my (wrong) premise is that you could assume all characters are
> > > 2 bytes wide, so the Nth character would be at N*2 byte.
> >
> > BTW, using strings as arrays of char to get at individual characters is
> > risky business with utf-8.

It's the same with UTF-16, and with treating UTF-16 as UCS-2. UTF-32 is
almost there (some languages combine characters; I don't know how relevant
that is).

For most string operations, like computing the byte length or comparing
strings ASCII case-insensitively, UTF-8 is 100% compatible. Because of the
way UTF-8 is encoded, you can even start in the middle of a string and find
out whether a byte is the first, second, third or fourth byte of a
character. So existing algorithms don't need to change at all to work with
UTF-8. The same is true for UCS-2 code and UTF-16.

> > Or will be they converted to (pseudo)
> > properties and (slowly) do the (slow) right thing?
> > I also suppose that the functions in strutils are not utf-8 aware, so
> > what should we be using in its place?
>
> For single character processing UTF32 (4bytes) would be nice :), I
> think functions to count UTF8 chars inside a string and getting each
> char would be nice too, maybe even implemented in FPC for UTF8string
> such as Length(utf8string) or indexing utf8string[1] to return the
> char not the byte as UTF32.

See lcl/lclproc.pas and search for UTF8. Some of these functions already
exist in the RTL. The others may be moved eventually.

> Since FPC uses ANSI strings, a lot and most text is in latin1 without
> any diacritics using UTF8 in Lazarus is a good choice, if the right
> functions are provided it can be a great choice unless apps become too
> slow.

In Lazarus most UTF-8 code is in synedit. The synedit slowdown from ASCII
to UTF-8 was hardly measurable, even ignoring the fact that 90%-98% of the
time is spent in the widgetset.

> Since the web uses mostly UTF8 for minimizing transferred data and also
> most databases for minimal storage size it becomes clear that UTF8 is
> a better choice if helper functions exist to assist with its
> management.
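
To illustrate the point about starting in the middle of a string, and the kind of "count chars" and "get the Nth char" helpers asked for, here is a minimal sketch. It is written in Python rather than Pascal, purely for illustration. All function names are hypothetical and are not the routines in lcl/lclproc.pas or the RTL, and it assumes the input is already valid UTF-8 (it does no validation).

```python
def utf8_byte_role(b: int) -> str:
    """Classify one byte value (0..255) by the role its top bits give it."""
    if b < 0x80:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: second, third or fourth byte
    if b < 0xE0:
        return "lead2"         # 110xxxxx: first byte of a 2-byte character
    if b < 0xF0:
        return "lead3"         # 1110xxxx: first byte of a 3-byte character
    if b < 0xF8:
        return "lead4"         # 11110xxx: first byte of a 4-byte character
    return "invalid"           # 11111xxx never occurs in UTF-8


def utf8_char_start(data: bytes, pos: int) -> int:
    """Step back from any byte offset to the first byte of its character."""
    while pos > 0 and utf8_byte_role(data[pos]) == "continuation":
        pos -= 1
    return pos


def utf8_length(data: bytes) -> int:
    """Count code points: every byte that is not a continuation starts one."""
    return sum(1 for b in data if utf8_byte_role(b) != "continuation")


def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th code point (0-based) using a linear scan."""
    seen = -1
    for pos, b in enumerate(data):
        if utf8_byte_role(b) != "continuation":
            seen += 1
            if seen == index:
                end = pos + 1
                # Extend over the continuation bytes of this character.
                while end < len(data) and utf8_byte_role(data[end]) == "continuation":
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError("character index out of range")
```

For the string "a" + EURO SIGN + "n with tilde" + MUSICAL SYMBOL G CLEF (1 + 3 + 2 + 4 bytes), utf8_length returns 4 while the byte length is 10, and utf8_char_start from any offset inside the 4-byte character lands on its lead byte after at most three steps back. Note that utf8_char_at is O(n), which is the cost the thread is discussing. It also counts code points, so combining characters still show up as separate items.
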

Mattias