Re: 3rd-party cross-platform UTF-8 support
From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED] Why would UTF-16 be easier for internal processing than UTF-8? Both are variable-length encodings. Performance tuning is easier with UTF-16. You can optimize for BMP characters, knowing that surrogate pairs are sufficiently uncommon that it's OK for them take a bail-out slow path. Andy Heninger IBM, Cupertino, CA [EMAIL PROTECTED]
Re: 3rd-party cross-platform UTF-8 support
Andy Heninger writes:
> Performance tuning is easier with UTF-16. You can optimize for BMP
> characters, knowing that surrogate pairs are sufficiently uncommon
> that it's OK for them to take a bail-out slow path.

Sure, but if you are using UTF-16 (or any other multibyte encoding) you lose the ability to index characters in an array in constant time. For some applications that is not desirable.

-tree

--
Tom Emerson
Basis Technology Corp.
Sr. Sinostringologist
http://www.basistech.com
Beware the lollipop of mediocrity: lick it once and you suck forever
RE: 3rd-party cross-platform UTF-8 support
Tom,

> Sure, but if you are using UTF-16 (or any other multibyte encoding)
> you lose the ability to index characters in an array in constant
> time. For some applications that is not desirable.

If you implement an array that is directly indexed by Unicode code point, it would have to have 1,114,112 entries, one for every code point up to 1114111. (I love the number.) I don't think that many applications can afford over a megabyte of storage per byte of table width. If nothing else, an array of addresses pointing to valid entries would take about 4.5 MB. Because the new planes are sparsely populated, you can segment your table. In this case you have no real advantage using UTF-32.

I thought that Basis Technology was developed using UCS-2. Have you converted to full UTF-16 support, or are you thinking of changing?

Carl
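The segmented-table idea looks roughly like the following two-stage lookup (a sketch; the block size and table contents are illustrative assumptions, not any particular library's layout). Stage 1 maps each 256-code-point block to a shared block of values, so the empty planes cost almost nothing beyond the stage 1 index itself.

    #include <stdint.h>

    /* A two-stage ("segmented") property table -- an illustrative sketch.
     * Stage 1 maps each 256-code-point block to an index into stage 2;
     * every unassigned block shares block 0, so the sparse planes add
     * almost no storage beyond the 8704-byte stage 1 index. */

    #define NUM_BLOCKS (0x110000 / 256)    /* 4352 blocks of 256 code points */
    #define NUM_DISTINCT_BLOCKS 1          /* grows as real data is added */

    static uint16_t stage1[NUM_BLOCKS];    /* all zero in this empty sketch */
    static uint8_t  stage2[NUM_DISTINCT_BLOCKS][256];

    uint8_t lookup_property(uint32_t cp)
    {
        if (cp > 0x10FFFF)
            return 0;                      /* not a Unicode code point */
        return stage2[stage1[cp >> 8]][cp & 0xFF];  /* two loads, O(1) */
    }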
RE: 3rd-party cross-platform UTF-8 support
Carl W. Brown writes:
> If you implement an array that is directly indexed by Unicode code
> point, it would have to have 1,114,112 entries, one for every code
> point up to 1114111. (I love the number.) I don't think that many
> applications can afford over a megabyte of storage per byte of table
> width. If nothing else, an array of addresses pointing to valid
> entries would take about 4.5 MB. Because the new planes are sparsely
> populated, you can segment your table. In this case you have no real
> advantage using UTF-32.

That wasn't my point: obviously one would not create a lookup table using raw Unicode values. But if I have a text string, and that string is encoded in UTF-16, and I want to access Unicode character values, then I cannot index that string in constant time. To find character n I have to walk all of the 16-bit values in that string, accounting for surrogates. If I use UTF-32 I don't need to do that. This very issue came up during the discussion of how to handle surrogates in Python.

> I thought that Basis Technology was developed using UCS-2. Have you
> converted to full UTF-16 support, or are you thinking of changing?

The current shipping version of Rosette uses UCS-2 internally. Current development is focusing on UTF-16 and UTF-32 support.

-tree

--
Tom Emerson
Basis Technology Corp.
Sr. Sinostringologist
http://www.basistech.com
Beware the lollipop of mediocrity: lick it once and you suck forever
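The difference is easy to see in code. A sketch (hypothetical helper names, assumes well-formed input) of finding character n under each encoding:

    #include <stddef.h>
    #include <stdint.h>

    /* UTF-16: O(n). Every preceding unit must be examined in case it
     * starts a surrogate pair. */
    uint32_t char_at_utf16(const uint16_t *s, size_t len, size_t n)
    {
        for (size_t i = 0; i < len; ) {
            int pair = s[i] >= 0xD800 && s[i] <= 0xDBFF
                    && i + 1 < len
                    && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
            if (n == 0) {
                if (pair)
                    return 0x10000 + ((uint32_t)(s[i] - 0xD800) << 10)
                                   + (s[i + 1] - 0xDC00);
                return s[i];
            }
            n--;
            i += pair ? 2 : 1;
        }
        return 0xFFFD;                  /* index out of range: U+FFFD */
    }

    /* UTF-32: O(1). */
    uint32_t char_at_utf32(const uint32_t *s, size_t n)
    {
        return s[n];
    }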
Re: 3rd-party cross-platform UTF-8 support
From: Tom Emerson [EMAIL PROTECTED]

> But if I have a text string, and that string is encoded in UTF-16,
> and I want to access Unicode character values, then I cannot index
> that string in constant time. To find character n I have to walk all
> of the 16-bit values in that string, accounting for surrogates. If I
> use UTF-32 I don't need to do that. This very issue came up during
> the discussion of how to handle surrogates in Python.

Would this not be the same issue for composite characters, even *in* UTF-32? If you truly mean to work with characters here then it seems this is a problem you can always have.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/
Re: 3rd-party cross-platform UTF-8 support
Michael (michka) Kaplan writes:
> > To find character n I have to walk all of the 16-bit values in that
> > string, accounting for surrogates. If I use UTF-32 I don't need to
> > do that. This very issue came up during the discussion of how to
> > handle surrogates in Python.
>
> Would this not be the same issue for composite characters, even *in*
> UTF-32?

Yes, absolutely. However, in the case of Python we were concerned with being able to access a surrogate pair as a single, validly assigned character.

> If you truly mean to work with characters here then it seems this is
> a problem you can always have.

Of course.

-tree

--
Tom Emerson
Basis Technology Corp.
Sr. Sinostringologist
http://www.basistech.com
Beware the lollipop of mediocrity: lick it once and you suck forever
Re: 3rd-party cross-platform UTF-8 support
On Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes:

> If you are expecting better performance from a library that takes
> UTF-8 APIs and then does all its internal processing in UTF-8
> *without* converting to UTF-16, then I think you are mistaken. UTF-8
> is a bad form for much of the kind of internal processing that ICU
> has to do -- particularly for collation weighting, for example. Any
> library worth its salt would *first* convert to UTF-16 (or UTF-32)
> internally, anyway, before doing any significant semantic
> manipulation of the characters.

Why would UTF-16 be easier for internal processing than UTF-8? Both are variable-length encodings.

--
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^    SUBSTITUTE SIGNATURE ("SYGNATURA ZASTĘPCZA") QRCZAK
Re: 3rd-party cross-platform UTF-8 support
From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED] Why would UTF-16 be easier for internal processing than UTF-8? Both are variable-length encodings. Good straw man! Working with UTF-16 is immensely easier than working with UTF-8. As I am am sure you know! :-) MichKa Michael Kaplan Trigeminal Software, Inc. http://www.trigeminal.com/
Re: 3rd-party cross-platform UTF-8 support
I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for in-process string transformations:

  UTF-16 <-> UTF-8
  UTF-16 <-> UTF-32
  UTF-16 <-> wchar_t*

markus
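For readers following along, a sketch of what a round trip through such a pair of functions looks like, assuming the u_strFromUTF8()/u_strToUTF8() declarations in ICU's unicode/ustring.h (check the headers of your ICU release for the exact names and signatures):

    #include <stdio.h>
    #include <unicode/ustring.h>   /* u_strFromUTF8, u_strToUTF8 */

    int main(void)
    {
        const char *utf8 = "gr\xC3\xBC\xC3\x9F";   /* "gruess" in UTF-8 */
        UChar       utf16[32];
        char        back[32];
        int32_t     len16, len8;
        UErrorCode  status = U_ZERO_ERROR;

        /* UTF-8 -> UTF-16 (srcLength of -1 means NUL-terminated) */
        u_strFromUTF8(utf16, 32, &len16, utf8, -1, &status);

        /* UTF-16 -> UTF-8 */
        u_strToUTF8(back, 32, &len8, utf16, len16, &status);

        if (U_SUCCESS(status))
            printf("round trip: %d UTF-16 units, %d UTF-8 bytes\n",
                   (int)len16, (int)len8);
        return 0;
    }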
Re: 3rd-party cross-platform UTF-8 support
Mozilla also uses Unicode internally, and it is cross-platform.

[EMAIL PROTECTED] wrote:
> For cross-platform software (NT, Solaris, HP, AIX), the only
> 3rd-party Unicode support I have found so far is IBM ICU. It is very
> good support for cross-platform software internationalization.
> However, ICU internally uses UTF-16. For our application, which uses
> UTF-8 as input and output, I have to convert from UTF-8 to UTF-16
> before calling ICU functions (such as ucol_strcoll()). I'm worried
> about the performance overhead of this conversion.

Then... use Unicode internally in your software. Regardless of whether you use UTF-8 or UCS-2 as the data type in the interface, eventually some code needs to convert it to UCS-2 for most of the processing. Unless you use UCS-2 internally, you have to pay for the performance, either inside the library or in your own code.

> Are there any other cross-platform 3rd-party Unicode supports with
> better UTF-8 handling? Thanks a lot. -Changjian Sun
Re: 3rd-party cross-platform UTF-8 support
Markus Scherer wrote:
> I would like to add that ICU 2.0 (in a few weeks) will have
> convenience functions for in-process string transformations:
> UTF-16 <-> UTF-8, UTF-16 <-> UTF-32, UTF-16 <-> wchar_t*

Wait, be careful here. wchar_t is not an encoding, so in theory you cannot convert between UTF-16 and wchar_t. You can, however, convert between UTF-16 and wchar_t* ON Win32, since Microsoft declares UTF-16 to be the encoding for wchar_t. But that is not universally true: different platforms can choose the size of wchar_t and the internal representation of wchar_t*, according to POSIX.
Re: 3rd-party cross-platform UTF-8 support
Yung-Fong Tang wrote:
> > UTF-16 <-> wchar_t*
>
> Wait, be careful here. wchar_t is not an encoding, so in theory you
> cannot convert between UTF-16 and wchar_t. You can, however, convert
> between UTF-16 and wchar_t* ON Win32, since Microsoft declares UTF-16
> to be the encoding for wchar_t. But that is not universally true:
> different platforms can choose the size of wchar_t and the internal
> representation of wchar_t*, according to POSIX.

I know. Don't get me started on the usefulness of wchar_t... We handle this in our convenience function as best as we could figure out. That's what makes it _convenient_ ;-) [Granted, it might also not work everywhere, but it is better than nothing.]

markus
Re: 3rd-party cross-platform UTF-8 support
On Fri, Sep 21, 2001 at 04:16:50PM -0700, Yung-Fong Tang wrote:
> Then... use Unicode internally in your software. Regardless of
> whether you use UTF-8 or UCS-2 as the data type in the interface,
> eventually some code needs to convert it to UCS-2 for most of the
> processing.

Why? UCS-2 shouldn't be used at all, since it covers only the BMP. UTF-16 has all the problems of UTF-8, only in a more limited way. If you can deal with mixed 2-byte and 4-byte characters, you can also deal with 1-, 2-, 3- and 4-byte characters.

--
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends
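The 1-to-4-byte dispatch really is just a branch (or table lookup) on the lead byte. A sketch:

    #include <stdint.h>

    /* Length of a UTF-8 sequence from its lead byte -- a sketch of the
     * "1, 2, 3 and 4 byte" dispatch, the UTF-8 counterpart of checking
     * for a high surrogate in UTF-16. */
    int utf8_sequence_length(uint8_t lead)
    {
        if (lead < 0x80)           return 1;   /* 0xxxxxxx: ASCII */
        if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx        */
        if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx        */
        if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx        */
        return -1;          /* continuation or invalid lead byte */
    }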
RE: 3rd-party cross-platform UTF-8 support
> > UTF-16 <-> wchar_t*
>
> Wait, be careful here. wchar_t is not an encoding, so in theory you
> cannot convert between UTF-16 and wchar_t. You can, however, convert
> between UTF-16 and wchar_t* ON Win32, since Microsoft declares UTF-16
> to be the encoding for wchar_t.

And he can also do so between UTF-16 and UTF-32 for glibc-based programs, since UTF-32 is the encoding for wchar_t on such platforms. The way I read that was UTF-16 <-> UTF-(8*sizeof(wchar_t)). (Please don't ask what happens when sizeof(wchar_t) is 3 or larger than 4, you know what I mean :)). I guess the responsibility for this being a meaningful conversion would lie with the caller.

YA

PS: I don't know a way of knowing the encoding of wchar_t programmatically. Is there one? That'd offer some interesting possibilities..
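There is at least a partial answer in C99: if the implementation defines __STDC_ISO_10646__, wchar_t values are ISO 10646 (Unicode) code points, and combined with sizeof(wchar_t) that distinguishes a UTF-32-style wchar_t (glibc) from a 16-bit one at compile time. A small sketch:

    #include <stddef.h>
    #include <stdio.h>

    int main(void)
    {
        /* C99: __STDC_ISO_10646__, if defined, expands to a date (yyyymmL)
         * and promises that wchar_t values are ISO 10646 code points. */
    #ifdef __STDC_ISO_10646__
        printf("wchar_t holds ISO 10646 code points (as of %ld)\n",
               (long)__STDC_ISO_10646__);
    #else
        printf("the compiler does not specify an encoding for wchar_t\n");
    #endif
        printf("sizeof(wchar_t) = %u bytes\n", (unsigned)sizeof(wchar_t));
        return 0;
    }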
Re: 3rd-party cross-platform UTF-8 support
Changjian Sun said:
> For cross-platform software (NT, Solaris, HP, AIX), the only
> 3rd-party Unicode support I have found so far is IBM ICU. It is very
> good support for cross-platform software internationalization.
> However, ICU internally uses UTF-16. For our application, which uses
> UTF-8 as input and output, I have to convert from UTF-8 to UTF-16
> before calling ICU functions (such as ucol_strcoll()). I'm worried
> about the performance overhead of this conversion.

You shouldn't be. The conversion from UTF-8 to UTF-16 and back is algorithmic and very fast.

If you are expecting better performance from a library that takes UTF-8 APIs and then does all its internal processing in UTF-8 *without* converting to UTF-16, then I think you are mistaken. UTF-8 is a bad form for much of the kind of internal processing that ICU has to do -- particularly for collation weighting, for example. Any library worth its salt would *first* convert to UTF-16 (or UTF-32) internally, anyway, before doing any significant semantic manipulation of the characters.

> Are there any other cross-platform 3rd-party Unicode supports with
> better UTF-8 handling? Thanks a lot. -Changjian Sun

In my opinion, it is unlikely that there are *any* good Unicode libraries that provide pure UTF-8 handling only, inside and out. It is just more efficient, elegant, and higher-performance to take the form conversion hit, but then use a better processing form for manipulating the characters. UTF-8 shines as a legacy API and protocol compatibility form. But it stinks as a processing form.

--Ken
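To see why the conversion is "algorithmic and very fast", here is a minimal UTF-8-to-UTF-16 converter (a sketch that assumes well-formed input and a large-enough output buffer; production code must validate): a few masks, shifts, and branches per character, with no table lookups or allocation.

    #include <stddef.h>
    #include <stdint.h>

    /* Convert well-formed UTF-8 to UTF-16; returns UTF-16 units written. */
    size_t utf8_to_utf16(const uint8_t *src, size_t srclen, uint16_t *dst)
    {
        size_t i = 0, o = 0;
        while (i < srclen) {
            uint32_t cp;
            if (src[i] < 0x80) {                       /* 1 byte: ASCII */
                cp = src[i]; i += 1;
            } else if ((src[i] & 0xE0) == 0xC0) {      /* 2 bytes */
                cp = ((uint32_t)(src[i] & 0x1F) << 6)
                   | (src[i+1] & 0x3F);                i += 2;
            } else if ((src[i] & 0xF0) == 0xE0) {      /* 3 bytes */
                cp = ((uint32_t)(src[i] & 0x0F) << 12)
                   | ((uint32_t)(src[i+1] & 0x3F) << 6)
                   | (src[i+2] & 0x3F);                i += 3;
            } else {                                   /* 4 bytes */
                cp = ((uint32_t)(src[i] & 0x07) << 18)
                   | ((uint32_t)(src[i+1] & 0x3F) << 12)
                   | ((uint32_t)(src[i+2] & 0x3F) << 6)
                   | (src[i+3] & 0x3F);                i += 4;
            }
            if (cp < 0x10000) {
                dst[o++] = (uint16_t)cp;
            } else {               /* supplementary: emit a surrogate pair */
                cp -= 0x10000;
                dst[o++] = (uint16_t)(0xD800 + (cp >> 10));
                dst[o++] = (uint16_t)(0xDC00 + (cp & 0x3FF));
            }
        }
        return o;
    }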
Re: 3rd-party cross-platform UTF-8 support
On Thu, Sep 20, 2001 at 02:02:37PM -0400, [EMAIL PROTECTED] wrote:
> I'm worried about the performance overhead of this conversion.

How much is this performance overhead? Converting UTF-8 to UTF-16 is computationally trivial. My guess is that it would be significant for cat or grep (and maybe not even there: the running time of Unicode regexes and canonicalization of the input may dwarf the running time of the conversion), but not for anything that runs for a significant time or does significant processing on its input, say a word processor or a speech synthesizer. My guess on the overhead may be wrong, but the only way to really find out is to actually measure it - always a good idea in optimization.

--
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends
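In that spirit, a throwaway measurement harness (it calls the hypothetical utf8_to_utf16() sketched earlier; substitute whatever converter you actually use, and vary the input mix to match your data):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Defined in the earlier sketch; swap in your real converter. */
    size_t utf8_to_utf16(const uint8_t *src, size_t srclen, uint16_t *dst);

    int main(void)
    {
        enum { N = 1 << 20, REPS = 100 };          /* 1 MB, 100 passes */
        uint8_t  *in  = malloc(N);
        uint16_t *out = malloc(N * sizeof *out);   /* units <= input bytes */
        memset(in, 'a', N);                        /* ASCII-heavy input */

        clock_t t0 = clock();
        for (int r = 0; r < REPS; r++)
            utf8_to_utf16(in, N, out);
        clock_t t1 = clock();

        /* N is 1 MB, so total megabytes processed equals REPS. */
        printf("%.1f MB/s\n",
               (double)REPS / ((double)(t1 - t0) / CLOCKS_PER_SEC));
        free(in);
        free(out);
        return 0;
    }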
RE: 3rd-party cross-platform UTF-8 support
Ken,

> > I have to convert from UTF-8 to UTF-16 before calling ICU functions
> > (such as ucol_strcoll()). I'm worried about the performance
> > overhead of this conversion.
>
> You shouldn't be. The conversion from UTF-8 to UTF-16 and back is
> algorithmic and very fast.

To make this conversion fast in xIUA (http://www.xnetinc.com/xiua/) I use an externalized version of this converter, so I don't have to go through any of the common ICU conversion overhead. However, there is much more to UTF-8 support than just a converter: many string handling functions require separate implementations.

> If you are expecting better performance from a library that takes
> UTF-8 APIs and then does all its internal processing in UTF-8
> *without* converting to UTF-16, then I think you are mistaken. UTF-8
> is a bad form for much of the kind of internal processing that ICU
> has to do -- particularly for collation weighting, for example. Any
> library worth its salt would *first* convert to UTF-16 (or UTF-32)
> internally, anyway, before doing any significant semantic
> manipulation of the characters.

I agree totally: it is easier to write a collator in UTF-16, and even easier to write one in UTF-32. The cost of conversion to UTF-16 is probably made up by the improved efficiency.

> > Are there any other cross-platform 3rd-party Unicode supports with
> > better UTF-8 handling?

I would not have written xIUA if I knew of a better alternative. I also think that many people like the setlocale style of programming, with an API that looks like standard C library calls, such as xiua_strcoll(str1, str2). If all you need is UTF-8, there are things that you can do with xIUA - it is easier to strip out functionality than to add it.

Carl