Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Andy Heninger

From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED]

 Why would UTF-16 be easier for internal processing than UTF-8?
 Both are variable-length encodings.
 

Performance tuning is easier with UTF-16.  You can optimize for
BMP characters, knowing that surrogate pairs are sufficiently uncommon
that it's OK for them to take a bail-out slow path.
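
Here is a minimal sketch of that split (hand-written C for illustration,
not ICU's actual code; handle_bmp, handle_supplementary, and handle_error
are hypothetical callbacks):

    #include <stdint.h>
    #include <stddef.h>

    extern void handle_bmp(uint16_t u);
    extern void handle_supplementary(uint32_t c);
    extern void handle_error(uint16_t u);

    /* Iterate a UTF-16 buffer: fast path for BMP code units, bail-out
     * slow path only when a surrogate code unit is seen. */
    void process_utf16(const uint16_t *s, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            uint16_t u = s[i];
            if (u < 0xD800 || u > 0xDFFF) {
                handle_bmp(u);                  /* fast path: one unit */
            } else if (u <= 0xDBFF && i + 1 < len
                       && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                /* slow path: combine lead + trail surrogate */
                uint32_t c = 0x10000
                           + (((uint32_t)(u - 0xD800) << 10)
                           |  (uint32_t)(s[i + 1] - 0xDC00));
                handle_supplementary(c);
                i++;                            /* consume trail unit */
            } else {
                handle_error(u);                /* unpaired surrogate */
            }
        }
    }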

Andy Heninger
IBM, Cupertino, CA
[EMAIL PROTECTED]


Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson

Andy Heninger writes:
 Performance tuning is easier with UTF-16.  You can optimize for
 BMP characters, knowing that surrogate pairs are sufficiently uncommon
 that it's OK for them to take a bail-out slow path.

Sure, but if you are using UTF-16 (or any other multibyte encoding)
you lose the ability to index characters in an array in constant
time. For some applications that isn't desirable.

-tree

-- 
Tom Emerson  Basis Technology Corp.
Sr. Sinostringologist  http://www.basistech.com
  Beware the lollipop of mediocrity: lick it once and you suck forever




RE: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Carl W. Brown

Tom,

 Andy Heninger writes:
  Performance tuning is easier with UTF-16.  You can optimize for
  BMP characters, knowing that surrogate pairs are sufficiently uncommon
  that it's OK for them to take a bail-out slow path.

 Sure, but if you are using UTF-16 (or any other multibyte encoding)
 you lose the ability to index characters in an array in constant
 time. For some applications that isn't desirable.

If you implement an array that is directly indexed by Unicode code point it
would have to have 1,114,112 entries, one for each code point from 0 through
1,114,111.  (I love the number)  I don't think that many applications can
afford to have over a megabyte of storage per byte of table width.  If
nothing else it would be an array of addresses pointing to valid entries
that would take about 4.5 MB.  Because the new planes are sparsely populated
you can segment your table.  In this case you have no real advantage using
UTF-32.
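
A segmented table along those lines might look like this sketch (two-stage
lookup in C; the 256-entry block size and the stage1/stage2 arrays are
illustrative assumptions, not any particular library's layout):

    #include <stdint.h>

    #define BLOCK_SHIFT 8
    #define BLOCK_SIZE  (1 << BLOCK_SHIFT)           /* 256 entries  */
    #define NUM_BLOCKS  (0x110000 >> BLOCK_SHIFT)    /* 4352 ranges  */

    /* stage1 maps each 256-code-point range to a block number; sparse
     * planes can all share a single all-default block, so total storage
     * stays far below a flat array over every code point. */
    extern const uint16_t stage1[NUM_BLOCKS];
    extern const uint8_t  stage2[][BLOCK_SIZE];

    static uint8_t lookup(uint32_t cp)    /* cp must be <= 0x10FFFF */
    {
        return stage2[stage1[cp >> BLOCK_SHIFT]][cp & (BLOCK_SIZE - 1)];
    }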

I thought that Basis Technology was developed using UCS-2.  Have you
converted to full UTF-16 support or are you thinking of changing?

Carl





RE: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson

Carl W. Brown writes:
 If you implement an array that is directly indexed by Unicode code point it
 would have to have 1,114,112 entries, one for each code point from 0 through
 1,114,111.  (I love the number)  I don't think that many applications can
 afford to have over a megabyte of storage per byte of table width.  If
 nothing else it would be an array of addresses pointing to valid entries
 that would take about 4.5 MB.  Because the new planes are sparsely populated
 you can segment your table.  In this case you have no real advantage using
 UTF-32.

That wasn't my point: obviously one would not create a lookup table
using raw Unicode values.

But if I have a text string, and that string is encoded in UTF-16, and
I want to access Unicode character values, then I cannot index that
string in constant time.

To find character n I have to walk all of the 16-bit values in that
string accounting for surrogates. If I use UTF-32 I don't need to do
that. This very issue came up during the discussion of how to handle
surrogates in Python.
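
In code, that walk looks something like this sketch (plain C, illustrative
only):

    #include <stddef.h>
    #include <stdint.h>

    /* Return the code-unit offset of the n-th code point in a UTF-16
     * string: O(n) per access, versus O(1) indexing into UTF-32. */
    static size_t utf16_offset_of(const uint16_t *s, size_t len, size_t n)
    {
        size_t i = 0;
        while (n-- > 0 && i < len) {
            /* a lead surrogate consumes two code units, else one */
            i += (s[i] >= 0xD800 && s[i] <= 0xDBFF) ? 2 : 1;
        }
        return i;
    }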

 I thought that Basis Technology was developed using UCS-2.  Have you
 converted to full UTF-16 support or are you thinking of changing?

The current shipping version of Rosette uses UCS-2 internally. Current
development is focusing on UTF-16 and UTF-32 support.

-tree

-- 
Tom Emerson  Basis Technology Corp.
Sr. Sinostringologist  http://www.basistech.com
  Beware the lollipop of mediocrity: lick it once and you suck forever




Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Michael (michka) Kaplan

From: Tom Emerson [EMAIL PROTECTED]

 But if I have a text string, and that string is encoded in UTF-16, and
 I want to access Unicode character values, then I cannot index that
 string in constant time.

 To find character n I have to walk all of the 16-bit values in that
 string accounting for surrogates. If I use UTF-32 I don't need to do
 that. This very issue came up during the discussion of how to handle
 surrogates in Python.

Would this not be the same issue for composite characters, even *in* UTF-32?
If you truly mean to work with characters here then it seems this is a
problem you can always have.
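
To make the point concrete, here is one user-perceived character stored two
ways (illustrative C data, nothing more):

    #include <stdint.h>

    /* "e with acute" as one precomposed code point, or as a base letter
     * plus a combining mark: both are a single character to the user,
     * but the second is two code points even in UTF-32. */
    static const uint32_t precomposed[] = { 0x00E9 };          /* U+00E9    */
    static const uint32_t decomposed[]  = { 0x0065, 0x0301 };  /* e + acute */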


MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: 3rd-party cross-platform UTF-8 support

2001-09-24 Thread Tom Emerson

Michael (michka) Kaplan writes:
  To find character n I have to walk all of the 16-bit values in that
  string accounting for surrogates. If I use UTF-32 I don't need to do
  that. This very issue came up during the discussion of how to handle
  surrogates in Python.
 
 Would this not be the same issue for composite characters, even *in* UTF-32?

Yes, absolutely. However, in the case of Python we were concerned with
being able to access a surrogate pair as a single, validly assigned character.

 If you truly mean to work with characters here then it seems this is a
 problem you can always have.

Of course.

-tree

-- 
Tom Emerson  Basis Technology Corp.
Sr. Sinostringologist  http://www.basistech.com
  Beware the lollipop of mediocrity: lick it once and you suck forever




Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Marcin 'Qrczak' Kowalczyk

Thu, 20 Sep 2001 12:46:49 -0700 (PDT), Kenneth Whistler [EMAIL PROTECTED] writes:

 If you are expecting better performance from a library that takes UTF-8
 API's and then does all its internal processing in UTF-8 *without*
 converting to UTF-16, then I think you are mistaken. UTF-8 is a bad
 form for much of the kind of internal processing that ICU has to do
 for all kinds of things -- particularly for collation weighting, for
 example. Any library worth its salt would *first* convert to UTF-16
 (or UTF-32) internally, anyway, before doing any significant semantic
 manipulation of the characters.

Why would UTF-16 be easier for internal processing than UTF-8?
Both are variable-length encodings.

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SUBSTITUTE SIGNATURE
QRCZAK





Re: 3rd-party cross-platform UTF-8 support

2001-09-22 Thread Michael (michka) Kaplan

From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED]

 Why would UTF-16 be easier for internal processing than UTF-8?
 Both are variable-length encodings.

Good straw man!

Working with UTF-16 is immensely easier than working with UTF-8. As I am
sure you know! :-)


MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer

I would like to add that ICU 2.0 (in a few weeks) will have convenience functions for 
in-process string transformations:

UTF-16 <-> UTF-8
UTF-16 <-> UTF-32
UTF-16 <-> wchar_t*
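
For the UTF-16 <-> UTF-8 pair, usage would look roughly like this sketch
(based on the u_strToUTF8() convenience function in unicode/ustring.h;
check the ICU 2.0 headers for the exact signature):

    #include <unicode/ustring.h>

    /* Sketch: convert a UTF-16 string to UTF-8 in one call;
     * error handling abbreviated. */
    void to_utf8(const UChar *src, int32_t srcLen)
    {
        char dest[256];
        int32_t destLen = 0;
        UErrorCode status = U_ZERO_ERROR;

        u_strToUTF8(dest, (int32_t)sizeof(dest), &destLen,
                    src, srcLen, &status);
        if (U_FAILURE(status)) {
            /* buffer too small or invalid input */
        }
    }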

markus




Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yung-Fong Tang


Mozilla also uses Unicode internally and is cross-platform.

[EMAIL PROTECTED] wrote:

 For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party
 Unicode support I found so far is IBM ICU.  It's a very good support for
 cross-platform software internationalization.  However, ICU internally uses
 UTF-16.  For our application using UTF-8 as input and output, I have to
 convert from UTF-8 to UTF-16 before calling ICU functions (such as
 ucol_strcoll()).  I'm worried about the performance overhead of this
 conversion.

Then... use Unicode internally in your software.  Regardless of whether you
use UTF-8 or UCS-2 as the data type in the interface, eventually some code
needs to convert it to UCS-2 for most of the processing.  Unless you use
UCS-2 internally, you pay for the performance either inside the library or
in your own code.

 Are there any other cross-platform 3rd-party Unicode supports with better
 UTF-8 handling?  Thanks a lot.
 -Changjian Sun



Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yung-Fong Tang



Markus Scherer wrote:

 I would like to add that ICU 2.0 (in a few weeks) will have convenience
 functions for in-process string transformations:

 UTF-16 <-> UTF-8
 UTF-16 <-> UTF-32
 UTF-16 <-> wchar_t*

Wait, be careful here.  wchar_t is not an encoding, so in theory you cannot
convert between UTF-16 and wchar_t.  You can, however, convert between
UTF-16 and wchar_t* on Win32, since Microsoft declares UTF-16 to be the
encoding for wchar_t.  But that is not universally true: different platforms
can choose the size of wchar_t and the internal representation of wchar_t*
according to POSIX.



 markus





Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Markus Scherer

Yung-Fong Tang wrote:
  UTF-16 <-> wchar_t*

 Wait, be careful here.  wchar_t is not an encoding, so in theory you cannot
 convert between UTF-16 and wchar_t.  You can, however, convert between
 UTF-16 and wchar_t* on Win32, since Microsoft declares UTF-16 to be the
 encoding for wchar_t.  But that is not universally true: different platforms
 can choose the size of wchar_t and the internal representation of wchar_t*
 according to POSIX.

I know. Don't get me started on the usefulness of wchar_t...
We handle this in our convenience function as best we could figure out.
That's what makes it _convenient_ ;-)

[Granted, it might also not work everywhere, but it is better than nothing.]

markus




Re: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread David Starner

On Fri, Sep 21, 2001 at 04:16:50PM -0700, Yung-Fong Tang wrote:
 Then... use Unicode internally in your software.  Regardless of whether
 you use UTF-8 or UCS-2 as the data type in the interface, eventually some
 code needs to convert it to UCS-2 for most of the processing.

Why?  UCS-2 shouldn't be used at all, since it covers only the BMP.  UTF-16
has all the problems of UTF-8, only in a more limited way.  If you can deal
with mixed 2-byte and 4-byte characters, you can also deal with 1-, 2-, 3-,
and 4-byte characters.
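
As a sketch of the point, classifying a UTF-8 sequence by its lead byte is
no harder than the surrogate test in UTF-16 (hand-written C, illustrative
only):

    /* Bytes in a UTF-8 sequence, determined from the lead byte. */
    static int utf8_seq_len(unsigned char lead)
    {
        if (lead < 0x80) return 1;     /* ASCII */
        if (lead < 0xC0) return -1;    /* continuation byte: not a lead */
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        if (lead < 0xF8) return 4;
        return -1;                     /* invalid lead byte */
    }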

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends




RE: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yves Arrouye

  UTF-16 <-> wchar_t*

 Wait, be careful here.  wchar_t is not an encoding, so in theory you
 cannot convert between UTF-16 and wchar_t.  You can, however, convert
 between UTF-16 and wchar_t* on Win32, since Microsoft declares UTF-16
 to be the encoding for wchar_t.

And he can also do the same between UTF-16 and UTF-32 for glibc-based
programs, since UTF-32 is the encoding for wchar_t on such platforms.

The way I read that was UTF-16 <-> UTF-(8*sizeof(wchar_t)).  (Please don't
ask what happens when sizeof(wchar_t) is 3 or larger than 4; you know what I
mean :)).  I guess the responsibility for this being a meaningful conversion
would lie with the caller.
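
That reading could be sketched like this (illustrative C; copy_as_utf16()
and widen_to_utf32() are hypothetical helpers, and the caller must know
that the platform's wchar_t really is a UTF form):

    #include <stddef.h>
    #include <stdint.h>
    #include <wchar.h>

    extern void copy_as_utf16(const uint16_t *src, size_t n, void *dst);
    extern void widen_to_utf32(const uint16_t *src, size_t n, void *dst);

    /* Dispatch on sizeof(wchar_t): UTF-16 -> UTF-(8*sizeof(wchar_t)). */
    void utf16_to_wchar(const uint16_t *src, size_t n, wchar_t *dst)
    {
        if (sizeof(wchar_t) == 2)
            copy_as_utf16(src, n, dst);
        else if (sizeof(wchar_t) == 4)
            widen_to_utf32(src, n, dst);
        /* any other size: no meaningful conversion */
    }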

YA

PS: I don't know a way of knowing the encoding of wchar_t programmatically.
Is there one? That'd offer some interesting possibilities..




Re: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread Kenneth Whistler

Changjian Sun said:

 For cross-platform software (NT, Solaris, HP, AIX), the only 3rd-party
 Unicode support I found so far is IBM ICU.  It's a very good support for
 cross-platform software internationalization.  However, ICU internally uses
 UTF-16.  For our application using UTF-8 as input and output, I have to
 convert from UTF-8 to UTF-16 before calling ICU functions (such as
 ucol_strcoll()).

 I'm worried about the performance overhead of this conversion.

You shouldn't be.

The conversion from UTF-8 to UTF-16 and back is algorithmic and very
fast.
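
For reference, the core of that conversion is a short branch on the lead
byte (a simplified sketch that assumes valid input; a real converter must
also check continuation bytes, overlong forms, and surrogate ranges):

    #include <stddef.h>
    #include <stdint.h>

    /* Simplified UTF-8 -> UTF-16; returns UTF-16 code units written. */
    size_t utf8_to_utf16(const unsigned char *s, size_t len, uint16_t *out)
    {
        size_t i = 0, o = 0;
        while (i < len) {
            uint32_t c = s[i];
            if (c < 0x80) {
                i += 1;
            } else if (c < 0xE0) {
                c = ((c & 0x1F) << 6) | (s[i+1] & 0x3F);
                i += 2;
            } else if (c < 0xF0) {
                c = ((c & 0x0F) << 12) | ((s[i+1] & 0x3F) << 6)
                  | (s[i+2] & 0x3F);
                i += 3;
            } else {
                c = ((c & 0x07) << 18) | ((s[i+1] & 0x3F) << 12)
                  | ((s[i+2] & 0x3F) << 6) | (s[i+3] & 0x3F);
                i += 4;
            }
            if (c < 0x10000) {
                out[o++] = (uint16_t)c;
            } else {                    /* split into a surrogate pair */
                c -= 0x10000;
                out[o++] = (uint16_t)(0xD800 | (c >> 10));
                out[o++] = (uint16_t)(0xDC00 | (c & 0x3FF));
            }
        }
        return o;
    }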

If you are expecting better performance from a library that takes UTF-8
API's and then does all its internal processing in UTF-8 *without*
converting to UTF-16, then I think you are mistaken. UTF-8 is a bad
form for much of the kind of internal processing that ICU has to do
for all kinds of things -- particularly for collation weighting, for
example. Any library worth its salt would *first* convert to UTF-16
(or UTF-32) internally, anyway, before doing any significant semantic
manipulation of the characters.

 Are there any other cross-platform 3rd-party Unicode supports with better
 UTF-8 handling?

In my opinion, it is unlikely that there are *any* good Unicode libraries
that provide pure UTF-8 handling only, inside and out. It is just
more efficient, elegant, and higher-performance to take the form
conversion hit, but then use a better processing form for manipulating
the characters.

UTF-8 shines as a legacy API and protocol compatibility form.
But it stinks as a processing form.

--Ken

 Thanks a lot.
 
 -Changjian Sun




Re: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread David Starner

On Thu, Sep 20, 2001 at 02:02:37PM -0400, [EMAIL PROTECTED] wrote:
 I'm worried about the performance overhead of this conversion.

How much is this performance overhead?  Converting UTF-8 to UTF-16 is
computationally trivial; my guess is that it would be significant for cat
or grep (and maybe not even there: the running time of Unicode regexes and
canonicalization of the input may dwarf the running time of the conversion),
but not for anything that will run for a significant time or do significant
processing on the input (say a word processor, or a speech synthesizer).

My guess on the overhead may be wrong, but the only way to really find
out is to actually measure it - always a good idea in optimization.
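
A minimal way to do that measurement (a sketch using the standard C clock()
timer; convert_utf8_to_utf16() stands in for whatever routine is being
measured and is hypothetical here):

    #include <stdio.h>
    #include <time.h>

    extern void convert_utf8_to_utf16(void);   /* routine under test */

    void benchmark(void)
    {
        enum { N = 100000 };
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)
            convert_utf8_to_utf16();
        clock_t t1 = clock();
        printf("%.3f us per conversion\n",
               1e6 * (double)(t1 - t0) / CLOCKS_PER_SEC / N);
    }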

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
When the aliens come, when the deathrays hum, when the bombers bomb,
we'll still be freakin' friends. - Freakin' Friends




RE: 3rd-party cross-platform UTF-8 support

2001-09-20 Thread Carl W. Brown

Ken

  I have to convert from UTF-8 to UTF-16 before calling ICU functions
  (such as ucol_strcoll())
 
  I'm worried about the performance overhead of this conversion.

 You shouldn't be.

 The conversion from UTF-8 to UTF-16 and back is algorithmic and very
 fast.

To make this conversion fast in xIUA http://www.xnetinc.com/xiua/ I use an
externalized version of this converter, so I don't have to go through any of
the common ICU conversion overhead.

However, there is much more to UTF-8 support than just a converter.  Many
string handling functions require separate implementations.

I agree totally: it is easier to write a collator in UTF-16, and even easier
to write one in UTF-32.  The cost of conversion to UTF-16 is probably made up
by the improved efficiency.


 If you are expecting better performance from a library that takes UTF-8
 API's and then does all its internal processing in UTF-8 *without*
 converting to UTF-16, then I think you are mistaken. UTF-8 is a bad
 form for much of the kind of internal processing that ICU has to do
 for all kinds of things -- particularly for collation weighting, for
 example. Any library worth its salt would *first* convert to UTF-16
 (or UTF-32) internally, anyway, before doing any significant semantic
 manipulation of the characters.

  Are there any other cross-platform 3rd party unicode supports
 with better
  UTF-8 handling ?

I would not have written xIUA if I had known of a better alternative.

I also think that many people like the setlocale style of programming, with
an API that looks like standard C library calls, such as
xiua_strcoll(str1, str2);
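
A sketch of that style, using only the call named above (the prototype and
the UTF-8 sample data are assumptions; consult the xIUA documentation for
setup and locale handling):

    /* assumed prototype; see the xIUA headers for the real one */
    extern int xiua_strcoll(const char *s1, const char *s2);

    /* strcoll-style comparison of two UTF-8 strings with xIUA. */
    int compare_names(void)
    {
        const char *s1 = "r\xC3\xA9sum\xC3\xA9";  /* "resume" with accents */
        const char *s2 = "resume";
        return xiua_strcoll(s1, s2);              /* <0, 0, >0 like strcoll */
    }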

If all you need is UTF-8, there are things that you can do with xIUA.  It is
easier to strip out functionality than to add it.

Carl