Re: c++ strings and UTF-8 (other charsets)

Rich Felker Sun, 25 Feb 2007 15:53:30 -0800

On Sat, Feb 24, 2007 at 06:13:37PM +0100, Julien Claassen wrote:
> Hi!
>   What I meant about UTF-8-strings in c++: I mean in c and c++ they're not 
> standard like in Java.


UTF-16, used by Java, is also variable-width: each character takes
either 2 or 4 bytes (one or two 16-bit code units). Support for the
characters that need 4 bytes is generally very poor because of the
misconception that UTF-16 is fixed-width. :(
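
To make the 2-or-4-byte point concrete, here is a minimal sketch of how
one code point maps to UTF-16 code units. The function name
utf16_encode is made up for illustration, and the sketch assumes the
input is a valid code point (it does not reject lone surrogates or
values above 0x10FFFF).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: encode one code point as UTF-16.
// Writes 1 or 2 code units to out and returns how many were written.
// Assumes cp is a valid Unicode scalar value (no validation is done).
std::size_t utf16_encode(std::uint32_t cp, std::uint16_t out[2]) {
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit (2 bytes).
        out[0] = static_cast<std::uint16_t>(cp);
        return 1;
    }
    // Everything above U+FFFF needs a surrogate pair (4 bytes).
    cp -= 0x10000;
    out[0] = static_cast<std::uint16_t>(0xD800 | (cp >> 10));   // high surrogate
    out[1] = static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)); // low surrogate
    return 2;
}
```

Code that assumes one 16-bit unit per character will split such a pair
in the middle, which is the usual source of the poor support.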

> I think UTF-8 is a variable width multibyte charset, so 
> there are specific problems in handling them allocating the right space. I 
> mean the Glib contains something like UString and QT has its QStrings, which 
> I think are also UTF-8 capable.

All strings are UTF-8 capable; the unit of data is simply bytes
instead of characters. If you're looking for a class that treats
strings as a sequence of abstract characters rather than a sequence of
bytes, you could look for a library to do this or write your own.
However, I suspect the most useful way to do this in C++ would be to
extend whatever standard byte-based string class you're using with a
derived class.
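
As a rough illustration of the bytes-versus-characters distinction,
the sketch below counts code points in a std::string that is assumed
to already hold valid UTF-8. It is written as a free function rather
than a derived class only to keep it short; the name utf8_length is
hypothetical, and no validation of the input is attempted.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a UTF-8 string.
// Assumes s is valid UTF-8. Continuation bytes have the bit pattern
// 10xxxxxx, so counting every byte that is NOT a continuation byte
// counts one per encoded character.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For a string containing a two-byte character, s.size() still reports
bytes while utf8_length(s) reports characters, so storage allocation
should keep using the byte count.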

Maybe there's something like this built into the C++ STL classes
already that I'm not aware of. As I said, I don't know much about
(modern) C++. Can someone who knows the language better provide an
answer?

It would also be easier to give you answers if we knew more about what
you're trying to do with the strings, i.e. whether you just need to
store them and spit them back out, or whether you need to do
higher-level Unicode processing like line breaking, collation,
rendering, etc.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
