Re: c++ strings and UTF-8 (other charsets)

Rich Felker Mon, 26 Feb 2007 18:09:51 -0800

On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
> On Mon, Feb 26, 2007 at 08:10:59AM +0100,
>  Marcel Ruff <[EMAIL PROTECTED]> wrote 
>  a message of 65 lines which said:
> 
> > As UTF-8 may not contain '\0' you can simply use all functions as
> > before (strcmp(), std::string etc.).
> 
> As long as you just store or retrieve strings. If you compare them
> (strcmp), you HAVE TO take normalization into account.


No you don't. Nothing in Unicode says that you must treat canonically
equivalent strings as identical, and in fact doing so is a bad idea in
most of the situations I've worked with. Unicode only says that you
should not assume that another process (in the Unicode sense of the
word "process") will treat them as being distinct.

If your particular application has a special need for normalization,
then yes you need to take it into account. But if you're doing
something like passing around filenames you most surely should not be
normalizing anything.

> If you measure
> them (strlen), you HAVE TO use a character semantic, not a byte
> semantic. And so on.

Huh? Length in characters is basically useless to know. Length in
bytes and width of the text when rendered to a visual presentation are
both useful, but the only place where knowing length in number of
characters is useful is for fields that are limited to a fixed number
of characters. If the limit is for the sake of using a fixed-size
storage object, then this limit should just be changed to a limit in
bytes instead of in characters..

> > Old code doesn't need to be ported.
> 
> Very strange advice, indeed.

?? Hardly strange.. It depends on what the code does. See Markus
Kuhn's UTF-8 FAQ.

But Marcel is right about a lot of old code (just not all). Most code
doesn't care at all about the contents of the text, just that it's a
string.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/

Re: c++ strings and UTF-8 (other charsets)

Reply via email to