Re: c++ strings and UTF-8 (other charsets)

Marcel Ruff Tue, 27 Feb 2007 03:17:30 -0800

Rich Felker wrote:

On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:

On Mon, Feb 26, 2007 at 08:10:59AM +0100,

Marcel Ruff <[EMAIL PROTECTED]> wrote a message of 65 lines which said:

As UTF-8 may not contain '\0' you can simply use all functions as
before (strcmp(), std::string etc.).

As long as you just store or retrieve strings. If you compare them
(strcmp), you HAVE TO take normalization into account.


No you don't. Nothing in Unicode says that you must treat canonically
equivalent strings as identical, and in fact doing so is a bad idea in
most of the situations I've worked with. Unicode only says that you
should not assume that another process (in the Unicode sense of the
word "process") will treat them as being distinct.

If your particular application has a special need for normalization,
then yes you need to take it into account. But if you're doing
something like passing around filenames you most surely should not be
normalizing anything.
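
To make the normalization point concrete, here is a minimal C++ sketch. The names `kPrecomposed`, `kDecomposed` and `same_bytes` are made up for illustration and are not from the thread. It shows that two canonically equivalent spellings of "é" are different byte sequences, so strcmp() and std::string comparison treat them as distinct unless the application normalizes first.

```cpp
#include <cstring>
#include <string>

// "é" as one precomposed code point, U+00E9: UTF-8 bytes C3 A9.
const std::string kPrecomposed = "\xC3\xA9";

// "é" as 'e' (U+0065) followed by the combining acute accent U+0301:
// UTF-8 bytes 65 CC 81.
const std::string kDecomposed = "e\xCC\x81";

// Hypothetical helper: byte-wise equality, which is all that strcmp()
// provides. No Unicode normalization is performed here.
bool same_bytes(const std::string& a, const std::string& b) {
    return std::strcmp(a.c_str(), b.c_str()) == 0;
}
```

With these definitions, `same_bytes(kPrecomposed, kDecomposed)` is false even though both strings render as the same character. Whether that is the desired behaviour depends on the application; for filenames, as argued above, byte-wise comparison is usually what you want.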

If you measure
them (strlen), you HAVE TO use a character semantic, not a byte
semantic. And so on.


Huh? Length in characters is basically useless to know. Length in
bytes and width of the text when rendered to a visual presentation are
both useful, but the only place where knowing length in number of
characters is useful is for fields that are limited to a fixed number
of characters. If the limit is for the sake of using a fixed-size
storage object, then this limit should just be changed to a limit in
bytes instead of in characters.
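
For readers who want to see the difference between the two lengths, here is a minimal sketch. The function name `count_code_points` is hypothetical, and the input is assumed to be valid UTF-8. std::string::size() and strlen() report bytes, while a code point count has to skip UTF-8 continuation bytes (those of the form 10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is assumed
// to be valid UTF-8. Every byte that is NOT a continuation byte
// (bit pattern 10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Note that this counts code points, not user-perceived characters: a base letter followed by a combining accent counts as two. Neither number says anything about the rendered width of the text.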

Old code doesn't need to be ported.

Very strange advice, indeed.


?? Hardly strange. It depends on what the code does. See Markus
Kuhn's UTF-8 FAQ.

But Marcel is right about a lot of old code (just not all). Most code
doesn't care at all about the contents of the text, just that it's a
string.

Thanks for all those details.

I can only say that when I started to port a C and a C++ library to support Unicode on Linux/Unix/Windows/WindowsCE, I was totally lost with the heaps of complicated and confusing advice found on the internet (the reason why I joined this mailing list).


But in the end everything was very simple:

1. UTF-8 does not contain zero bytes

2. Doing everything in UTF-8 and keeping my std::string and char* was a very simple solution

3. I would need to define my own data types if I wanted to support UTF-16 (similar to Xerces and all the others)

  This would be a major effort.

4. Take care when passing the strings to other libraries / GUIs, as mentioned in my first post
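
Points 1 and 2 can be sketched in a few lines of C++. The helper name `survives_c_string_round_trip` is hypothetical. It checks that UTF-8 text held in a std::string passes unchanged through a NUL-terminated char*, which holds as long as the text does not contain the code point U+0000 itself (the only code point whose UTF-8 encoding includes a zero byte).

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: true if the string's bytes survive a trip through
// a NUL-terminated char*. This fails only when the data contains a zero
// byte, which in valid UTF-8 means the code point U+0000.
bool survives_c_string_round_trip(const std::string& utf8) {
    const char* raw = utf8.c_str();
    return std::strlen(raw) == utf8.size() && std::string(raw) == utf8;
}
```

For ordinary text such as "Grüße" encoded in UTF-8 this returns true, so existing code that only stores, copies, or passes strings along keeps working without changes.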


Getting to the above *simple* insight took me several confused days;
after that, the porting effort was done in one day.

I just wanted to share this to save others all the confusion,

Marcel

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/


