Re: c++ strings and UTF-8 (other charsets)

ＳｒｉｎＴｕａｒ Tue, 27 Feb 2007 06:59:11 -0800

On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:

>> Old code doesn't need to be ported.
>
> Very strange advice, indeed.


You might want to read up on the history of UTF-8.
Not needing to make any code changes at all to most applications was in
fact one of the primary design goals of the encoding.
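
To illustrate the point, here is a minimal sketch of the kind of old,
byte-oriented code that keeps working on UTF-8 input. The function name
base_name is made up for this example. It relies only on the fact that
every byte of a multi-byte UTF-8 sequence has its high bit set, so a
search for the ASCII byte '/' can never match in the middle of a
character.

```cpp
#include <cstring>

// Return the part of path after the last '/', or path itself if there
// is no '/'. Written with no knowledge of UTF-8: it only looks for the
// single byte 0x2F, which never occurs inside a multi-byte sequence.
const char* base_name(const char* path) {
    const char* slash = std::strrchr(path, '/');
    return slash ? slash + 1 : path;
}
```

Given a path whose directory and file names contain accented letters
encoded as UTF-8, this returns the file name with its bytes intact.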

> If you measure them (strlen), you HAVE TO use a character semantic,
> not a byte semantic.

I have yet to encounter a case where a "character" count is useful.
Display length is sometimes useful, mostly in graphics or UI code, but
even then it has little to do with character count. 99.5% of the
time, strlen is used to determine storage requirements or buffer
length.
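
A rough sketch of the difference, assuming the input is valid UTF-8.
The helper names storage_bytes, code_points and copy_bytes are
invented for this example. Only the byte count is needed to size a
buffer. The code point count is shown for comparison, and it is still
not the display length, since a base letter plus a combining accent is
two code points but one visible character.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Number of bytes needed to store s, excluding the terminating NUL.
// This is exactly what strlen reports, whatever the encoding is.
std::size_t storage_bytes(const char* s) {
    return std::strlen(s);
}

// Number of Unicode code points in s, assuming s is valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted.
std::size_t code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Copy s into a buffer sized from the byte count; no character
// semantics are involved at any point.
std::string copy_bytes(const char* s) {
    std::string out(storage_bytes(s), '\0');
    std::memcpy(&out[0], s, out.size());
    return out;
}
```

For the three bytes 'e', 0xCC, 0x81 (an "e" followed by U+0301
COMBINING ACUTE ACCENT), storage_bytes gives 3 and code_points gives 2,
while a terminal would normally draw a single character.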

> If you compare them (strcmp), you HAVE TO take normalization into account.


Hrm, I would say that is incorrect. You don't want to normalize input
most of the time.
When you are going to case-fold, perhaps for searching, it's almost
always all right to normalize. If you are a big fat word processor, or
an import/conversion tool, it's also okay. Most other programs are
better off not normalizing or even being aware of the concept, and
are better off assuming that their input is in a suitable format for
storage or output.
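
A small sketch of what "not being aware of the concept" looks like in
practice. The names kPrecomposed, kDecomposed and same_bytes are
invented for this example. strcmp sees two canonically equivalent
spellings of the same letter as different strings, and a program that
simply stores or forwards its input passes either one through
unchanged. Actually normalizing would need a Unicode library, which is
not shown here.

```cpp
#include <cstring>

// "é" as the single precomposed code point U+00E9 (NFC form).
const char kPrecomposed[] = "\xC3\xA9";

// "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD form).
const char kDecomposed[] = "e\xCC\x81";

// Byte-wise equality, which is all strcmp knows about.
bool same_bytes(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}
```

Here same_bytes(kPrecomposed, kDecomposed) is false even though both
render as the same letter. Whether that matters depends on the
program: for a search feature it probably does, for storage or output
it usually does not.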

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
