Re: c++ strings and UTF-8 (other charsets)

Keld Jørn Simonsen Tue, 27 Feb 2007 17:31:54 -0800

On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
> On Mon, Feb 26, 2007 at 08:10:59AM +0100,
>  Marcel Ruff <[EMAIL PROTECTED]> wrote 
>  a message of 65 lines which said:
> 
> > As UTF-8 may not contain '\0' you can simply use all functions as
> > before (strcmp(), std::string etc.).
> 
> As long as you just store or retrieve strings. If you compare them
> (strcmp), you HAVE TO take normalization into account. If you measure
> them (strlen), you HAVE TO use a character semantic, not a byte
> semantic. And so on.


No, you do not have to normalize the data in order to compare it. If you
follow ISO 14651/Unicode and compare at some level of precision other
than absolute equality, the comparison works on unnormalized data.
And that is the normal way of comparing anyway. E.g. when searching for
a phrase in a document, you would normally do a case-insensitive
comparison. And even if you do a case-sensitive comparison, you can use
the 14651 data, or the data for your locale, on unnormalized data.
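
To make the distinction concrete, here is a minimal sketch of what
"absolute equality" does with unnormalized data. The two byte strings
below are my own illustration (not from the thread): the precomposed
and the decomposed UTF-8 spelling of "é". The helper names
count_code_points and bytes_equal are hypothetical. A byte-wise
strcmp() sees two different strings, which is exactly the case a
level-based 14651 comparison is meant to handle.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE (2 bytes).
const char composed[] = "\xC3\xA9";
// Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT (3 bytes).
const char decomposed[] = "e\xCC\x81";

// Absolute (byte-wise) equality, i.e. what strcmp() gives you.
bool bytes_equal(const char* a, const char* b) {
    return std::strcmp(a, b) == 0;
}

// Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

Here bytes_equal(composed, decomposed) is false, strlen() reports 2
and 3 bytes, and count_code_points() reports 1 and 2 code points, even
though both strings display as the same character.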

The only catch is that 14882 (the C++ standard) does not provide an API
for doing 14651 collation at different levels of precision. Maybe we
could make such an API, but probably in a future library TR.
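
For reference, a rough sketch of the collation interface that 14882
does offer, the std::collate facet. The wrapper name collate_compare
is hypothetical, and it defaults to the classic "C" locale only so
that the sketch runs anywhere; a named locale passed to std::locale
may not be installed on a given system. Note that compare() takes two
character ranges and nothing else, so there is no way to ask for a
particular comparison level.

```cpp
#include <locale>
#include <string>

// Compare two strings using the collate facet of the given locale.
// Returns a negative value, zero, or a positive value.
// There is no parameter for selecting a comparison level.
int collate_compare(const std::string& a, const std::string& b,
                    const std::locale& loc = std::locale::classic()) {
    const std::collate<char>& coll =
        std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

Whatever precision the locale's collation data implies is what you
get; the caller cannot choose, say, a case-insensitive level through
this interface.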

best regards
keld

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
