On Mon, Feb 26, 2007 at 03:35:05PM +0100, Stephane Bortzmeyer wrote:
> On Mon, Feb 26, 2007 at 08:10:59AM +0100,
>  Marcel Ruff <[EMAIL PROTECTED]> wrote
>  a message of 65 lines which said:
>
> > As UTF-8 may not contain '\0' you can simply use all functions as
> > before (strcmp(), std::string etc.).
>
> As long as you just store or retrieve strings. If you compare them
> (strcmp), you HAVE TO take normalization into account.

No you don't. Nothing in Unicode says that you must treat canonically
equivalent strings as identical, and in fact doing so is a bad idea in
most of the situations I've worked with. Unicode only says that you
should not assume that another process (in the Unicode sense of the
word "process") will treat them as being distinct.

If your particular application has a special need for normalization,
then yes, you need to take it into account. But if you're doing
something like passing around filenames, you most surely should not be
normalizing anything.

> If you measure
> them (strlen), you HAVE TO use a character semantic, not a byte
> semantic. And so on.

Huh? Length in characters is basically useless to know. Length in
bytes and width of the text when rendered to a visual presentation are
both useful, but the only place where knowing the length in number of
characters is useful is for fields that are limited to a fixed number
of characters. If the limit is for the sake of using a fixed-size
storage object, then this limit should just be changed to a limit in
bytes instead of in characters.

> > Old code doesn't need to be ported.
>
> Very strange advice, indeed.

?? Hardly strange. It depends on what the code does. See Markus
Kuhn's UTF-8 FAQ. But Marcel is right about a lot of old code (just
not all). Most code doesn't care at all about the contents of the
text, just that it's a string.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
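
To make the strcmp()/strlen() points above concrete, here is a minimal
sketch in C. It is only an illustration, not code from the thread: the
names PRECOMPOSED, DECOMPOSED and utf8_codepoints() are made up for
this example, and the counter assumes its input is valid UTF-8. It
takes two canonically equivalent spellings of "é" and shows what the
byte-oriented functions report for each.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Two canonically equivalent spellings of "é", given as explicit
 * UTF-8 bytes so the example does not depend on the source encoding.
 * (Hypothetical names, chosen for this sketch.) */
static const char PRECOMPOSED[] = "\xC3\xA9";   /* U+00E9 */
static const char DECOMPOSED[]  = "e\xCC\x81";  /* U+0065 U+0301 */

/* Count code points by counting every byte that is not a UTF-8
 * continuation byte (10xxxxxx). Assumes s is valid UTF-8; this is an
 * illustrative helper, not a validator. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With these definitions, strcmp(PRECOMPOSED, DECOMPOSED) is nonzero, so
a plain byte comparison keeps the two spellings distinct, which is the
behaviour you want for something like filenames. strlen() gives 2 and
3 bytes respectively, which is what you need for sizing storage, while
utf8_codepoints() gives 1 and 2 even though both strings render as a
single glyph, so the "length in characters" matches neither the
storage size nor the visual width.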
