On Thu, Oct 04, 2001 at 10:45:41AM -0700, Carl W. Brown wrote:
> You are right that while functions like strstr will work with UTF-8, they are much
> slower. strstr compares the matching string to the source byte by byte until a
> mismatch, then it increments the source by one byte. If this byte is a continuation
> character there will be no hit. This should not be too much of a problem since it
> should immediately mismatch. It is a bit slower but not too bad.
A big hit, but I wonder how much of it is avoidable. The three cases here, I think,
are: a dumb strstr (which ends up comparing continuation bytes); a strstr that knows
UTF-8 (and avoids comparing those bytes); or converting to UCS-2 or UCS-4 and doing a
memcmp. I think skipping continuations would be a speed hit--you'd be taking the
(minor) hit of UTF-8 decoding logic for every character, and all you're saving is a
few byte compares. (Actually, a lot of byte compares, but it's a lot less code.)

> I don't think that the extra paging due to extra memory usage is too bad. We get
> bigger and faster systems every day.

I disagree--realize that if you're dealing with English text and converting to UCS-4,
you're blowing the strings up to 4x their original size. This is less of an issue for
most other languages, of course, where UTF-8 is bigger, but you're still adding a very
expensive strcpy (essentially, anyway--the cost of copying plus conversion) for every
comparison. In general, I don't think it's a good idea to discard the idea of keeping
code quick--and in the case of fundamental string operations, it's still very
important. (I want faster systems to mean my programs run *faster*; I don't want to
break even. :)

> If you use wide character support you have to use it everywhere. You can not
> convert a string from UTF-8 to UTF-32 and tokenize it with wcstok and expect the
> results to be mapped back to the original UTF-8 string. You have to go WC all the
> way. That means a lot of program constants will also have to be changed. With UTF-8
> you don't have to change any constants that are pure ASCII.

> The big hit comes with debugging. It is a pain to read the UTF-32 strings. This
> really increases the development cost, especially with non-i18n programmers who
> don't keep a copy of the Unicode book on their desk at all times.

This leaves many people preferring UTF-8--and leaves us with the slow-strstr
situation above (and likewise for all the other string ops).
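To make the second case concrete, here's a minimal sketch of a strstr that knows
UTF-8: it only starts a compare at lead bytes, never at continuation bytes. The names
(u8_strstr, u8_is_cont) are mine for illustration, not from any real library:

```c
#include <stddef.h>
#include <string.h>

/* Is this byte a UTF-8 continuation byte (10xxxxxx)? */
static int u8_is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* strstr that knows UTF-8: candidate match positions are only ever
 * lead bytes, so we never "find" the needle starting in the middle
 * of a character.  strncmp stops at the first NUL, so this never
 * reads past the end of the haystack. */
char *u8_strstr(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    const char *p;

    if (nlen == 0)
        return (char *)haystack;
    for (p = haystack; *p != '\0'; p++) {
        if (u8_is_cont((unsigned char)*p))
            continue;        /* skip continuation bytes entirely */
        if (strncmp(p, needle, nlen) == 0)
            return (char *)p;
    }
    return NULL;
}
```

Note that a plain strstr searching "caf\xC3\xA9" for the single byte "\xA9" would
bogusly match the trailing byte of the e-acute; this version returns NULL. As argued
above, though, the per-byte test you add probably costs about as much as the byte
compares you skip.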
Looking at the three major choices--UTF-8, UCS-2/4, and DBCS--all seem to have major
pitfalls right now. UTF-8 leaves us with slow string ops; UCS-2 and UCS-4 leave us
with more memory usage and much harder debugging; DBCS leaves us with an unknown
string type (the program doesn't know anything about it). Both UCS-2/4 and UTF-8 mean
you're doing a bunch of conversion if you want full MBCS support; UCS-2/4 means
you're doing conversions for all I/O; and MBCS means iterating in reverse is slow.

I'd probably take the C++ route if I wanted full MBCS support: keep strings in their
native format whenever possible, convert when needed for speed, and special-case
things like reverse iterators for UTF-8. This isn't really appropriate for small
projects, though. Too bad C++'s own string class sucks.

> You can do that with xIUA and ICU. In fact you might want to use the same sort of
> support with glibc. That way if you want to go to ICU later or port to another
> platform you only have one piece of code to change.

> xIUA supports different encodings dynamically. You can have a routine that gets
> called with EUC-JP, UTF-8 or UTF-32 data and they all are handled correctly. You
> can also invoke the UTF-8 support explicitly and save the overhead of checking to
> see which routine to call. If you are communicating with browsers, for example,
> they don't all support UTF-8 properly. It even has a bonus for HTML and XML in that
> you can tell the converter that any character that does not convert will
> automatically be converted to an NCR sequence. This way you can send Japanese with
> the iso-8859-1 code page and not lose a character.

Sounds like the above. Sounds like it might be fairly big, too, which is annoying for
most small- and medium-sized projects--most OSS developers are very hesitant to add a
major dependency, especially the cross-platform ones (where this will often mean
shipping binary packages along with the runtimes for the dependency).
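For what it's worth, the reverse-iterator special case for UTF-8 isn't actually that
bad: continuation bytes are self-identifying (10xxxxxx), so stepping backwards costs
at most a few extra byte tests, unlike legacy DBCS where you may have to rescan from
the start of the string to tell a trail byte from a lead byte. A sketch (u8_prev is a
hypothetical helper, not from any real API):

```c
#include <stddef.h>

/* Step a UTF-8 "reverse iterator" back one character: back up one
 * byte, then keep backing up over continuation bytes (10xxxxxx)
 * until we land on the lead byte of the previous character. */
const char *u8_prev(const char *start, const char *p)
{
    if (p <= start)
        return NULL;                 /* already at the start */
    do {
        p--;
    } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
    return p;
}
```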
--
Glenn Maynard

Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
