On Thu, Oct 04, 2001 at 10:45:41AM -0700, Carl W. Brown wrote:
> You are right that while functions like strstr will work with UTF-8, they are much
> slower. strstr compares the matching string to the source byte by byte until a
> mismatch, then it increments the source by one byte. If this byte is a continuation
> character there will be no hit. This should not be too much of a problem since it
> should immediately mismatch. It is a bit slower but not too bad.
A big hit, but I wonder how much of it is avoidable. The three cases here, I think,
are: a dumb strstr (which ends up comparing continuation bytes); a strstr that knows
UTF-8 (and avoids comparing those bytes); or converting to UCS-2 or UCS-4 and doing a
memcmp. I think skipping continuations would be a speed hit--you'd be taking the
(minor) hit of UTF-8 decoding logic for every character, and all you're saving is a
few byte compares. (Actually, a lot of byte compares, but it's a lot less code.)

> I don't think that the extra paging due to extra memory usage is too bad. We get
> bigger and faster systems every day.

I disagree--realize that if you're dealing with English text and converting to UCS-4,
you're blowing the strings up to 4x their original size. This is less of an issue for
most other languages, of course, where UTF-8 is bigger, but you're still adding a very
expensive strcpy (essentially, anyway--the cost of copying plus conversion) for every
comparison. In general, I don't think it's a good idea to discard the idea of keeping
code quick--and in the case of fundamental string operations, it's still very
important. (I want faster systems to mean my programs run *faster*; I don't want to
break even. :)

> If you use wide character support you have to use it everywhere. You can not
> convert a string from UTF-8 to UTF-32 and tokenize it with wcstok and expect the
> results to be mapped back to the original UTF-8 string. You have to go WC all the
> way. That means a lot of program constants will also have to be changed. With UTF-8
> you don't have to change any constants that are pure ASCII.

> The big hit comes with debugging. It is a pain to read the UTF-32 strings. This
> really increases the development cost, especially with non-i18n programmers who
> don't keep a copy of the Unicode book on their desk at all times.

This leaves many people preferring UTF-8--and leaves us with the slow-strstr
situation above (and likewise for all the other string ops).
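To make the second case concrete, here's a minimal sketch of a strstr that knows
UTF-8: it only starts a compare at lead bytes, never at continuation bytes. The names
(u8_strstr, u8_is_cont) are mine for illustration, not from any real library:

```c
#include <stddef.h>
#include <string.h>

/* Is this byte a UTF-8 continuation byte (10xxxxxx)? */
static int u8_is_cont(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* strstr that knows UTF-8: candidate match positions are only ever
 * lead bytes, so we never "find" the needle starting in the middle
 * of a character.  strncmp stops at the first NUL, so this never
 * reads past the end of the haystack. */
char *u8_strstr(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    const char *p;

    if (nlen == 0)
        return (char *)haystack;
    for (p = haystack; *p != '\0'; p++) {
        if (u8_is_cont((unsigned char)*p))
            continue;        /* skip continuation bytes entirely */
        if (strncmp(p, needle, nlen) == 0)
            return (char *)p;
    }
    return NULL;
}
```

Note that a plain strstr searching "caf\xC3\xA9" for the single byte "\xA9" would
bogusly match the trailing byte of the e-acute; this version returns NULL. As argued
above, though, the per-byte test you add probably costs about as much as the byte
compares you skip.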
Looking at the three major choices--UTF-8, UCS-2/4, and DBCS--all seem to have major
pitfalls right now. UTF-8 leaves us with slow string ops; UCS-2 and UCS-4 leave us
with more memory usage and much harder debugging; DBCS leaves us with an unknown
string type (the program doesn't know anything about it). Both UCS-2/4 and UTF-8 mean
you're doing a bunch of conversion if you want full MBCS support; UCS-2/4 means
you're doing conversions for all I/O; and MBCS means iterating in reverse is slow.

I'd probably take the C++ route if I wanted full MBCS support: keep strings in their
native format whenever possible, convert when needed for speed, and special-case
things like reverse iterators for UTF-8. This isn't really appropriate for small
projects, though. Too bad C++'s own string class sucks.

> You can do that with xIUA and ICU. In fact you might want to use the same sort of
> support with glibc. That way if you want to go to ICU later or port to another
> platform you only have one piece of code to change.

> xIUA supports different encodings dynamically. You can have a routine that gets
> called with EUC-JP, UTF-8 or UTF-32 data and they all are handled correctly. You
> can also invoke the UTF-8 support explicitly and save the overhead of checking to
> see which routine to call. If you are communicating with browsers, for example,
> they don't all support UTF-8 properly. It even has a bonus for HTML and XML in that
> you can tell the converter that any character that does not convert will
> automatically be converted to an NCR sequence. This way you can send Japanese with
> the iso-8859-1 code page and not lose a character.

Sounds like the above. Sounds like it might be fairly big, too, which is annoying for
most small- and medium-sized projects--most OSS developers are very hesitant to add a
major dependency, especially the cross-platform ones (where this will often mean
shipping binary packages along with the runtimes for the dependency).
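For what it's worth, the reverse-iterator special case for UTF-8 isn't actually that
bad: continuation bytes are self-identifying (10xxxxxx), so stepping backwards costs
at most a few extra byte tests, unlike legacy DBCS where you may have to rescan from
the start of the string to tell a trail byte from a lead byte. A sketch (u8_prev is a
hypothetical helper, not from any real API):

```c
#include <stddef.h>

/* Step a UTF-8 "reverse iterator" back one character: back up one
 * byte, then keep backing up over continuation bytes (10xxxxxx)
 * until we land on the lead byte of the previous character. */
const char *u8_prev(const char *start, const char *p)
{
    if (p <= start)
        return NULL;                 /* already at the start */
    do {
        p--;
    } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
    return p;
}
```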
--
Glenn Maynard

Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
