> - Are there other reasonable ways?
> - Is it good to use UTF-32 on Unix and UTF-16 on Windows?
I'd advise against it. There is really no advantage to using UTF-32 anymore. It won't tell you where it's safe to break strings, you can still have invalid sequences, some characters will still require multiple code points (a base letter plus combining marks, for example), and you will almost certainly still need byte streams for I/O and serialization. You'll possibly have to deal with endianness problems too if you want portability.

It's far easier to use UTF-8, even on Windows, imo. Most of the time you can just ignore the fact that it's UTF-8 and treat it as an old-fashioned ASCII string. When you need to do one of the operations that require special handling, it's no more work than you'd have to do for UTF-32.

I would generally advise against doing what Perl does: the decision to automatically consider all I/O to be text mode, with clumsy automatic conversion, has been a massive bug generator. It's better to treat I/O as binary and let the user call a conversion function if desired.

Another suggestion for your language: keep the source code in the same encoding as your strings. If you want a UTF-32 language, then require all source code to be in that same encoding as well. Allow identifiers, comments, and literals to all be in that encoding so you don't get stuck in C++'s situation, where you have to use English for almost everything. (I personally think C++ should allow UTF-8 everywhere, with no compiler-enforced normalization form whatsoever.)

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
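To illustrate the claims above about UTF-32 and UTF-8, here is a minimal Python sketch. It is only an illustration under stated assumptions: the helper name `utf32_units` and the sample strings are made up for this example, and Python is used simply because its standard codecs make the byte layouts easy to inspect.

```python
import unicodedata


def utf32_units(text: str) -> int:
    """Hypothetical helper: count the 32-bit units needed to store text as UTF-32 (no BOM)."""
    return len(text.encode("utf-32-le")) // 4


# One visible character can still take several code points in UTF-32:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# UTF-32 has a byte-order problem that UTF-8 does not have.
little_endian = "A".encode("utf-32-le")
big_endian = "A".encode("utf-32-be")

# ASCII-oriented operations work directly on UTF-8 bytes, because every byte
# of a multi-byte UTF-8 sequence is >= 0x80 and can never collide with "=".
line = "na\u00efve=caf\u00e9".encode("utf-8")
key, value = line.split(b"=")

if __name__ == "__main__":
    print(utf32_units(decomposed), utf32_units(composed))
    print(little_endian, big_endian)
    print(key.decode("utf-8"), value.decode("utf-8"))
```

The split on `b"="` never has to decode anything, which is the sense in which UTF-8 can usually be handled like an old-fashioned ASCII string.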
