Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> All UTF encodings (including the SCSU compressed encoding, or BOCU-8
> which is a variant of UTF-8, or also now the GB18030 Chinese standard
> which is now a valid representation of Unicode) have their pros and
> cons.

UTFs by definition are stateless and have exactly one valid
representation for each code point. So SCSU, much as I like it, is not
a UTF. BOCU-1 is also not a UTF, and in particular there is no
conceivable way it can be regarded as "a variant of UTF-8."

I have no idea what "BOCU-8" is. Maybe that one really is a variant of
UTF-8.

Though not promulgated by Unicode, GB18030 can be considered a UTF,
since it is really just a mapping from Unicode code points to sequences
of 1, 2, or 4 bytes.

Later:

> SCSU is excellent for immutable strings, and is a *very* tiny overhead
> above ISO-8859-1 (note that the conversion from ISO-8859-1 to SCSU is
> extremely trivial, may be even simpler than to UTF-8!)

An ISO 8859-1 string that contains no controls except NUL, CR, LF, and
Tab is *already* in SCSU. No conversion needed.

I appreciate Philippe's support of SCSU, but I don't think *even I*
would recommend it as an internal storage format. The effort to encode
and decode it, while by no means as Herculean as is often perceived, is
not trivial once you step outside Latin-1.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
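
P.S. Two of the claims above are easy to poke at in code, so here is a
rough Python sketch. The helper names are made up for illustration; only
the "latin-1" and "gb18030" codecs come from the Python standard library.
The first function does not decode SCSU (Python ships no SCSU codec). It
only tests the condition stated above, on the assumption that the C0
control bytes other than NUL, Tab, LF, and CR are the ones SCSU reserves
as tags in its initial single-byte mode. The second function just counts
how many bytes GB18030 spends on one code point.

```python
# Hypothetical helpers; only the codecs themselves are standard library.

# C0 control bytes that SCSU passes through unchanged: NUL, Tab, LF, CR.
PASS_THROUGH = frozenset({0x00, 0x09, 0x0A, 0x0D})


def latin1_is_already_scsu(data: bytes) -> bool:
    """True if no byte is a C0 control other than NUL, Tab, LF, or CR.

    Assumption: the remaining C0 bytes are SCSU tag bytes, and every
    other byte value can be copied through as-is.
    """
    return all(b >= 0x20 or b in PASS_THROUGH for b in data)


def gb18030_length(ch: str) -> int:
    """Number of bytes GB18030 uses to encode one code point."""
    return len(ch.encode("gb18030"))


if __name__ == "__main__":
    # Plain Latin-1 text with a CR LF line ending.
    print(latin1_is_already_scsu("café au lait\r\n".encode("latin-1")))
    # Contains ESC (0x1B), which falls in the tag range.
    print(latin1_is_already_scsu(b"\x1b[1mbold"))
    # An ASCII letter, a common Han character, and the first
    # supplementary code point.
    for ch in ("A", "\u4e2d", "\U00010000"):
        print(f"U+{ord(ch):04X}", gb18030_length(ch))
```

With the three sample characters, the lengths should come out as 1, 2,
and 4 bytes, matching the "1, 2, or 4" description above.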