On Wed, Oct 31, 2001 at 05:04:44PM -0800, Kenneth Whistler wrote:
> And before going on, I'm not clear exactly what you are
> trying to do. SCSU is defined on UTF-16 text.

Why do you say that? I can't find the phrase "UTF-16" in UTS-6. It says
that it's "a compression scheme for Unicode" and that "[SCSU] is mainly
intended for use with short to medium length Unicode strings." I noticed
that the sample strings are in UTF-16, and count surrogate pairs as two
characters (I think; for 9.4, I count 17 characters counting pairs as
one and 19 counting them as two, whereas the text claims 20), but I
think that's merely informative anyway.

All the SCSU pieces I've written work directly from UTF-32. I'll admit I
haven't done much checking with other encoders/decoders, but my decoder
can handle all the sample strings correctly, as well as everything my
encoders put out.

> > UTF-32: Since all characters (including any necessary state changes)
> > can be encoded in four characters, and four characters would be
>                          ^bytes               ^bytes

Yes, sorry.

> I don't understand this analysis. The worst case for SCSU is always
> UTF-16 length + 1 byte. This because if any garden path down the
> heuristics leads to further expansions, you can always represent the
> text as:
>
> SCU + (the rest of the text in Unicode)

Section 5.2.1: "Each reserved tag value collides with 256 Unicode
characters." If you do that and have private use values in your UTF-16
string, decoding the SCSU will produce a different text.

> Here, you are saying that if I have a UTF-8 string 0x01 0x01 0x01 0x01...
> I'd have to represent it in SCSU as 0x0F 0x00 0x01 0x00 0x01 0x00 0x01...?
> (Actually NULs themselves would not be a problem, since they are passed
> as single bytes 0x00.)

Right. I was thinking of SQ0 0x01 SQ0 0x01 . . . but it's the same idea.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I saw a daemon stare into my face, and an angel touch my breast; each
one softly calls my name . . . the daemon scares me less."
- "Disciple", Stuart Davis
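
To make the two worst cases discussed above concrete, here is a minimal
Python sketch. It is not taken from any encoder mentioned in this
thread. The function names scsu_unicode_fallback and scsu_quote_controls
are hypothetical, and the tag values (SCU = 0x0F, SQ0 = 0x01, UQU = 0xF0,
with Unicode-mode tags occupying lead bytes 0xE0 through 0xF2) are
assumed from UTS #6 and should be checked against the spec.

```python
# Tag values assumed from UTS #6; verify against the spec before use.
SCU = 0x0F  # single-byte mode: switch to Unicode mode
SQ0 = 0x01  # single-byte mode: quote one character from static window 0
UQU = 0xF0  # Unicode mode: quote the next UTF-16 code unit

# Bytes below 0x20 that single-byte mode passes through unchanged.
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}


def scsu_unicode_fallback(text: str) -> bytes:
    """Hypothetical helper: SCU followed by big-endian UTF-16.

    Any code unit whose lead byte falls in the assumed tag range
    0xE0..0xF2 (that is, U+E000..U+F2FF) is prefixed with UQU so the
    decoder does not read it as a tag.
    """
    out = bytearray([SCU])
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        hi, lo = utf16[i], utf16[i + 1]
        if 0xE0 <= hi <= 0xF2:
            out.append(UQU)
        out += bytes([hi, lo])
    return bytes(out)


def scsu_quote_controls(text: str) -> bytes:
    """Hypothetical helper: single-byte mode for U+0000..U+007F only.

    Control characters other than NUL, TAB, LF and CR are written as
    SQ0 followed by the character itself.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0x7F:
            raise ValueError("this sketch only handles U+0000..U+007F")
        if cp >= 0x20 or cp in PASS_THROUGH:
            out.append(cp)
        else:
            out += bytes([SQ0, cp])
    return bytes(out)


if __name__ == "__main__":
    print(scsu_unicode_fallback("A\ue000").hex(" "))
    print(scsu_quote_controls("\x01\x01").hex(" "))
```

Under these assumptions, a run of U+0001 costs about two bytes per
character by either route, while text in U+E000..U+F2FF costs three
bytes per code unit in Unicode mode, which is more than UTF-16 length
+ 1.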

