Re: Worst case scenarios on SCSU

DougEwell2 Wed, 31 Oct 2001 22:34:35 -0800

It must be a full moon on Halloween, because here I am in the extremely 
unfamiliar position of disagreeing quite strongly with Ken Whistler.

In a message dated 2001-10-31 17:16:25 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

>  As current Czar of Names Rectification, I must start protesting
>  here. SCSU is a means of *compressing* Unicode text. It is
>  not "[an]other method of encoding Unicode characters."

I was about to reply, "Of course it is," before I realized that Ken was 
interpreting the word "encoding" in the strictest sense, invoking the 
distinction between character encoding forms (CEFs) and transfer encoding 
syntaxes (TESs).  In some cases this is a worthwhile distinction, but I don't 
think it is relevant in the case of David's query, or, for that matter, in 
many other cases where users may think of Unicode text being "represented" as 
UTF-32, UTF-16, UTF-8, SCSU, ASCII with UCN sequences, or even (God forbid) 
CESU-8.

SCSU is indeed another method of "representing" Unicode characters, if not 
necessarily "encoding" them in the strict sense of the word.

>  And before going on, I'm not clear exactly what you are
>  trying to do. SCSU is defined on UTF-16 text. It would, of
>  course, be possible to create SCSU-like windowing compression
>  schemes that would work on UTF-32 or UTF-8 text, but those are
>  not part of UTS #6 as it is currently written.

Like David, I don't see how SCSU is defined on, or limited to, UTF-16 text, 
except in the sense that literal or quoted "Unicode-mode" SCSU text is 
UTF-16.  SCSU is defined on Unicode scalar values, which are not tied to a 
particular CEF.

You can define a window in what SCSU calls "the expansion space" using the 
SDX or UDX tag and, in the best case, store N characters of Gothic or Deseret 
text in N + 3 bytes: one tag byte, a two-byte window argument, and then one 
byte per character.  None of this has anything to do with surrogates or 
16-bitness.
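
To make the N + 3 figure concrete, here is a minimal sketch in Python, not a 
full SCSU encoder.  It assumes the stream starts in single-byte mode, that 
every character falls in the same 128-code-point window of the expansion 
space, and it handles nothing else.  The function name and its "window" 
parameter are my own, purely for illustration.  Note that it works on scalar 
values throughout and never forms a surrogate.

```python
SDX = 0x0B  # single-byte-mode tag: define extended window and make it active


def scsu_encode_supplementary_run(text, window=0):
    """Encode a run of non-BMP characters that share one 128-code-point window.

    Illustrative helper only: real encoders also handle ASCII, BMP windows,
    Unicode mode, and window switching.
    """
    if not text:
        return b""
    if not 0 <= window <= 7:
        raise ValueError("SCSU has dynamic windows 0..7")

    scalars = [ord(ch) for ch in text]   # Unicode scalar values, no surrogates
    base = scalars[0] & ~0x7F            # window start, aligned to 0x80
    if base < 0x10000:
        raise ValueError("this sketch only covers the expansion space")
    if any((cp & ~0x7F) != base for cp in scalars):
        raise ValueError("all characters must fall in the same window")

    # Two-byte argument: top 3 bits pick the window, low 13 bits are the
    # offset in units of 0x80 above U+10000.
    offset = (base - 0x10000) >> 7
    arg = (window << 13) | offset

    out = bytearray([SDX, arg >> 8, arg & 0xFF])
    # Each character is then a single byte 0x80..0xFF relative to the window.
    out.extend(0x80 + (cp - base) for cp in scalars)
    return bytes(out)


gothic = "\U00010330\U00010331\U00010332"   # three Gothic letters
encoded = scsu_encode_supplementary_run(gothic)
print(encoded.hex(" "))                     # 0b 00 06 b0 b1 b2
print(len(encoded) == len(gothic) + 3)      # True
```

The input here is a Python string, which is a sequence of code points rather 
than UTF-16 code units, so the same bytes come out whether the text was 
originally held as UTF-8, UTF-16, or UTF-32.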

In a message dated 2001-10-31 17:59:33 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

>  I have no quarrel with the claim that the SCSU scheme could be
>  implemented directly on UTF-32 data. But as Unicode Technical Standard
>  #6 is currently written, that is not how to do it conformantly.

I have looked throughout UTS #6 and cannot find anything, explicit or 
implicit, to the effect that SCSU could not be conformantly implemented 
against UTF-32 data.  Sections 6.1.3 and 8.1 refer to how "surrogate pairs" 
may be encoded (*) in SCSU, but if you substitute the phrase "non-BMP 
characters" the meaning is identical.

(*) The word "encoded" was taken directly from UTS #6, section 8.1.

>  At the moment, if you want to compare SCSU-compressed text
>  against the UTF-32 form, you would have to convert the UTF-32
>  text to UTF-16, and then compress it using SCSU. You don't
>  apply SCSU directly to UTF-32 data.

Why not?  The fact that UTS #6 was originally written before UTF-32 was 
formally defined has nothing to do with this.  The same could be said for 
UTF-8, which (like SCSU) has a surrogate-free mechanism for representing 
non-BMP characters.
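
As a quick illustration of the "surrogate-free" point, the snippet below 
prints one Gothic letter in three encoding forms using Python's standard 
codecs.  Only the UTF-16 form contains a surrogate pair; UTF-8 and UTF-32 are 
derived from the scalar value directly.

```python
ch = "\U00010330"  # GOTHIC LETTER AHSA, a non-BMP character

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(utf8.hex(" "))    # f0 90 8c b0
print(utf16.hex(" "))   # d8 00 df 30  (surrogates D800 DF30)
print(utf32.hex(" "))   # 00 01 03 30
```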

>  It seems to me that a rewrite of SCSU would be in order to explicitly
>  allow and define UTF-32 implementations as well as UTF-16 implementations
>  of SCSU.

I don't see anything that needs rewriting.  What are you seeing?

-Doug Ewell
 Fullerton, California
