Re: Worst case scenarios on SCSU

Markus Scherer Fri, 02 Nov 2001 10:05:26 -0800

Dear fellow SCSU enthusiasts!

SCSU wanders wondrous worlds between CES (Character Encoding Scheme) and TES (Transfer 
Encoding Syntax).
But few people care: it is a way to get Unicode into and out of a byte stream, and 
as such it qualifies as a "charset" as the term is used in Internet protocols. (A charset 
is defined as a method to get text _out_ of a byte stream.)


SCSU is registered as an IANA charset.

In the ICU implementation of the SCSU converter, I believe the worst case is 3 bytes 
per 16-bit code unit (UTF-16). Its output actually gets really close to the compressed 
sizes of the samples in UTS 6, but it is limited, mostly because we allow buffering with 
arbitrarily small input/output buffer sizes. We cannot assume that we will see the 
entire text at once, or even more than one byte/code unit at a time. Still, it works 
quite well, though not optimally, and I tried to write it for good performance.
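
As a rough illustration of sizing an output buffer from the 3-bytes-per-code-unit 
figure above, here is a small Python sketch. It is not ICU code, the function name is 
made up for this example, and the bound is only the one I believe holds for the ICU 
converter, not a guarantee for every SCSU encoder.

```python
def worst_case_scsu_bytes(text: str) -> int:
    """Estimate an upper bound on SCSU output size for `text`.

    Assumption (from the discussion above, not from a spec): the encoder
    never emits more than 3 bytes per UTF-16 code unit.
    """
    # Python strings are sequences of code points, so count UTF-16 code
    # units by encoding to UTF-16 (little-endian, no BOM) and halving
    # the resulting byte count.
    utf16_units = len(text.encode("utf-16-le")) // 2
    return 3 * utf16_units


print(worst_case_scsu_bytes("abc"))         # 3 code units -> 9
print(worst_case_scsu_bytes("\U0001F600"))  # surrogate pair, 2 code units -> 6
```

Note that a supplementary character counts as two code units here, so the estimate 
for it is 6 bytes.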

As a theoretical maximum for the output length, the answer is of course "unlimited" 
for pathological converters. This is because you can write an arbitrary number of 
useless state changes, like SC0 SC0 SC0 ... without encoding anything.
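
To make the "unlimited" case concrete, here is a toy Python sketch of such a 
pathological encoder, together with a decoder for just the tiny subset of SCSU it 
uses. Both function names are invented for this example, and this is nowhere near a 
full SCSU implementation: it handles only printable ASCII and the SC0 tag (byte 0x10 
in single-byte mode). It relies on dynamic window 0 already being active in the 
initial state, so each SC0 changes nothing.

```python
SC0 = 0x10  # SCSU single-byte-mode tag: select dynamic window 0


def pathological_encode(text: str, padding: int) -> bytes:
    """Encode printable-ASCII text, preceded by `padding` useless SC0 tags."""
    if not all(0x20 <= ord(c) <= 0x7E for c in text):
        raise ValueError("this sketch only handles printable ASCII")
    # In single-byte mode, printable ASCII bytes stand for themselves.
    return bytes([SC0]) * padding + text.encode("ascii")


def decode_ascii_subset(data: bytes) -> str:
    """Decode the subset produced above: SC0 tags and printable ASCII only."""
    out = []
    for b in data:
        if b == SC0:
            # Re-selecting the already active window produces no output.
            continue
        if 0x20 <= b <= 0x7E:
            out.append(chr(b))
        else:
            raise ValueError(f"byte 0x{b:02X} is outside this sketch's subset")
    return "".join(out)


encoded = pathological_encode("hi", 1000)
print(len(encoded))                  # 1002 bytes for 2 characters
print(decode_ascii_subset(encoded))  # hi
```

The padding count can be made as large as you like, which is why no finite bound 
exists unless you restrict yourself to sensible encoders.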

markus
