Re: [bitc-dev] BitC 0.20: Unicode

Jonathan S. Shapiro Wed, 10 Mar 2010 14:56:40 -0800

On Wed, Mar 10, 2010 at 1:47 PM, Raoul Duke <[email protected]> wrote:


> i think the use of "string" is overloaded and confuses me, if not
> anybody else. i'd say there is no unit smaller than a sequence.

String is definitely overloaded, because it has interpretations at
multiple levels of abstraction. Assuming that a unicode string is
well-formed, it can be viewed simultaneously as:

1. A sequence of code units of size UCS-1 (UTF8 encoding).
2. A sequence of code units of size UCS-2 (which *may* be UTF16 encoding).
3. A sequence of code units of size UCS-4 (UTF32 encoding).
4. A sequence of code points. In unicode, all code points can be
represented as UCS-4, so this is equivalent to [3].
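
To make views [1] through [4] concrete, here is a small Python sketch
(Python is used only for illustration; nothing here is BitC code, and
the sample string is my own choice). It encodes one string three ways
and counts the code units in each view.

```python
# Illustrative only: one well-formed string, four views of it.
# 'a' (U+0061), e-acute (U+00E9), and U+1F600, which lies outside the BMP.
s = "a\u00e9\U0001F600"

# View [1]: 8-bit code units (UTF-8).
utf8_units = list(s.encode("utf-8"))

# View [2]: 16-bit code units (UTF-16, little-endian, no BOM).
raw16 = s.encode("utf-16-le")
utf16_units = [int.from_bytes(raw16[i:i + 2], "little")
               for i in range(0, len(raw16), 2)]

# View [3]: 32-bit code units (UTF-32, little-endian, no BOM).
raw32 = s.encode("utf-32-le")
utf32_units = [int.from_bytes(raw32[i:i + 4], "little")
               for i in range(0, len(raw32), 4)]

# View [4]: code points.
code_points = [ord(c) for c in s]

print(len(utf8_units))             # 7  (1 + 2 + 4)
print(len(utf16_units))            # 4  (1 + 1 + 2)
print(len(utf32_units))            # 3
print(code_points == utf32_units)  # True: [4] is equivalent to [3]
```

The same index therefore names a different position in each view,
which is why the choice of default indexing matters.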

Additionally, if the unicode normalization of the string is known, it
can be viewed as:

5. A sequence of characters, subject to the assumption that the choice
of normalization within the string is known. Note that there is no
fixed-size scalar representation for characters, so operations on
characters tend to be viewed in Unicode as operations from strings to
strings.
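
A short Python sketch of why the normalization has to be known before
[5] makes sense. It uses the standard unicodedata module; the example
character is my own choice. The same user-visible character has two
different code point sequences, and they only compare equal once both
sides are normalized the same way.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# As code point sequences the two differ in length and content.
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False

# Once a normalization form is fixed, they agree.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)                 # True
print(nfd == decomposed)               # True
```

Note that normalize() maps a string to a string, which matches the
point above that character operations end up being string-to-string.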


Which position the user perceives is largely a consequence of how the
default method of string indexing is defined. This in turn tends to
drive the choice of representation.

CLR/JVM System.String takes the position:
    - UCS-2 code units are called "char".
    - The default indexing operator s[] operates according to
      interpretation [2] above.
    - The representation-level encoding is not specified, but is
      commonly "sequence of UCS-2".
    - Because the string construction mechanisms in C#/JVM do not
      guarantee that a string is well-formed, strings in C#/JVM are
      "sequence of UCS-2 code units", as opposed to "sequence of code
      points encoded via UTF-16 into UCS-2 code units".
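
The difference between "sequence of 16-bit code units" and
"well-formed UTF-16" can be sketched in Python as follows. This
emulates code-unit indexing; it is not CLR or JVM code, and the helper
names utf16_units and is_well_formed_utf16 are hypothetical.

```python
def utf16_units(s):
    """Return the 16-bit code units of s (lone surrogates allowed)."""
    raw = s.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]

def is_well_formed_utf16(units):
    """True if every surrogate in units is part of a high/low pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate: needs a low one next
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:  # low surrogate with no preceding high
            return False
        i += 1
    return True

units = utf16_units("x\U0001F600")  # 'x' plus one non-BMP code point
half = units[:2]                    # slicing by code unit splits the pair

print([hex(u) for u in units])      # ['0x78', '0xd83d', '0xde00']
print(is_well_formed_utf16(units))  # True
print(is_well_formed_utf16(half))   # False
```

Indexing at position 1 yields 0xD83D, which is a code unit but not a
code point that stands for anything by itself.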

BitC 0.10 strings take the position:
    - Code points (which are 1:1 with UCS-4 code units) are called "char".
    - The default indexing operator s[] operates according to
      interpretation [4], which is operationally interchangeable with
      interpretation [3].
    - The representation-level encoding is not specified, but in the
      current implementation is UTF-8. BitC does not expose a code unit
      indexing operation, so this representation is not part of the
      contract.
    - A BitC string is well-formed in the sense that it consists
      semantically of a sequence of code points.
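
For illustration, here is one way code point indexing can sit on top
of a UTF-8 representation without exposing code units. This is a
Python sketch under my own assumptions, not the BitC runtime; the
names seq_len and code_point_at are hypothetical, and the input is
assumed to be well-formed UTF-8.

```python
def seq_len(lead):
    """Length in bytes of the UTF-8 sequence that starts with lead."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def code_point_at(utf8, index):
    """Return the index-th code point of well-formed UTF-8 bytes."""
    pos = 0
    for _ in range(index):  # skip whole sequences: O(n) scan
        pos += seq_len(utf8[pos])
    n = seq_len(utf8[pos])
    return ord(utf8[pos:pos + n].decode("utf-8"))

buf = "a\u00e9\U0001F600z".encode("utf-8")  # 1 + 2 + 4 + 1 bytes
print(hex(code_point_at(buf, 2)))           # 0x1f600
print(hex(code_point_at(buf, 3)))           # 0x7a
```

The caller only ever sees code points, so the byte layout could be
swapped for UTF-16 or UTF-32 without changing the results. The cost is
that a naive implementation like this one makes indexing linear rather
than constant time.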

Neither system speaks to the question of Unicode normalization
requirements, so neither system can correctly perform indexing or
selection over characters.
