On 5/17/11 3:29 PM, Wes Garland wrote:
But the point remains, the FAQ entry you quote talks about encoding a
lone surrogate, i.e. a code unit, which is not a complete code point.
You can only convert complete code points from one encoding to another.
Just like you can't represent part of a UTF-8 code sub-sequence in any
other encoding. The fact that code point X is not representable in
UTF-16 has no bearing on its status as a code point, nor its
convertability to UTF-8.  The problem is that UTF-16 cannot represent
all possible code points.

My point is that neither can UTF-8. Can you name an encoding that _can_ represent the surrogate-range codepoints?
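
One way to check that claim is to ask a strict codec for each encoding form to encode a surrogate code point. The sketch below is only an illustration in Python 3, relying on the default "strict" error handling of CPython's codecs; the helper name strict_encodable is made up for this example.

```python
def strict_encodable(code_point: int, encoding: str) -> bool:
    """Return True if a strict encoder accepts this code point on its own."""
    try:
        chr(code_point).encode(encoding)  # default errors="strict"
        return True
    except UnicodeEncodeError:
        return False


ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# U+DC08 is a low-surrogate code point; U+1F600 is an ordinary scalar value.
surrogate_results = {enc: strict_encodable(0xDC08, enc) for enc in ENCODINGS}
scalar_results = {enc: strict_encodable(0x1F600, enc) for enc in ENCODINGS}

print(surrogate_results)
print(scalar_results)
```

Under those assumptions, all three encoders refuse the surrogate and accept the scalar value, so UTF-16 is not special in this respect.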

From page 90 of the Unicode 6.0 specification, in the Conformance chapter:

    /D80 Unicode string:/ A code unit sequence containing code units of
    a particular Unicode encoding form.
    • In the rawest form, Unicode strings may be implemented simply as
      arrays of the appropriate integral data type, consisting of a
      sequence of code units lined up one immediately after the other.
    • A single Unicode string must contain only code units from a single
      Unicode encoding form. It is not permissible to mix forms within a
      string.



    Not sure what "(D80)" is supposed to mean.


Sorry, "(D80)" means "per definition D80 of The Unicode Standard,
Version 6.0"

Ah, ok. So the problem there is that this definition only makes sense when a particular Unicode encoding form has been chosen. Which Unicode encoding form have we chosen here?

But note also that D76 in that same document says:

  Unicode scalar value: Any Unicode code point except high-surrogate
                        and low-surrogate code points.

and D79 says:

  A Unicode encoding form assigns each Unicode scalar value to a unique
  code unit sequence.

and

  To ensure that the mapping for a Unicode encoding form is
  one-to-one, all Unicode scalar values, including those
  corresponding to noncharacter code points and unassigned code
  points, must be mapped to unique code unit sequences. Note that
  this requirement does not extend to high-surrogate and
  low-surrogate code points, which are excluded by definition from
  the set of Unicode scalar values.

In particular, this makes it clear (to me, at least) that whatever Unicode encoding form you choose, a "Unicode string" can only consist of code units encoding Unicode scalar values, which does NOT include high and low surrogates.

Therefore I stand by my statement: if you allow what to me looks like arrays of "UTF-32 code units and also values that fall into the surrogate ranges" then you don't get Unicode strings. You get a set of arrays that contains Unicode strings as a proper subset.
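
To illustrate the proper-subset point, here is a minimal Python sketch of the D76 definition applied to arrays of 32-bit values. The function names is_scalar_value and is_utf32_string are hypothetical, chosen only for this example, and the check reflects my reading of the definitions quoted above.

```python
def is_scalar_value(code_point: int) -> bool:
    """D76: any code point except high- and low-surrogate code points."""
    in_code_space = 0x0000 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_code_space and not is_surrogate


def is_utf32_string(units) -> bool:
    """True if every 32-bit unit encodes a Unicode scalar value."""
    return all(is_scalar_value(unit) for unit in units)


well_formed = [0x48, 0x69, 0x1F600]   # "Hi" followed by U+1F600
with_surrogate = [0x48, 0x69, 0xDC08]  # same prefix plus a lone surrogate

print(is_utf32_string(well_formed))
print(is_utf32_string(with_surrogate))
```

Both lists are arrays of integers in the code point range, but only the first passes the check, which is the sense in which the Unicode strings are a proper subset of such arrays.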

    OK, that seems like a breaking change.

Yes, I believe it would be, certainly if done naively, but I am hopeful
somebody can figure out how to overcome this.

As long as we worry about that _before_ enshrining the result in a spec, I'm all for being hopeful.

    Maybe, and maybe not.  We (Mozilla) have had some proposals to
    actually use UTF-8 throughout, including in the JS engine; it's
    quite possible to implement an API that looks like a 16-bit array on
    top of UTF-8 as long as you allow invalid UTF-8 that's needed to
    represent surrogates and the like.


I understand by this that in the Moz proposals, you mean that the
"invalid" UTF-8 sequences are actually valid UTF-8 Strings which encode
code points in the range 0xd800-0xdfff

There are no such valid UTF-8 strings; see spec quotes above. The proposal would have involved having invalid pseudo-UTF-ish strings.

and that these code points were
translated directly (and purposefully incorrectly) as UTF-16 code units
when viewed as 16-bit arrays.

Yep.
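
A rough Python sketch of the idea of exposing 16-bit code units on top of lenient UTF-8 storage follows. It is not the actual Mozilla proposal or any engine's design: the function name utf16_view is hypothetical, it uses CPython's "surrogatepass" error handler to tolerate the invalid sequences, and it converts eagerly rather than providing an indexable view.

```python
def utf16_view(lenient_utf8: bytes) -> list:
    """Return the 16-bit code units a JS-style string API would expose.

    The input may contain the invalid three-byte sequences that stand in
    for lone surrogates; "surrogatepass" lets them through unchanged.
    """
    text = lenient_utf8.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    # Split the little-endian byte stream into 16-bit integers.
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# "A" followed by the invalid sequence ED B0 88 discussed in this thread.
print([hex(unit) for unit in utf16_view(b"A\xed\xb0\x88")])

# A valid supplementary character comes out as a surrogate pair.
print([hex(unit) for unit in utf16_view("\U0001F600".encode("utf-8"))])
```

Neither the stored bytes nor the resulting 16-bit array is guaranteed to be a well-formed Unicode string in this scheme; the two representations are simply kept in lockstep.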

If JS Strings were arrays of Unicode code points, this conversion would
be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
0xdc08, with no incorrect conversion taking place.

Sorry, no.  See above.

The only problem is
if there is an intermediate component somewhere that insists on using
UTF-16..at that point we just can't represent code point 0xdc08 at all.

I just don't get it. You can stick the invalid 16-bit value 0xdc08 into a "UTF-16" string just as easily as you can stick the invalid 24-bit sequence 0xed 0xb0 0x88 into a "UTF-8" string. Can you please, please tell me what made you decide there's _any_ difference between the two cases? They're equally invalid in _exactly_ the same way.
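
The symmetry can be sketched in Python 3. The helper utf8_three_byte_pattern mechanically applies the three-byte UTF-8 bit layout without any validity check, and strictly_decodable asks CPython's strict decoders for a verdict. Both names are hypothetical, and the lenient "surrogatepass" handler is used only to show that the two invalid forms round-trip to the same value.

```python
def utf8_three_byte_pattern(value: int) -> bytes:
    """Apply the 1110xxxx 10xxxxxx 10xxxxxx layout with no validity check."""
    if not 0x0800 <= value <= 0xFFFF:
        raise ValueError("value does not fit the three-byte layout")
    return bytes([
        0xE0 | (value >> 12),          # top 4 bits
        0x80 | ((value >> 6) & 0x3F),  # middle 6 bits
        0x80 | (value & 0x3F),         # low 6 bits
    ])


def strictly_decodable(data: bytes, encoding: str) -> bool:
    """Return True if a strict decoder accepts the byte sequence."""
    try:
        data.decode(encoding)  # default errors="strict"
        return True
    except UnicodeDecodeError:
        return False


as_utf8 = utf8_three_byte_pattern(0xDC08)   # the bytes ED B0 88
as_utf16 = (0xDC08).to_bytes(2, "little")   # the single code unit DC08

print(as_utf8.hex(), strictly_decodable(as_utf8, "utf-8"))
print(as_utf16.hex(), strictly_decodable(as_utf16, "utf-16-le"))

# A lenient handler accepts both, and both yield the same lone surrogate.
same = (as_utf8.decode("utf-8", "surrogatepass")
        == as_utf16.decode("utf-16-le", "surrogatepass"))
print(same)
```

Strict decoding rejects both sequences and lenient decoding accepts both, so under these assumptions neither encoding form handles the surrogate any better than the other.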

-Boris
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
