On 5/17/11 3:29 PM, Wes Garland wrote:
But the point remains, the FAQ entry you quote talks about encoding a
lone surrogate, i.e. a code unit, which is not a complete code point.
You can only convert complete code points from one encoding to another.
Just like you can't represent part of a UTF-8 code sub-sequence in any
other encoding. The fact that code point X is not representable in
UTF-16 has no bearing on its status as a code point, nor its
convertibility to UTF-8. The problem is that UTF-16 cannot represent
all possible code points.
My point is that neither can UTF-8. Can you name an encoding that _can_
represent the surrogate-range codepoints?
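For what it's worth, this is easy to check against any strict codec implementation. A minimal sketch, assuming Python 3 (my choice for illustration, nothing in this thread depends on it) and a made-up helper name; it asks each encoding form whether its strict encoder accepts the lone low surrogate U+DC08:

```python
# Hypothetical helper: which strict encoders accept the given text?
LONE_SURROGATE = "\udc08"          # a surrogate code point, not a scalar value
ASTRAL_CHARACTER = "\U0001F600"    # an ordinary scalar value outside the BMP


def strict_encodings_that_accept(text):
    """Return the encoding forms whose strict encoder accepts `text`."""
    accepted = []
    for name in ("utf-8", "utf-16-le", "utf-32-le"):
        try:
            text.encode(name)  # errors="strict" is the default
        except UnicodeEncodeError:
            continue
        accepted.append(name)
    return accepted


print(strict_encodings_that_accept(LONE_SURROGATE))    # []
print(strict_encodings_that_accept(ASTRAL_CHARACTER))  # all three forms
```

None of the three forms accepts the surrogate, while all three accept the astral character.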
From page 90 of the Unicode 6.0 specification, in the Conformance chapter:
/D80 Unicode string:/ A code unit sequence containing code units of a
particular Unicode encoding form.
• In the rawest form, Unicode strings may be implemented simply as
arrays of the appropriate integral data type, consisting of a sequence
of code units lined up one immediately after the other.
• A single Unicode string must contain only code units from a single
Unicode encoding form. It is not permissible to mix forms within a
string.
Not sure what "(D80)" is supposed to mean.
Sorry, "(D80)" means "per definition D80 of The Unicode Standard,
Version 6.0"
Ah, ok. So the problem there is that this definition only makes
sense when a particular Unicode encoding form has been chosen. Which
Unicode encoding form have we chosen here?
But note also that D76 in that same document says:
Unicode scalar value: Any Unicode code point except high-surrogate
and low-surrogate code points.
and D79 says:
A Unicode encoding form assigns each Unicode scalar value to a unique
code unit sequence.
and
To ensure that the mapping for a Unicode encoding form is
one-to-one, all Unicode scalar values, including those
corresponding to noncharacter code points and unassigned code
points, must be mapped to unique code unit sequences. Note that
this requirement does not extend to high-surrogate and
low-surrogate code points, which are excluded by definition from
the set of Unicode scalar values.
In particular, this makes it clear (to me, at least) that whatever
Unicode encoding form you choose, a "Unicode string" can only consist of
code units encoding Unicode scalar values, which does NOT include high
and low surrogates.
Therefore I stand by my statement: if you allow what to me looks like
arrays of "UTF-32 code units and also values that fall into the surrogate
ranges" then you don't get Unicode strings. You get a set of arrays
that contains Unicode strings as a proper subset.
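The subset relationship can be sketched in a few lines. This assumes Python and two hypothetical function names; the only thing taken from the spec is the D76 exclusion of the surrogate range:

```python
def is_scalar_value(code_point):
    """D76: any Unicode code point except the surrogates U+D800..U+DFFF."""
    in_code_space = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_code_space and not is_surrogate


def is_utf32_unicode_string(units):
    """True only if every 32-bit unit encodes a Unicode scalar value."""
    return all(is_scalar_value(unit) for unit in units)


print(is_utf32_unicode_string([0x48, 0x1F600]))  # True
print(is_utf32_unicode_string([0x48, 0xDC08]))   # False
```

Every array that passes the check is also an "array of code points", but not the other way around.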
OK, that seems like a breaking change.
Yes, I believe it would be, certainly if done naively, but I am hopeful
somebody can figure out how to overcome this.
As long as we worry about that _before_ enshrining the result in a spec,
I'm all for being hopeful.
Maybe, and maybe not. We (Mozilla) have had some proposals to
actually use UTF-8 throughout, including in the JS engine; it's
quite possible to implement an API that looks like a 16-bit array on
top of UTF-8 as long as you allow invalid UTF-8 that's needed to
represent surrogates and the like.
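To give a rough idea of the shape of such an API, here is a deliberately naive sketch in Python. It is not the Mozilla proposal; the class name is made up, and it leans on Python's "surrogatepass" error handler to stand in for the "invalid UTF-8" storage:

```python
import struct


class Utf8BackedString:
    """Hypothetical string stored as UTF-8-ish bytes, viewed as 16-bit units."""

    def __init__(self, storage):
        # `storage` may hold 3-byte sequences for lone surrogates,
        # which strict UTF-8 forbids.
        self._storage = storage

    def code_units(self):
        # Decode leniently, then re-encode as 16-bit units. A real
        # implementation would index lazily instead of converting everything.
        text = self._storage.decode("utf-8", "surrogatepass")
        raw = text.encode("utf-16-le", "surrogatepass")
        return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


print([hex(u) for u in Utf8BackedString(b"\xed\xb0\x88").code_units()])
# ['0xdc08']
print([hex(u) for u in Utf8BackedString("a\U0001F600".encode("utf-8")).code_units()])
# ['0x61', '0xd83d', '0xde00']
```

The caller only ever sees 16-bit values, whatever the backing store looks like.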
I understand by this that in the Moz proposals, you mean that the
"invalid" UTF-8 sequences are actually valid UTF-8 Strings which encode
code points in the range 0xd800-0xdfff
There are no such valid UTF-8 strings; see spec quotes above. The
proposal would have involved having invalid pseudo-UTF-ish strings.
and that these code points were
translated directly (and purposefully incorrectly) as UTF-16 code units
when viewed as 16-bit arrays.
Yep.
If JS Strings were arrays of Unicode code points, this conversion would
be a non-issue; UTF-8 sequence 0xed 0xb0 0x88 becomes Unicode code point
0xdc08, with no incorrect conversion taking place.
Sorry, no. See above.
The only problem is
if there is an intermediate component somewhere that insists on using
UTF-16... at that point we just can't represent code point 0xdc08 at all.
I just don't get it. You can stick the invalid 16-bit value 0xdc08 into
a "UTF-16" string just as easily as you can stick the invalid 24-bit
sequence 0xed 0xb0 0x88 into a "UTF-8" string. Can you please, please
tell me what made you decide there's _any_ difference between the two
cases? They're equally invalid in _exactly_ the same way.
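A quick way to see the symmetry, again assuming Python's strict decoders as the referee and a hypothetical helper name: the bytes fit into either array without complaint, and both decoders reject them.

```python
def is_well_formed(data, encoding):
    """Return True if `data` decodes under the strict decoder for `encoding`."""
    try:
        data.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# The 24-bit sequence in a "UTF-8" byte array.
print(is_well_formed(b"\xed\xb0\x88", "utf-8"))   # False
# The 16-bit value 0xdc08 in a little-endian "UTF-16" array.
print(is_well_formed(b"\x08\xdc", "utf-16-le"))   # False
```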
-Boris
_______________________________________________
es-discuss mailing list
https://mail.mozilla.org/listinfo/es-discuss