Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

Ken Whistler via Unicode Thu, 01 Jun 2017 19:25:51 -0700


On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote:

By definition D39b, either sequence of bytes, if encountered by a
conformant UTF-8 conversion process, would be interpreted as a
sequence of 6 maximal subparts of an ill-formed subsequence.

("D39b" is a typo for "D93b".)


Sorry about that. :)


Conformant with what?  There is no mandatory *requirement* for a UTF-8
conversion process conformant with Unicode to have any concept of
'maximal subpart'.

Conformant with the definition of UTF-8. I agree that nothing forces a conversion *process* to care anything about maximal subparts, but if *any* process using a conformant definition of UTF-8 then goes on to have any concept of "maximal subpart of an ill-formed subsequence" that departs from definition D93b in the Unicode Standard, then it is just making s**t up.

I don't see a good reason to build in special logic to treat FC 80 80
80 80 80 as somehow privileged as a unit for conversion fallback,
simply because *if* UTF-8 were defined as the Unix gods intended
(which it ain't no longer) then that sequence *could* be interpreted
as an out-of-bounds scalar value (which it ain't) on spec that the
codespace *might* be extended past 10FFFF at some indefinite time in
the future (which it won't).

Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
invalid sequence.

That would be equally true of FF FF FF FF FF FF. Which was my point, actually.
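
For readers who want to see the parallel between the two byte sequences, here is a minimal sketch. It assumes CPython 3's built-in UTF-8 decoder with errors="replace", which emits one U+FFFD per rejected subsequence; `replacement_count` is a hypothetical helper name, not anything from the Unicode Standard or the thread.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replacement_count(data: bytes) -> int:
    """Count the U+FFFD characters produced when decoding data as UTF-8."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# The two sequences discussed in the thread.
fc_seq = bytes.fromhex("FC 80 80 80 80 80")
ff_seq = bytes.fromhex("FF FF FF FF FF FF")

# Neither FC nor FF can begin a well-formed sequence, and a lone 80 cannot
# either, so every byte is replaced on its own in both cases.
print(replacement_count(fc_seq))  # 6
print(replacement_count(ff_seq))  # 6
```

Under that assumption the decoder gives FC 80 80 80 80 80 no special status as a unit: it is handled the same way as FF FF FF FF FF FF.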

   FC is not ASCII,

True, of course. But irrelevant. Because we are talking about UTF-8 here. And just because some non-UTF-8 character encoding happened to include 0xFC as a valid (or invalid) value, might not require any special case processing. A simple 8-bit to 8-bit conversion table could be completely regular in its processing of 0xFC for a conversion.

  and has more than one leading bit
set.  It has the six leading bits set,


True, of course.

  and therefore should start a
sequence of 6 characters.

That is completely false, and has nothing to do with the current definition of UTF-8.

The current, normative definition of UTF-8, in the Unicode Standard, and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot start a sequence of anything identifiable as UTF-8.
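
As an illustration of that definition, the lead-byte ranges can be sketched as a small lookup. This is a sketch only, assuming the first-byte ranges of the well-formed UTF-8 byte sequences as given in RFC 3629 and the Unicode Standard; `utf8_sequence_length` is a hypothetical name, and the check covers the first byte only, not the constraints on the bytes that follow.

```python
from typing import Optional


def utf8_sequence_length(lead: int) -> Optional[int]:
    """Return the sequence length a lead byte announces, or None if the
    byte cannot start any well-formed UTF-8 sequence."""
    if 0x00 <= lead <= 0x7F:
        return 1  # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # C0 and C1 are excluded: they could only begin overlong forms
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4  # F5..F7 would encode values beyond 10FFFF
    # 80..BF are trail bytes; C0, C1 and F5..FF never appear in UTF-8.
    return None


print(utf8_sequence_length(0xFC))  # None
print(utf8_sequence_length(0xFF))  # None
print(utf8_sequence_length(0xF4))  # 4
```

The six leading bits of 0xFC play no part in this classification: the byte falls in the F5..FF range and is rejected exactly as 0xFF is.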


--Ken


Richard.
