On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote:
By definition D39b, either sequence of bytes, if encountered by a
conformant UTF-8 conversion process, would be interpreted as a
sequence of 6 maximal subparts of an ill-formed subsequence.
("D39b" is a typo for "D93b".)

Sorry about that. :)


Conformant with what?  There is no mandatory *requirement* for a UTF-8
conversion process conformant with Unicode to have any concept of
'maximal subpart'.

Conformant with the definition of UTF-8. I agree that nothing forces a conversion *process* to care anything about maximal subparts, but if *any* process using a conformant definition of UTF-8 then goes on to have any concept of "maximal subpart of an ill-formed subsequence" that departs from definition D93b in the Unicode Standard, then it is just making s**t up.
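A minimal sketch of how D93b plays out on the byte sequences under discussion, assuming the well-formed byte ranges of Table 3-7 in the Unicode Standard. The names `_trail_ranges` and `split_units` are made up for this illustration and come from no library.

```python
def _trail_ranges(lead):
    """Allowed (lo, hi) range for each trailing byte, or None if `lead`
    cannot start a well-formed multi-byte sequence (after Table 3-7)."""
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [(0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]
    if lead == 0xF0:
        return [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xF4:
        return [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]
    return None


def split_units(data):
    """Split bytes into (chunk, is_well_formed) pairs. Each ill-formed
    chunk is one maximal subpart: either the longest prefix of some
    well-formed sequence, or a single byte."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # ASCII: always well-formed
            out.append((data[i:i + 1], True))
            i += 1
            continue
        ranges = _trail_ranges(lead)
        if ranges is None:                   # 80..C1, F5..FF: length-one subpart
            out.append((data[i:i + 1], False))
            i += 1
            continue
        j = i + 1
        for lo, hi in ranges:                # consume trailing bytes while they fit
            if j < len(data) and lo <= data[j] <= hi:
                j += 1
            else:
                break
        out.append((data[i:j], j - i == len(ranges) + 1))
        i = j
    return out


fc_units = split_units(bytes.fromhex("FC8080808080"))
ff_units = split_units(bytes.fromhex("FFFFFFFFFFFF"))
truncated = split_units(bytes.fromhex("E18041"))
```

Under these assumptions both six-byte sequences come apart into six single-byte ill-formed chunks, while a truncated three-byte sequence such as E1 80 followed by an ASCII byte stays together as one maximal subpart.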


I don't see a good reason to build in special logic to treat FC 80 80
80 80 80 as somehow privileged as a unit for conversion fallback,
simply because *if* UTF-8 were defined as the Unix gods intended
(which it ain't no longer) then that sequence *could* be interpreted
as an out-of-bounds scalar value (which it ain't) on spec that the
codespace *might* be extended past 10FFFF at some indefinite time in
the future (which it won't).
Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
invalid sequence.

That would be equally true of FF FF FF FF FF FF. Which was my point, actually.

   FC is not ASCII,

True, of course. But irrelevant. Because we are talking about UTF-8 here. And just because some non-UTF-8 character encoding happened to include 0xFC as a valid (or invalid) value, that would not necessarily require any special case processing. A simple 8-bit to 8-bit conversion table could be completely regular in its processing of 0xFC for a conversion.

  and has more than one leading bit
set.  It has the six leading bits set,

True, of course.

  and therefore should start a
sequence of 6 characters.

That is completely false, and has nothing to do with the current definition of UTF-8.

The current, normative definition of UTF-8, in the Unicode Standard, and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot start a sequence of anything identifiable as UTF-8.
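To make the contrast concrete, here is a rough sketch that sets the "count the leading 1 bits" reading of a lead byte against the lead-byte ranges permitted by the current definition. Both function names are hypothetical, and the ranges are the well-formed lead bytes C2..DF, E0..EF and F0..F4.

```python
def leading_ones(byte):
    """Count the consecutive 1 bits at the top of an 8-bit value."""
    n = 0
    while n < 8 and byte & (0x80 >> n):
        n += 1
    return n


def sequence_length_current(lead):
    """Sequence length implied by a lead byte under the current
    definition of UTF-8, or None if the byte cannot start a sequence."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return None  # 80..C1 and F5..FF never start well-formed UTF-8


fc_bits = leading_ones(0xFC)                # six leading 1 bits
fc_length = sequence_length_current(0xFC)   # None: not a lead byte at all
```

The bit count for 0xFC is indeed six, but under the current definition the byte falls outside every lead-byte range, so no sequence length follows from it.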

--Ken


Richard.

