Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

Ken Whistler via Unicode Thu, 01 Jun 2017 21:58:12 -0700


On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote:

TUS Section 3 is like the Augean Stables.  It is a complete mess as a
standards document,


That is a matter of editorial taste, I suppose.

imputing mental states to computing processes.

That, however, is false. The rhetorical turn in the Unicode Standard's conformance clauses, "A process shall interpret..." and "A process shall not interpret..." has been in the standard for 21 years, and seems to have done its general job in guiding interoperable, conformant implementations fairly well. And everyone -- well, perhaps almost everyone -- has been able to figure out that such wording is a shorthand for something along the lines of "Any person implementing software conforming to the Unicode Standard in which a process does X shall implement it in such a way that that process when doing X shall follow the specification part Y, relevant to doing X, exactly according to that specification of Y...", rather than a misguided assumption that software processes are cognitive agents equipped with mental states that the standard can "tell what to think".


And I contend that the shorthand works just fine.


Table 3-7 for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
forms'.

Well, Definition D92 does already explicitly limit UTF-8 to Unicode scalar values, and explicitly limits the form to sequences of one to four bytes. The reason why it doesn't explicitly include the exclusion of "non-shortest form" in the definition, but instead refers to Table 3-7 for the well-formed sequences (which, btw, explicitly rule out all the non-shortest forms), is because that would create another terminological conundrum -- trying to specify an air-tight definition of "non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is terminologically cleaner to let people *derive* non-shortest form from the explicit exclusions of Table 3-7.
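To make the "derive it from the table" point concrete, here is a minimal sketch of a validator driven purely by the byte ranges of Table 3-7. The names TABLE_3_7, matches_row and is_well_formed_utf8 are made up for illustration and are not from the standard; the ranges are transcribed from the table as I read it, so check them against the published text before relying on them.

```python
# Byte ranges transcribed from Table 3-7 (Well-Formed UTF-8 Byte Sequences).
# Each row lists the allowed range for the first byte, then for each
# trailing byte. All names here are illustrative, not from the standard.
TABLE_3_7 = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def matches_row(data: bytes, pos: int, row) -> bool:
    """True if the bytes starting at pos fit every range in this table row."""
    if pos + len(row) > len(data):
        return False  # truncated sequence
    return all(lo <= data[pos + i] <= hi for i, (lo, hi) in enumerate(row))


def is_well_formed_utf8(data: bytes) -> bool:
    """Accept data only if it is a concatenation of Table 3-7 sequences."""
    pos = 0
    while pos < len(data):
        for row in TABLE_3_7:
            if matches_row(data, pos, row):
                pos += len(row)
                break
        else:
            return False  # no row matches: ill-formed
    return True
```

Nothing in the sketch mentions surrogates or "non-shortest form", yet <ED A0 80>, <C0 80> and anything starting with 0xFC are all rejected, simply because no row of the table admits them.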

Instead, the exclusion of the sequence <ED A0 80> is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because it
would define either a 'non-shortest form' or a value that is not a
Unicode scalar value.

Actually 0xFC fails quite simply and unambiguously, because it is not in Table 3-7. End of story.

Same for 0xFF. There is nothing architecturally special about 0xF5..0xFF. All are simply and unambiguously excluded from any well-formed UTF-8 byte sequence.
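One rough way to see this without reading the table at all is to encode every Unicode scalar value and record which byte values ever turn up. The sketch below uses Python's built-in UTF-8 encoder as a stand-in for the definition; bytes_used_by_utf8 is a name invented for this example.

```python
def bytes_used_by_utf8() -> set:
    """Collect every byte value that occurs in the UTF-8 form of some scalar value."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not Unicode scalar values
        used.update(chr(cp).encode("utf-8"))
    return used


# Byte values that never occur anywhere in well-formed UTF-8.
never_used = sorted(set(range(256)) - bytes_used_by_utf8())
print(" ".join(f"{b:02X}" for b in never_used))
```

The bytes that come out are 0xC0, 0xC1 and the whole run 0xF5..0xFF. 0xFC and 0xFF sit in that list on exactly the same footing as the rest.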


The differences are a matter of presentation; the outcome as to what is
permitted is the same.  The difference lies rather in whether the rules
are comprehensible.  A comprehensible definition is more likely to be
implemented correctly.  Where the presentation makes a difference is in
how malformed sequences are naturally handled.

Well, I don't think implementers have all that much trouble figuring out what *well-formed* UTF-8 is these days.

As for "how malformed sequences are naturally handled", I can't really say. Nor do I think the standard actually requires any particular handling to be conformant. It says thou shalt not emit them, and if you encounter them, thou shalt not interpret them as Unicode characters. Beyond that, it would be nice, of course, if people converged their error handling for malformed sequences in cooperative ways, but there is no conformance statement to that effect in the standard.
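As a hedged illustration of that latitude, here are two handling styles using Python's built-in UTF-8 decoder. Neither is put forward as what the standard requires; the helper names decode_strict and decode_replacing are made up for this example. Both avoid interpreting the ill-formed byte as a character, and they differ only in what they do instead.

```python
def decode_strict(raw: bytes):
    """Reject the whole input if any ill-formed sequence is present."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # caller decides what to do with bad input


def decode_replacing(raw: bytes) -> str:
    """Substitute U+FFFD for ill-formed input and keep going."""
    return raw.decode("utf-8", errors="replace")


data = b"a\xFFb"  # 0xFF can never occur in well-formed UTF-8
print(decode_strict(data))           # None
print(repr(decode_replacing(data)))  # 'a\ufffdb'
```

For a lone 0xFF the replacing decoder emits a single U+FFFD. How many U+FFFD to emit for longer ill-formed runs is where implementations have differed, so the sketch deliberately stays away from those cases.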

I have no trouble with the contention that the wording about "best practice" and "recommendations" regarding the handling of U+FFFD has caused some confusion and differences of interpretation among implementers. I'm sure the language in that area could use cleanup, precisely because it has led to contending, incompatible interpretations of the text. As to what actually *is* best practice in use of U+FFFD when attempting to convert ill-formed sequences handed off to UTF-8 conversion processes, or whether the Unicode Standard should attempt to narrow down or change practice in that area, I am completely agnostic. Back to the U+FFFD thread for that discussion.


--Ken
