On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote:
TUS Section 3 is like the Augean Stables.  It is a complete mess as a
standards document,

That is a matter of editorial taste, I suppose.

imputing mental states to computing processes.

That, however, is false. The rhetorical turn in the Unicode Standard's conformance clauses, "A process shall interpret..." and "A process shall not interpret..." has been in the standard for 21 years, and seems to have done its general job in guiding interoperable, conformant implementations fairly well. And everyone -- well, perhaps almost everyone -- has been able to figure out that such wording is a shorthand for something along the lines of "Any person implementing software conforming to the Unicode Standard in which a process does X shall implement it in such a way that that process when doing X shall follow the specification part Y, relevant to doing X, exactly according to that specification of Y...", rather than a misguided assumption that software processes are cognitive agents equipped with mental states that the standard can "tell what to think".

And I contend that the shorthand works just fine.


Table 3-7 for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
forms'.

Well, Definition D92 does already explicitly limit UTF-8 to Unicode scalar values, and explicitly limits the form to sequences of one to four bytes. The reason it doesn't explicitly exclude "non-shortest form" in the definition, but instead refers to Table 3-7 for the well-formed sequences (which, btw, explicitly rule out all the non-shortest forms), is that doing so would create another terminological conundrum -- trying to specify an air-tight definition of "non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is terminologically cleaner to let people *derive* non-shortest form from the explicit exclusions of Table 3-7.
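
For readers who want to see the table-driven approach concretely, here is a minimal Python sketch of a well-formedness check that consults nothing but the rows of Table 3-7. The byte ranges are transcribed by hand from the table, and the names TABLE_3_7, match_row and is_well_formed_utf8 are hypothetical, chosen only for this illustration.

```python
# Rows of Table 3-7 (Well-Formed UTF-8 Byte Sequences), transcribed by hand.
# Each row is a tuple of inclusive (low, high) ranges, one range per byte.
TABLE_3_7 = (
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
)


def match_row(data: bytes, pos: int) -> int:
    """Return the length of the table row matching data at pos, or 0 if none."""
    for row in TABLE_3_7:
        chunk = data[pos:pos + len(row)]
        if len(chunk) == len(row) and all(
            lo <= b <= hi for b, (lo, hi) in zip(chunk, row)
        ):
            return len(row)
    return 0


def is_well_formed_utf8(data: bytes) -> bool:
    """True if data is a concatenation of sequences listed in the table."""
    pos = 0
    while pos < len(data):
        n = match_row(data, pos)
        if n == 0:
            return False  # no row matches: ill-formed at this position
        pos += n
    return True
```

Note that the check never mentions "non-shortest form" or surrogates. <C0 80> and <ED A0 80> are rejected simply because no row matches them.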

Instead, the exclusion of the sequence <ED A0 80> is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because it
would define either a 'non-shortest form' or a value that is not a
Unicode scalar value.

Actually 0xFC fails quite simply and unambiguously, because it is not in Table 3-7. End of story.

Same for 0xFF. There is nothing architecturally special about 0xF5..0xFF. All are simply and unambiguously excluded from any well-formed UTF-8 byte sequence.
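
As an illustration, the set of byte values that never occur in well-formed UTF-8 can be enumerated by brute force. This is a sketch that assumes Python's built-in "utf-8" codec produces exactly the sequences of Table 3-7; the function name bytes_never_in_utf8 is hypothetical.

```python
def bytes_never_in_utf8():
    """Collect every byte value that no Unicode scalar value encodes to."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not Unicode scalar values
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


if __name__ == "__main__":
    print(" ".join(f"{b:02X}" for b in bytes_never_in_utf8()))
```

Under that assumption the result is C0, C1 and F5 through FF. 0xFC and 0xFF fall out of the same enumeration as every other excluded byte, with no special case for either.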


The differences are a matter of presentation; the outcome as to what is
permitted is the same.  The difference lies rather in whether the rules
are comprehensible.  A comprehensible definition is more likely to be
implemented correctly.  Where the presentation makes a difference is in
how malformed sequences are naturally handled.

Well, I don't think implementers have all that much trouble figuring out what *well-formed* UTF-8 is these days.

As for "how malformed sequences are naturally handled", I can't really say. Nor do I think the standard actually requires any particular handling to be conformant. It says thou shalt not emit them, and if you encounter them, thou shalt not interpret them as Unicode characters. Beyond that, it would be nice, of course, if people converged their error handling for malformed sequences in cooperative ways, but there is no conformance statement to that effect in the standard.
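
The two requirements in the paragraph above (do not emit ill-formed sequences, do not interpret them as characters) can be sketched with Python's built-in codec. This shows how one implementation behaves, not something the standard mandates; interpret and can_emit are hypothetical helper names.

```python
from typing import Optional


def interpret(data: bytes) -> Optional[str]:
    """Decode data as UTF-8, or return None rather than interpret bad input."""
    try:
        return data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None


def can_emit(text: str) -> bool:
    """True if text can be encoded as well-formed UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False  # e.g. a lone surrogate such as U+D800


# Substituting U+FFFD is a separate policy choice layered on top of the
# refusal to interpret; how many U+FFFD result depends on the implementation.
replaced = b"\xed\xa0\x80".decode("utf-8", errors="replace")
```

Returning None, raising an error, or substituting U+FFFD are all ways of not interpreting the bytes as characters; which one to use is the error-handling question left open above.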

I have no trouble with the contention that the wording about "best practice" and "recommendations" regarding the handling of U+FFFD has caused some confusion and differences of interpretation among implementers. I'm sure the language in that area could use cleanup, precisely because it has led to contending, incompatible interpretations of the text. As to what actually *is* best practice in use of U+FFFD when attempting to convert ill-formed sequences handed off to UTF-8 conversion processes, or whether the Unicode Standard should attempt to narrow down or change practice in that area, I am completely agnostic. Back to the U+FFFD thread for that discussion.

--Ken
