On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode <unicode@unicode.org> wrote:
> I’m not sure how the discussion of “which is better” relates to the
> discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I meant on that point and then move on:

I acknowledge that UTF-16 as the internal memory representation is the dominant design. However, UTF-8 as the internal memory representation is *such a good design* (when legacy constraints permit) that, *despite it not being the current dominant design*, I think the Unicode Consortium should be fully supportive of it and should not treat UTF-16 as the internal representation as the one true way of doing things that gets considered when speccing stuff.

I.e. I wasn't arguing against UTF-16 as the internal memory representation (for the purposes of this thread) but trying to motivate why the Consortium should consider "UTF-8 internally" equally despite it not being the dominant design.

So: When a decision could go either way from the "UTF-16 internally" perspective, but one way clearly makes more sense from the "UTF-8 internally" perspective, the "UTF-8 internally" perspective should be decisive in *such a case*. (I think the matter at hand is such a case.)

At the very least, a proposal should discuss the impact on the "UTF-8 internally" case, which the proposal at hand doesn't do.

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of implementations) issue to be decided but a proposed change to a standard that has many widely-deployed implementations. Even when observing only "UTF-16 internally" implementations, I think it would be appropriate for the proposal to include a review of what existing implementations, beyond ICU, do.

Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right".

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
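
P.S. For anyone who wants to try this kind of count outside a browser, here is a minimal sketch in Python. It is only an illustration: the byte sequences below are hypothetical examples chosen for this sketch and are not necessarily the ones on the test page, the helper name count_replacements is made up, and CPython is not one of the implementations tested above. For these particular inputs every byte is bogus on its own, so the count of U+FFFD equals the number of bytes; a truncated but otherwise valid prefix can be replaced as a unit, so that equality does not hold for all inputs.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(data: bytes) -> int:
    """Decode as UTF-8, substituting U+FFFD for errors, and count the substitutions."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# Hypothetical ill-formed inputs (assumptions for illustration only).
samples = {
    "overlong encoding (C0 80)": b"\xc0\x80",
    "encoded surrogate (ED A0 80)": b"\xed\xa0\x80",
    "beyond U+10FFFF (F4 90 80 80)": b"\xf4\x90\x80\x80",
}

for label, raw in samples.items():
    print(f"{label}: {len(raw)} bytes, {count_replacements(raw)} replacement characters")
```

A decoder that instead emitted a single U+FFFD for each of these sequences would report 1 in every row, which is the kind of observable difference a review of existing implementations would surface.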