On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode
> <unicode@unicode.org> wrote:
> > I’m not sure how the discussion of “which is better” relates to the
> > discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the
> underlying issue. I'll clarify what I meant on that point and then
> move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the
> dominant design. However, because UTF-8 as the internal memory
> representation is *such a good design* (when legacy constraints permit)
> that *despite it not being the current dominant design*, I think the
> Unicode Consortium should be fully supportive of UTF-8 as the internal
> memory representation and not treat UTF-16 as the internal
> representation as the one true way of doing things that gets
> considered when speccing stuff.
>
> I.e. I wasn't arguing against UTF-16 as the internal memory
> representation (for the purposes of this thread) but trying to
> motivate why the Consortium should consider "UTF-8 internally" equally
> despite it not being the dominant design.
>
> So: When a decision could go either way from the "UTF-16 internally"
> perspective, but one way clearly makes more sense from the "UTF-8
> internally" perspective, the "UTF-8 internally" perspective should be
> decisive in *such a case*. (I think the matter at hand is such a
> case.)
>
> At the very least a proposal should discuss the impact on the "UTF-8
> internally" case, which the proposal at hand doesn't do.
>
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of
> implementations) issue to be decided but a proposed change to a
> standard that has many widely-deployed implementations. Even when
> observing only "UTF-16 internally" implementations, I think it would
> be appropriate for the proposal to include a review of what existing
> implementations, beyond ICU, do.
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick
> test with three major browsers that use UTF-16 internally and have
> independent (of each other) implementations of UTF-8 decoding
> (Firefox, Edge and Chrome)

Something I've learned through working with Node (which uses the V8
JavaScript engine from Chrome): V8 stores strings as either UTF-16 or
UTF-8 interchangeably, rather than committing to one or the other...
https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY

I also wouldn't assume UTF-16 is a 'majority'; Go uses UTF-8, for
instance.

> shows agreement on the current spec: there
> is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line,
> 6 on the second, 4 on the third and 6 on the last line). Changing the
> Unicode standard away from that kind of interop needs *way* better
> rationale than "feels right".
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
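
For anyone who wants to check the "one REPLACEMENT CHARACTER per bogus byte" behaviour outside a browser, here is a minimal sketch using Python 3's built-in UTF-8 decoder. The helper name `count_replacements` and the sample byte strings are my own illustrations; they are not the byte sequences used on the test page, and this says nothing about what ICU or any particular browser does.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(data: bytes) -> int:
    """Decode as UTF-8 with substitution and count the U+FFFD in the result.

    Note: a well-formed U+FFFD already present in the input is counted too.
    """
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# Hypothetical ill-formed samples (assumptions, not taken from the test page):
# an overlong two-byte form and an overlong three-byte form.
samples = [b"\xc0\x80", b"\xe0\x80\x80"]

for raw in samples:
    print(raw.hex(), "->", count_replacements(raw))
```

With these two samples, Python emits one U+FFFD for each byte of the ill-formed sequence, which is the same shape of result described in the quoted message.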