Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Karl Williamson via Unicode
On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf I think Unicode should not adopt the proposed change. The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Philippe Verdy via Unicode
Software designed with only UCS-2 and not real UTF-16 support is still used today. For example, MySQL with its broken "UTF-8" encoding, which in fact encodes supplementary characters as two separate 16-bit code units for surrogates, each one blindly encoded as a 3-byte sequence which would be
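
A minimal sketch of the byte layout this message describes: a supplementary character split into a surrogate pair, with each 16-bit code unit written out as its own 3-byte sequence. The function name `encode_surrogates_as_3_bytes` is hypothetical and is not taken from MySQL or any other implementation; the sketch only illustrates the encoding pattern and makes no claim about what a particular database actually stores.

```python
def encode_surrogates_as_3_bytes(ch: str) -> bytes:
    """Encode one character the way the message describes (hypothetical helper).

    BMP characters are encoded as ordinary UTF-8. Supplementary characters
    are first split into a UTF-16 surrogate pair, and each surrogate code
    unit is then written with the 3-byte UTF-8 bit pattern.
    """
    cp = ord(ch)
    if cp < 0x10000:
        return ch.encode("utf-8")

    # Split the code point into a UTF-16 surrogate pair.
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)
    low = 0xDC00 + (v & 0x3FF)

    out = bytearray()
    for unit in (high, low):
        # 3-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
        out.append(0xE0 | (unit >> 12))
        out.append(0x80 | ((unit >> 6) & 0x3F))
        out.append(0x80 | (unit & 0x3F))
    return bytes(out)


sample = "\U0001F600"
six_bytes = encode_surrogates_as_3_bytes(sample)   # ED A0 BD ED B8 80
four_bytes = sample.encode("utf-8")                # F0 9F 98 80 (well-formed UTF-8)
```

A strict UTF-8 decoder rejects the six-byte form, because surrogate code points are not valid in well-formed UTF-8; in Python, `six_bytes.decode("utf-8")` raises `UnicodeDecodeError`.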

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Philippe Verdy via Unicode
2017-05-15 19:54 GMT+02:00 Asmus Freytag via Unicode : > I think this political reason should be taken very seriously. There are > already too many instances where ICU can be seen "driving" the development > of property and algorithms. > > Those involved in the ICU project

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread David Starner via Unicode
On Mon, May 15, 2017 at 8:41 AM Alastair Houghton via Unicode < unicode@unicode.org> wrote: > Yes, UTF-8 is more efficient for primarily ASCII text, but that is not the > case for other situations UTF-8 is clearly more efficient space-wise that includes more ASCII characters than characters

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode
On 5/15/2017 11:33 AM, Henri Sivonen via Unicode wrote: ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't representative of implementation concerns of implementations that use UTF-8 as their in-memory Unicode representation. Even though there are notable systems (Win32,

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Shawn Steele via Unicode
>> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting >> multiple errors there makes no sense. > > Changing a specification as fundamental as this is something that should not > be undertaken lightly. IMO, the only thing that can be agreed upon is that "something's bad

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton wrote: > On 15 May 2017, at 11:21, Henri Sivonen via Unicode > wrote: >> >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unicode should not adopt

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Alastair Houghton via Unicode
On 15 May 2017, at 18:52, Asmus Freytag wrote: > > On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote: >> On 15 May 2017, at 11:21, Henri Sivonen via Unicode >> wrote: >>> In reference to: >>>

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode
On 5/15/2017 3:21 AM, Henri Sivonen via Unicode wrote: Second, the political reason: Now that ICU is a Unicode Consortium project, I think the Unicode Consortium should be particularly sensitive to biases arising from being both the source of the spec and the source of a popular implementation.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Asmus Freytag via Unicode
On 5/15/2017 8:37 AM, Alastair Houghton via Unicode wrote: On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf I think Unicode should not adopt the proposed change. Disagree. An over-long

Re: Are Emoji ZWJ sequences characters?

2017-05-15 Thread Richard Wordingham via Unicode
On Mon, 15 May 2017 16:14:23 + Peter Constable via Unicode wrote: > So, your helpful person was, indeed, helpful, giving you correct > information: ZWJ sequences are not _characters_ and have no > implications for ISO/IEC 10646. Except in so far as the claimed ligature

RE: Are Emoji ZWJ sequences characters?

2017-05-15 Thread Peter Constable via Unicode
Emoji sequences are not _encoded_, per se, in either Unicode or ISO/IEC 10646. The act of "encoding" in either of these coding standards is to assign an encoded representation in the encoding method of the standards for a given entity. In this case, that means to assign a code point.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Alastair Houghton via Unicode
On 15 May 2017, at 11:21, Henri Sivonen via Unicode wrote: > > In reference to: > http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf > > I think Unicode should not adopt the proposed change. Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting
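
For readers following the disagreement, a small sketch of how one existing decoder counts errors for an over-long sequence. It relies on CPython 3's built-in UTF-8 decoder with `errors="replace"`; `count_replacements` is a hypothetical helper introduced here, and nothing below reproduces ICU's behaviour or the text of the proposal. The view quoted above would instead treat each over-long sequence as a single error.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(data: bytes) -> int:
    """Count the U+FFFD characters produced when decoding `data` leniently."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# Over-long (non-shortest-form) encodings of U+002F '/'.
overlong_two = b"\xc0\xaf"        # 2-byte over-long form
overlong_three = b"\xe0\x80\xaf"  # 3-byte over-long form

two_count = count_replacements(overlong_two)      # one U+FFFD per byte
three_count = count_replacements(overlong_three)  # one U+FFFD per byte
```

With this decoder, neither over-long form is collapsed into a single replacement character: every byte of the ill-formed sequence is reported separately, which is the kind of output the "single error" position objects to.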

Are Emoji ZWJ sequences characters?

2017-05-15 Thread William_J_G Overington via Unicode
I am concerned about emoji ZWJ sequences being encoded without going through the ISO process and whether Unicode will therefore lose synchronization with ISO/IEC 10646. I have raised this by email and a very helpful person has advised me that encoding emoji sequences does not mean that Unicode

Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-15 Thread Henri Sivonen via Unicode
In reference to: http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf I think Unicode should not adopt the proposed change. The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political reason why the proposal is a bad idea. First, the technical