Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Richard Wordingham wrote: >> It is not at all clear what the intent of the encoder was - or even >> if it's not just a problem with the data stream. E0 80 80 is not >> permitted, it's garbage. An encoder can't "intend" it. > > It was once a legal way of encoding NUL, just like C0 80, which is >
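
For readers following the byte sequences in this exchange, here is a minimal sketch (not taken from any of the messages) of how a strict decoder treats E0 80 80. It assumes Python's built-in "utf-8" codec as the stand-in for a compliant decoder; the two-byte overlong form C0 80 and the helper name `is_well_formed_utf8` are illustrative additions.

```python
# Overlong spellings of U+0000. Neither is well-formed UTF-8 today,
# even though older, looser definitions tolerated them.
OVERLONG_NUL_FORMS = [
    bytes([0xC0, 0x80]),        # two-byte overlong form
    bytes([0xE0, 0x80, 0x80]),  # three-byte overlong form quoted in the thread
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts `data` (hypothetical helper)."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
        return True
    except UnicodeDecodeError:
        return False


results = [is_well_formed_utf8(form) for form in OVERLONG_NUL_FORMS]
shortest_form_ok = is_well_formed_utf8(b"\x00")  # the only valid encoding of NUL
```

Under this assumption both overlong forms are rejected, and only the single byte 00 is accepted as U+0000.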

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Richard Wordingham wrote: > So it was still a legal way for a non-UTF-8-compliant process! Anything is possible if you are non-compliant. You can encode U+263A with 9,786 FF bytes followed by a terminating FE byte and call that "UTF-8," if you are willing to be non-compliant enough. > Note for

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Hans Åberg wrote: >> Far from solving the stated problem, it would introduce a new one: >> conversion from the "bad data" Unicode code points, currently >> well-defined, would become ambiguous. > > Actually not: just translate the invalid UTF-8 sequences into invalid > UTF-32. Far from solving

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Wed, 17 May 2017 13:37:51 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > >> It is not at all clear what the intent of the encoder was - or even > >> if it's not just a problem with the data stream. E0 80 80 is not > >> permitted, it's garbage. An

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Asmus Freytag via Unicode
On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: There's some sort of rule that proposals should be made seven days in advance of the meeting. I can't find it now, so I'm not sure whether the actual rule was followed, let alone what authority it has.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Hans Åberg via Unicode
> On 17 May 2017, at 23:18, Doug Ewell wrote: > > Hans Åberg wrote: > >>> Far from solving the stated problem, it would introduce a new one: >>> conversion from the "bad data" Unicode code points, currently >>> well-defined, would become ambiguous. >> >> Actually not: just

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Wed, 17 May 2017 15:31:56 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > > So it was still a legal way for a non-UTF-8-compliant process! > > Anything is possible if you are non-compliant. You can encode U+263A > with 9,786 FF bytes followed by a

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Hans Åberg wrote: > It would be useful, for use with filesystems, to have Unicode > codepoint markers that indicate how UTF-8, including non-valid > sequences, is translated into UTF-32 in a way that the original > octet sequence can be restored. I have always argued strongly against this idea,
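
As background to the round-trip idea quoted above, one existing mechanism of this general kind is Python's "surrogateescape" error handler (PEP 383), which maps each undecodable byte to a lone surrogate code point so the original octets can be restored. The sketch below only illustrates that handler; it is not the scheme proposed in the thread, and the file name is made up.

```python
# A byte string that is not well-formed UTF-8: a stray E9 (Latin-1 "é")
# and the overlong sequence E0 80 80. The name itself is hypothetical.
raw = b"caf\xe9-\xe0\x80\x80.txt"

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into raw bytes.
restored = text.encode("utf-8", errors="surrogateescape")

round_trips = restored == raw
```

The resulting string is deliberately not valid Unicode text (it contains unpaired surrogates), which is one reason this kind of marker scheme is contested.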

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Henri Sivonen wrote: > I find it shocking that the Unicode Consortium would change a > widely-implemented part of the standard (regardless of whether Unicode > itself officially designates it as a requirement or suggestion) on > such flimsy grounds. > > I'd like to register my feedback that I

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Hans Åberg via Unicode
> On 17 May 2017, at 22:36, Doug Ewell via Unicode wrote: > > Hans Åberg wrote: > >> It would be useful, for use with filesystems, to have Unicode >> codepoint markers that indicate how UTF-8, including non-valid >> sequences, is translated into UTF-32 in a way that the

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Wed, 17 May 2017 13:41:56 -0700 Doug Ewell via Unicode wrote: > Perhaps surprisingly, it's already too late. UTC approved this change > the day after the proposal was written. > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 Approved for Unicode 11.0. Unicode 10.0

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Richard Wordingham wrote: I'm afraid I don't get the analogy. You can't build a full Unicode system out of Unicode-compliant parts. Others will have to address Richard's point about canonical-equivalent sequences. However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8 (in

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Thu, 18 May 2017 02:04:55 +0200 Philippe Verdy via Unicode wrote: > I find intriguing that the update intends to enforce the decoding > of the **shortest** sequences, but now wants to treat **maximal > sequences** as a single unit with arbitrary length. UTF-8 was >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Philippe Verdy via Unicode
I find intriguing that the update intends to enforce the decoding of the **shortest** sequences, but now wants to treat **maximal sequences** as a single unit with arbitrary length. UTF-8 was designed to work only with some state machines that would NEVER need to parse more than 4 bytes. For

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:36 PM, Markus Scherer wrote: > Let me try to address some of the issues raised here. Thank you. > The proposal changes a recommendation, not a requirement. This is a very bad reason in favor of the change. If anything, this should be a reason why

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Alastair Houghton via Unicode
> On 16 May 2017, at 20:43, Richard Wordingham via Unicode > wrote: > > On Tue, 16 May 2017 11:36:39 -0700 > Markus Scherer via Unicode wrote: > >> Why do we care how we carve up an illegal sequence into subsequences? >> Only for debugging and visual
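
To make the question of "carving up" an ill-formed sequence concrete, here is a small sketch that counts how many U+FFFD characters one widely used decoder emits. It assumes CPython 3's "utf-8" codec with errors="replace", which as far as I know substitutes one U+FFFD per maximal subpart of an ill-formed subsequence; other decoders may carve the same bytes differently, which is the point under debate. The helper name `count_replacements` is hypothetical.

```python
REPLACEMENT = "\ufffd"


def count_replacements(data: bytes) -> int:
    """Count U+FFFD characters produced when decoding with errors='replace'."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# E0 80 80: 80 cannot follow E0, so each byte is its own ill-formed unit.
overlong_nul = count_replacements(b"\xe0\x80\x80")

# F0 9F 98 is a valid prefix of a four-byte sequence cut short by "A",
# so the three bytes form one maximal subpart.
truncated = count_replacements(b"\xf0\x9f\x98A")
```

With this decoder the first input yields three replacement characters and the second yields one; a decoder that replaced the whole of E0 80 80 with a single U+FFFD would give a different count for the first input.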