Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Wed, 31 May 2017 19:24:04 + Shawn Steele via Unicode wrote: > It seems to me that being able to use a data stream of ambiguous > quality in another application with predictable results, then that > stream should be “repaired” prior to being handed over. Then both > endpoints would be usin

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Wed, 31 May 2017 17:43:08 + Shawn Steele via Unicode wrote: > There also appears to be a special weight given to > non-minimally-encoded sequences. It would seem to me that none of > these illegal sequences should appear in practice, so we have either: > I do not understand the energy

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> And *that* is what the specification says. The whole problem here is that > someone elevated > one choice to the status of “best practice”, and it’s a choice that some of > us don’t think *should* > be considered best practice. > Perhaps “best practice” should simply be altered to say that yo

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Doug Ewell via Unicode
Henri Sivonen wrote: > If anything, I hope this thread results in the establishment of a > requirement for proposals to come with proper research about what > multiple prominent implementations do about the subject matter of a > proposal concerning changes to text about implementation behavior. C

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> it’s more meaningful for whoever sees the output to see a single U+FFFD > representing > the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid > lead byte and > then another for an “unexpected” trailing byte. I disagree. It may be more meaningful for some applications
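
To make the two outcomes in the excerpt above concrete, here is a minimal Python sketch for the non-shortest-form NUL (bytes C0 80). CPython's built-in decoder with errors="replace" emits two U+FFFDs for this input. The function decode_collapsing is a hypothetical helper written only for this illustration (it is not taken from ICU or any other implementation mentioned in the thread); it assumes a policy in which a lead byte and the trailing bytes it implies are replaced by a single U+FFFD.

```python
REPLACEMENT = "\ufffd"


def decode_collapsing(data: bytes) -> str:
    """Hypothetical policy: one U+FFFD per lead byte plus its trailing bytes."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII passes through unchanged.
            out.append(chr(b))
            i += 1
            continue
        # Number of trailing bytes implied by the lead byte (structural only).
        if 0xC0 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF7:
            need = 3
        else:
            need = 0  # lone continuation byte, or F8..FF
        # Consume up to `need` continuation bytes (80..BF) after the lead byte.
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))  # strict: accepts well-formed only
        except UnicodeDecodeError:
            out.append(REPLACEMENT)  # the whole chunk becomes one U+FFFD
        i = j
    return "".join(out)


overlong_nul = b"\xc0\x80"
two = overlong_nul.decode("utf-8", errors="replace")  # CPython built-in decoder
one = decode_collapsing(overlong_nul)                 # hypothetical policy
print(len(two), len(one))
```

Both results are well-formed; they differ only in how many replacement characters stand in for the same two bad bytes, which is the point under dispute.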

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> For implementations that emit FFFD while handling text conversion and repair > (ie, converting ill-formed > UTF-8 to well-formed), it is best for interoperability if they get the same > results, so that indices within the > resulting strings are consistent across implementations for all the cor

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Mark Davis ☕️ via Unicode
> I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen. I think Richard stated the most compelling reason: … The bug you mentioned arose from two different ways of counting the string leng

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode
On 31 May 2017, at 18:43, Shawn Steele via Unicode wrote: > > It is unclear to me what the expected behavior would be for this corruption > if, for example, there were merely a half dozen 0x80 in the middle of ASCII > text? Is that garbage a single "character"? Perhaps because it's a > conse
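
As a small illustration of the question in the excerpt above, the following Python sketch decodes six 0x80 bytes embedded in ASCII text. It uses only CPython's built-in decoder, which emits one U+FFFD per stray continuation byte; the "collapsed" string is written out by hand as an assumed alternative in which the whole run is treated as one unit, and is not the output of any decoder named in the thread.

```python
# Six stray continuation bytes in the middle of ASCII text.
raw = b"abc" + b"\x80" * 6 + b"def"

# CPython's decoder replaces each stray 0x80 with its own U+FFFD.
per_byte = raw.decode("utf-8", errors="replace")

# Assumed alternative, written by hand: the whole run becomes one U+FFFD.
collapsed = "abc\ufffddef"

# The same trailing text lands at different indices under the two policies.
print(per_byte.index("d"), collapsed.index("d"))
```

The differing index of "d" shows why two components that repair the same bytes differently can disagree about offsets into the resulting string.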

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode
> On 30 May 2017, at 18:11, Shawn Steele via Unicode > wrote: > >> Which is to completely reverse the current recommendation in Unicode 9.0. >> While I agree that this might help you fending off a bug report, it would >> create chances for bug reports for Ruby, Python3, many if not all Web >

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> > In either case, the bad characters are garbage, so neither approach is > > "better" - except that one or the other may be more conducive to the > > requirements of the particular API/application. > There's a potential issue with input methods that indirectly edit the backing > store. For e

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Wed, 31 May 2017 15:12:12 +0300 Henri Sivonen via Unicode wrote: > The write-up mentions > https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd > like to draw everyone's attention to that bug, which is real-world > evidence of a bug arising from two UTF-8 decoders within one

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Henri Sivonen via Unicode
I've researched this more. While the old advice dominates the handling of non-shortest forms, there is more variation than I previously thought when it comes to truncated sequences and CESU-8-style surrogates. Still, the ICU behavior is an outlier considering the set of implementations that I teste