Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-09-23 Thread Markus Scherer via Unicode
FYI, I changed the ICU behavior for the upcoming ICU 60 release (pending code review). Proposal & description: https://sourceforge.net/p/icu/mailman/message/35990833/ Code changes: http://bugs.icu-project.org/trac/review/13311 Best regards, markus On Thu, Aug 3, 2017 at 5:34 PM, Mark Davis ☕️

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-05 Thread Martin J. Dürst via Unicode
Hello Mark, On 2017/08/04 09:34, Mark Davis ☕️ wrote: FYI, the UTC retracted the following. Thanks for letting us know! Regards, Martin. *[151-C19 ] Consensus:* Modify the section on "Best Practices for Using FFFD" in section "3.9

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-04 Thread Henri Sivonen via Unicode
On Fri, Aug 4, 2017 at 3:34 AM, Mark Davis ☕️ via Unicode wrote: > FYI, the UTC retracted the following. > > [151-C19] Consensus: Modify the section on "Best Practices for Using FFFD" > in section "3.9 Encoding Forms" of TUS per the recommendation in L2/17-168, > for Unicode

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-03 Thread Mark Davis ☕️ via Unicode
FYI, the UTC retracted the following. *[151-C19 ] Consensus:* Modify the section on "Best Practices for Using FFFD" in section "3.9 Encoding Forms" of TUS per the recommendation in L2/17-168

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-29 Thread Henri Sivonen via Unicode
On Sat Jun 3 23:09:01 CDT 2017 Markus Scherer wrote: > I suggest you submit a write-up via http://www.unicode.org/reporting.html > > and make the case there that you think the UTC should retract > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 The submission has

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-03 Thread Markus Scherer via Unicode
On Wed, May 31, 2017 at 5:12 AM, Henri Sivonen wrote: > On Sun, May 21, 2017 at 7:37 PM, Mark Davis ☕️ via Unicode > wrote: > > There is plenty of time for public comment, since it was targeted at > Unicode > > 11, the release for about a year from

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-02 Thread Alastair Houghton via Unicode
On 1 Jun 2017, at 19:44, Asmus Freytag via Unicode wrote: > > What's not OK is to take an existing recommendation and change it to > something else, just to make bug reports go away for one implementations. > That's like two sleepers fighting over a blanket that's too

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag (c) via Unicode
lace.net] Sent: Thursday, June 1, 2017 4:05 AM To: Henri Sivonen<hsivo...@hsivonen.fi> <mailto:hsivo...@hsivonen.fi> Cc: unicode Unicode Discussion<unicode@unicode.org> <mailto:unicode@unicode.org>; Shawn Steele<shawn.ste...@microsoft.com> <mailto:shawn.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 12:32:08 +0300 Henri Sivonen via Unicode wrote: > On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode > wrote: > > On Wed, 31 May 2017 15:12:12 +0300 > > Henri Sivonen via Unicode wrote: > >> I am

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
@unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote: I think that the (or a) key problem is that the current "best practice" is treated as "SHOULD" in RFC parlance. W

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag via Unicode
stair Houghton [mailto:alast...@alastairs-place.net] Sent: Thursday, June 1, 2017 4:05 AM To: Henri Sivonen <hsivo...@hsivonen.fi> Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele <shawn.ste...@microsoft.com> Subject: Re: Feedback on the proposal to change U+F

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Shawn Steele via Unicode
"strong". -Shawn -Original Message- From: Alastair Houghton [mailto:alast...@alastairs-place.net] Sent: Thursday, June 1, 2017 4:05 AM To: Henri Sivonen <hsivo...@hsivonen.fi> Cc: unicode Unicode Discussion <unicode@unicode.org>; Shawn Steele <shawn.ste...@micro

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Asmus Freytag via Unicode
On 6/1/2017 2:32 AM, Henri Sivonen via Unicode wrote: On Wed, May 31, 2017 at 10:38 PM, Doug Ewell via Unicode wrote: Henri Sivonen wrote: If anything, I hope this thread results in the establishment of a requirement for proposals to come with proper research about

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode wrote: > > On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode > wrote: >> * As far as I can tell, there are two (maybe three) sane approaches to this >> problem: >>* Either a "maximal"

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Henri Sivonen via Unicode
On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode wrote: > On Wed, 31 May 2017 15:12:12 +0300 > Henri Sivonen via Unicode wrote: >> I am not claiming it's too difficult to implement. I think it >> inappropriate to ask implementations, even

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 31 May 2017, at 20:42, Shawn Steele via Unicode wrote: > >> And *that* is what the specification says. The whole problem here is that >> someone elevated >> one choice to the status of “best practice”, and it’s a choice that some of >> us don’t think *should* >> be

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-06-01 Thread Alastair Houghton via Unicode
On 31 May 2017, at 20:24, Shawn Steele via Unicode wrote: > > > For implementations that emit FFFD while handling text conversion and > > repair (ie, converting ill-formed > > UTF-8 to well-formed), it is best for interoperability if they get the same > > results, so that

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Wed, 31 May 2017 19:24:04 + Shawn Steele via Unicode wrote: > It seems to me that being able to use a data stream of ambiguous > quality in another application with predictable results, then that > stream should be “repaired” prior to being handed over. Then both >

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> And *that* is what the specification says. The whole problem here is that > someone elevated > one choice to the status of “best practice”, and it’s a choice that some of > us don’t think *should* > be considered best practice. > Perhaps “best practice” should simply be altered to say that

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Doug Ewell via Unicode
Henri Sivonen wrote: > If anything, I hope this thread results in the establishment of a > requirement for proposals to come with proper research about what > multiple prominent implementations do about the subject matter of a > proposal concerning changes to text about implementation behavior.

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> it’s more meaningful for whoever sees the output to see a single U+FFFD > representing > the illegally encoded NUL that it is to see two U+FFFDs, one for an invalid > lead byte and > then another for an “unexpected” trailing byte. I disagree. It may be more meaningful for some applications

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> For implementations that emit FFFD while handling text conversion and repair > (ie, converting ill-formed > UTF-8 to well-formed), it is best for interoperability if they get the same > results, so that indices within the > resulting strings are consistent across implementations for all the

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Mark Davis ☕️ via Unicode
> I do not understand the energy being invested in a case that shouldn't happen, especially in a case that is a subset of all the other bad cases that could happen. I think Richard stated the most compelling reason: … The bug you mentioned arose from two different ways of counting the string

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode
On 31 May 2017, at 18:43, Shawn Steele via Unicode wrote: > > It is unclear to me what the expected behavior would be for this corruption > if, for example, there were merely a half dozen 0x80 in the middle of ASCII > text? Is that garbage a single "character"? Perhaps

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Alastair Houghton via Unicode
> On 30 May 2017, at 18:11, Shawn Steele via Unicode > wrote: > >> Which is to completely reverse the current recommendation in Unicode 9.0. >> While I agree that this might help you fending off a bug report, it would >> create chances for bug reports for Ruby, Python3,

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Shawn Steele via Unicode
> > In either case, the bad characters are garbage, so neither approach is > > "better" - except that one or the other may be more conducive to the > > requirements of the particular API/application. > There's a potential issue with input methods that indirectly edit the backing > store. For

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Wed, 31 May 2017 15:12:12 +0300 Henri Sivonen via Unicode wrote: > The write-up mentions > https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd > like to draw everyone's attention to that bug, which is real-world > evidence of a bug arising from two

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Henri Sivonen via Unicode
I've researched this more. While the old advice dominates the handling of non-shortest forms, there is more variation than I previously thought when it comes to truncated sequences and CESU-8-style surrogates. Still, the ICU behavior is an outlier considering the set of implementations that I

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Richard Wordingham via Unicode
On Fri, 26 May 2017 21:41:49 + Shawn Steele via Unicode wrote: > I totally get the forward/backward scanning in sync without decoding > reasoning for some implementations, however I do not think that the > practices that benefit those should extend to other applications

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Richard Wordingham via Unicode
On Tue, 30 May 2017 16:38:45 -0600 Karl Williamson via Unicode wrote: > Under Best Practices, how many REPLACEMENT CHARACTERs should the > sequence generate? 0, 1, 2, 3, 4 ? > > In practice, how many do parsers generate? See Markus Kuhn's test page

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Richard Wordingham via Unicode
On Fri, 26 May 2017 11:22:37 -0700 Ken Whistler via Unicode wrote: > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: > > The link provided about the PRI doesn't lead to the comments. > > > > PRI #121 (August, 2008) pre-dated the practice of keeping all the >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode
Original message From: Karl Williamson <pub...@khwilliamson.com> Date: 5/30/17 16:32 (GMT-07:00) To: Doug Ewell <d...@ewellic.org>, Unicode Mailing List <unicode@unicode.org> Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-f

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence > as U+002F. Sort of, maybe. It was not legal for them to generate it though. So you could kind of infer that it was not a legal sequence. -Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
Under Best Practices, how many REPLACEMENT CHARACTERs should the sequence generate? 0, 1, 2, 3, 4 ? In practice, how many do parsers generate?
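One way to probe the "in practice" half of this question is to ask a single, readily available parser. The sketch below counts the U+FFFD characters that CPython's built-in UTF-8 decoder emits with errors="replace". The byte sequence in the original message was lost in extraction, so the samples here are illustrative picks, and count_replacements is a hypothetical helper name. It shows one decoder only (assuming a reasonably recent CPython 3); other implementations may answer differently.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def count_replacements(data: bytes) -> int:
    """Count the U+FFFD characters CPython emits when decoding `data` as UTF-8."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


# Illustrative ill-formed inputs (assumptions, not taken from the thread text).
samples = {
    "overlong two-byte form": b"\xc0\x80",
    "overlong three-byte form": b"\xe0\x80\x80",
    "truncated four-byte form before 'A'": b"\xf0\x90\x80A",
}

if __name__ == "__main__":
    for label, raw in samples.items():
        print(f"{label}: {raw.hex(' ')} -> {count_replacements(raw)} U+FFFD")
```

Running the same inputs through other decoders is a quick way to see whether they agree on the count.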

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote: L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode
L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0 was originally a lead byte for two-byte sequences." When
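A rough sketch of what segmenting "based on the original structural definition" could look like, under stated assumptions: the lead byte alone fixes the expected length, and any byte in 80..BF is accepted as a trail byte. The function names are hypothetical, the lead-byte ranges (C0..DF, E0..EF, F0..F7) are my reading of the quoted wording rather than text from L2/17-168, and the code only splits the input into chunks; it does not decide well-formedness. The byte sequence in the quoted example was lost in extraction, so C0 80 is used for illustration.

```python
def structural_length(lead: int) -> int:
    """Expected sequence length from the lead byte alone (assumed ranges)."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 1  # ASCII, lone trail bytes, and anything else stand alone


def split_structural(data: bytes) -> list:
    """Split `data` into chunks: a lead byte plus as many 80..BF bytes as it expects."""
    chunks = []
    i = 0
    while i < len(data):
        need = structural_length(data[i])
        j = i + 1
        # Accept any trail byte in 80..BF, never a narrower range.
        while j < i + need and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunks.append(data[i:j])
        i = j
    return chunks


if __name__ == "__main__":
    for raw in (b"\xc0\x80", b"\xe0\x80\x80", b"A\x80B"):
        print(raw.hex(" "), "->", [c.hex(" ") for c in split_structural(raw)])
```

Under this segmentation an ill-formed chunk such as C0 80 would be replaced as one unit, which is where the U+FFFD count diverges from a decoder that restricts the second byte.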

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Which is to completely reverse the current recommendation in Unicode 9.0. > While I agree that this might help you fending off a bug report, it would > create chances for bug reports for Ruby, Python3, many if not all Web > browsers,... & Windows & .Net Changing the behavior of the Windows

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> I think nobody is debating that this is *one way* to do things, and that some > code does it. Except that they sort of are. The premise is that the "old language was wrong", and the "new language is right." The reason we know the old language was wrong was that there was a bug filed

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode
Hello Karl, others, On 2017/05/27 06:15, Karl Williamson via Unicode wrote: On 05/26/2017 12:22 PM, Ken Whistler wrote: On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: The link provided about the PRI doesn't lead to the comments. PRI #121 (August, 2008) pre-dated the practice of

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode
Hello Markus, others, On 2017/05/27 00:41, Markus Scherer wrote: On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst wrote: But there's plenty in the text that makes it absolutely clear that some things cannot be included. In particular, it says The term “maximal

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Shawn Steele via Unicode
n the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On 05/26/2017 12:22 PM, Ken Whistler wrote: > > On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: >> The link provided about the PRI doesn't lead to the comments. >> > > PRI #121 (A

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode
On 05/26/2017 12:22 PM, Ken Whistler wrote: On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: The link provided about the PRI doesn't lead to the comments. PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Ken Whistler via Unicode
On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: The link provided about the PRI doesn't lead to the comments. PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered directory with the name "feedback.html".

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Karl Williamson via Unicode
On 05/26/2017 04:28 AM, Martin J. Dürst wrote: It may be worth to think about whether the Unicode standard should mention implementations like yours. But there should be no doubt about the fact that the PRI and Unicode 5.2 (and the current version of Unicode) are clear about what they

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Markus Scherer via Unicode
On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst wrote: > But there's plenty in the text that makes it absolutely clear that some > things cannot be included. In particular, it says > > > The term “maximal subpart of an ill-formed subsequence” refers to the code >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Philippe Verdy via Unicode
> > Citing directly from the PRI: > > > The term "maximal subpart of the ill-formed subsequence" refers to the > longest potentially valid initial subsequence or, if none, then to the next > single code unit. > > The way i understand it is that C0 80 will have TWO maximal subparts,
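The quoted definition can be sketched as a small decoder that emits one U+FFFD per maximal subpart. This is a minimal illustration, not any implementation discussed in the thread: the function names are hypothetical, and the second-byte ranges follow my understanding of the well-formed UTF-8 byte sequence table in the Unicode Standard (C0, C1 and F5..FF are never valid lead bytes; E0, ED, F0 and F4 narrow the second byte).

```python
def _expected_length(lead: int) -> int:
    """Length of a well-formed sequence starting with `lead`, or 0 if none can."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # C0, C1, F5..FF, and lone trail bytes cannot start a sequence


def _second_byte_range(lead: int) -> tuple:
    """Allowed range for the byte right after `lead`."""
    if lead == 0xE0:
        return (0xA0, 0xBF)  # excludes overlong three-byte forms
    if lead == 0xED:
        return (0x80, 0x9F)  # excludes surrogates
    if lead == 0xF0:
        return (0x90, 0xBF)  # excludes overlong four-byte forms
    if lead == 0xF4:
        return (0x80, 0x8F)  # excludes values above U+10FFFF
    return (0x80, 0xBF)


def decode_maximal_subparts(data: bytes) -> str:
    """Decode UTF-8, replacing each maximal subpart of ill-formed input with U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:  # ASCII
            out.append(chr(lead))
            i += 1
            continue
        need = _expected_length(lead)
        if need == 0:  # no valid prefix: replace this single code unit
            out.append("\ufffd")
            i += 1
            continue
        lo, hi = _second_byte_range(lead)
        code_point = lead & (0xFF >> (need + 1))  # payload bits of the lead byte
        j = i + 1
        while j < i + need and j < n and lo <= data[j] <= hi:
            code_point = (code_point << 6) | (data[j] & 0x3F)
            j += 1
            lo, hi = 0x80, 0xBF  # later trail bytes use the full range
        # Complete sequence: emit the character; otherwise one U+FFFD for the prefix.
        out.append(chr(code_point) if j == i + need else "\ufffd")
        i = j
    return "".join(out)


if __name__ == "__main__":
    for raw in (b"\xc0\x80", b"\xe0\x80\x80", b"\xf0\x90\x80A"):
        print(raw.hex(" "), "->", decode_maximal_subparts(raw).encode("unicode_escape"))
```

With this reading, C0 cannot begin any well-formed sequence, so C0 and 80 are each replaced separately, consistent with the "two maximal subparts" interpretation in the message.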

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Martin J. Dürst via Unicode
On 2017/05/25 09:22, Markus Scherer wrote: On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote: On 05/24/2017 12:46 AM, Martin J. Dürst wrote: That's wrong. There was a public review issue with various options and with feedback, and the recommendation has been

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Markus Scherer via Unicode
On Wed, May 24, 2017 at 3:56 PM, Karl Williamson wrote: > On 05/24/2017 12:46 AM, Martin J. Dürst wrote: > >> That's wrong. There was a public review issue with various options and >> with feedback, and the recommendation has been implemented and in use >> widely (among

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Karl Williamson via Unicode
On 05/24/2017 12:46 AM, Martin J. Dürst wrote: On 2017/05/24 05:57, Karl Williamson via Unicode wrote: On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: Adding a "recommendation" this late in the game is just bad standards policy. Unless I misunderstand, you are missing the

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Martin J. Dürst via Unicode
On 2017/05/24 05:57, Karl Williamson via Unicode wrote: On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: Adding a "recommendation" this late in the game is just bad standards policy. Unless I misunderstand, you are missing the point. There is already a recommendation listed in

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Karl Williamson via Unicode
On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote: On 5/23/2017 10:45 AM, Markus Scherer wrote: On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode > wrote: So, if the proposal for Unicode really was more of a "feels right"

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Doug Ewell via Unicode
Asmus Freytag \(c\) wrote: > And why add a recommendation that changes that from completely up to > the implementation (or groups of implementations) to something where > one way of doing it now has to justify itself? A recommendation already exists, at the end of Section 3.9. The current

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode
> If the thread has made one thing clear is that there's no consensus in the > wider community > that one approach is obviously better. When it comes to ill-formed sequences, > all bets are off. > Simple as that. > Adding a "recommendation" this late in the game is just bad standards policy. I

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Asmus Freytag (c) via Unicode
On 5/23/2017 10:45 AM, Markus Scherer wrote: On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode > wrote: So, if the proposal for Unicode really was more of a "feels right" and not a "deviate at your peril" situation (or necessary

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Alastair Houghton via Unicode
> On 23 May 2017, at 18:45, Markus Scherer via Unicode > wrote: > > On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode > wrote: >> So, if the proposal for Unicode really was more of a "feels right" and not a >> "deviate at your peril"

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Markus Scherer via Unicode
On Tue, May 23, 2017 at 7:05 AM, Asmus Freytag via Unicode < unicode@unicode.org> wrote: > So, if the proposal for Unicode really was more of a "feels right" and not > a "deviate at your peril" situation (or necessary escape hatch), then we > are better off not making a RECOMMENDATION that goes

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Shawn Steele via Unicode
+ the list, which somehow my reply seems to have lost. > I may have missed something, but I think nobody actually proposed to change > the recommendations into requirements No thanks, that would be a breaking change for some implementations (like mine) and force them to become non-complying or

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Asmus Freytag via Unicode
On 5/23/2017 1:24 AM, Martin J. Dürst via Unicode wrote: Hello Mark, On 2017/05/22 01:37, Mark Davis ☕️ via Unicode wrote: I actually didn't see any of this discussion until today. Many thanks for

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Alastair Houghton via Unicode
On 23 May 2017, at 07:10, Jonathan Coxhead via Unicode wrote: > > On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote: >> On 18 May 2017, at 07:18, Henri Sivonen via Unicode >> wrote: >> >>> the decision complicates U+FFFD generation when

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Martin J. Dürst via Unicode
Hello Mark, On 2017/05/22 01:37, Mark Davis ☕️ via Unicode wrote: I actually didn't see any of this discussion until today. Many thanks for chiming in. ( unicode@unicode.org mail was going into my spam folder...) I started reading the thread, but it looks like a lot of it is OT, As is

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Jonathan Coxhead via Unicode
On 18/05/2017 1:58 am, Alastair Houghton via Unicode wrote: On 18 May 2017, at 07:18, Henri Sivonen via Unicode wrote: the decision complicates U+FFFD generation when validating UTF-8 by state machine. It *really* doesn’t. Even if you’re hell bent on using a pure state

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-21 Thread Mark Davis ☕️ via Unicode
I actually didn't see any of this discussion until today. ( unicode@unicode.org mail was going into my spam folder...) I started reading the thread, but it looks like a lot of it is OT, so just scanned some of them. A few brief points: 1. There is plenty of time for public comment, since it

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Richard Wordingham via Unicode
On Thu, 18 May 2017 09:58:43 +0100 Alastair Houghton via Unicode wrote: > On 18 May 2017, at 07:18, Henri Sivonen via Unicode > wrote: > > > > the decision complicates U+FFFD generation when validating UTF-8 by > > state machine. > > It *really*

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode
On 18 May 2017, at 07:18, Henri Sivonen via Unicode wrote: > > the decision complicates U+FFFD generation when validating UTF-8 by state > machine. It *really* doesn’t. Even if you’re hell bent on using a pure state machine approach, you need to add maybe two additional

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Hans Åberg via Unicode
> On 16 May 2017, at 15:21, Richard Wordingham via Unicode > wrote: > > On Tue, 16 May 2017 14:44:44 +0200 > Hans Åberg via Unicode wrote: > >>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode >>> wrote: >> ... >>> I think

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode
On 18 May 2017, at 06:01, Richard Wordingham via Unicode wrote: > > On Thu, 18 May 2017 02:04:55 +0200 > Philippe Verdy via Unicode wrote: > >> I find intriguing that the update intends to enforce the decoding >> of the **shortest** sequences, but

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Alastair Houghton via Unicode
On 18 May 2017, at 01:04, Philippe Verdy via Unicode wrote: > > I find intriguing that the update intends to enforce the decoding of the > **shortest** sequences, but now wants to treat **maximal sequences** as a > single unit with arbitrary length. UTF-8 was designed

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Henri Sivonen via Unicode
On Thu, May 18, 2017 at 2:41 AM, Asmus Freytag via Unicode wrote: > On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: > > There's some sort of rule that proposals should be made seven days in > advance of the meeting. I can't find it now, so I'm not sure whether >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Thu, 18 May 2017 02:04:55 +0200 Philippe Verdy via Unicode wrote: > I find intriguing that the update intends to enforce the decoding > of the **shortest** sequences, but now wants to treat **maximal > sequences** as a single unit with arbitrary length. UTF-8 was >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Richard Wordingham wrote: I'm afraid I don't get the analogy. You can't build a full Unicode system out of Unicode-compliant parts. Others will have to address Richard's point about canonical-equivalent sequences. However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8 (in

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Philippe Verdy via Unicode
I find intriguing that the update intends to enforce the decoding of the **shortest** sequences, but now wants to treat **maximal sequences** as a single unit with arbitrary length. UTF-8 was designed to work only with some state machines that would NEVER need to parse more than 4 bytes. For

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Asmus Freytag via Unicode
On 5/17/2017 2:31 PM, Richard Wordingham via Unicode wrote: There's some sort of rule that proposals should be made seven days in advance of the meeting. I can't find it now, so I'm not sure whether the actual rule was followed, let alone what authority it has.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Wed, 17 May 2017 15:31:56 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > > So it was still a legal way for a non-UTF-8-compliant process! > > Anything is possible if you are non-compliant. You can encode U+263A > with 9,786 FF bytes followed by a

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Richard Wordingham wrote: > So it was still a legal way for a non-UTF-8-compliant process! Anything is possible if you are non-compliant. You can encode U+263A with 9,786 FF bytes followed by a terminating FE byte and call that "UTF-8," if you are willing to be non-compliant enough. > Note for

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Wed, 17 May 2017 13:37:51 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > >> It is not at all clear what the intent of the encoder was - or even > >> if it's not just a problem with the data stream. E0 80 80 is not > >> permitted, it's garbage. An

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Richard Wordingham via Unicode
On Wed, 17 May 2017 13:41:56 -0700 Doug Ewell via Unicode wrote: > Perhaps surprisingly, it's already too late. UTC approved this change > the day after the proposal was written. > > http://www.unicode.org/L2/L2017/17103.htm#151-C19 Approved for Unicode 11.0. Unicode 10.0

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Hans Åberg via Unicode
> On 17 May 2017, at 23:18, Doug Ewell wrote: > > Hans Åberg wrote: > >>> Far from solving the stated problem, it would introduce a new one: >>> conversion from the "bad data" Unicode code points, currently >>> well-defined, would become ambiguous. >> >> Actually not: just

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Hans Åberg wrote: >> Far from solving the stated problem, it would introduce a new one: >> conversion from the "bad data" Unicode code points, currently >> well-defined, would become ambiguous. > > Actually not: just translate the invalid UTF-8 sequences into invalid > UTF-32. Far from solving

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Hans Åberg via Unicode
> On 17 May 2017, at 22:36, Doug Ewell via Unicode wrote: > > Hans Åberg wrote: > >> It would be useful, for use with filesystems, to have Unicode >> codepoint markers that indicate how UTF-8, including non-valid >> sequences, is translated into UTF-32 in a way that the

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Henri Sivonen wrote: > I find it shocking that the Unicode Consortium would change a > widely-implemented part of the standard (regardless of whether Unicode > itself officially designates it as a requirement or suggestion) on > such flimsy grounds. > > I'd like to register my feedback that I

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Richard Wordingham wrote: >> It is not at all clear what the intent of the encoder was - or even >> if it's not just a problem with the data stream. E0 80 80 is not >> permitted, it's garbage. An encoder can't "intend" it. > > It was once a legal way of encoding NUL, just like C0 80, which is >
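For readers skimming the thread, a minimal sketch of why overlong forms such as C0 80 and E0 80 80 are read as NUL by a decoder that only does the bit arithmetic, and why a conforming decoder rejects them. The helper names `naive_decode` and `strictly_valid` are hypothetical and not from any message in the thread; the strict check simply defers to Python's built-in UTF-8 codec.

```python
def naive_decode(seq: bytes) -> int:
    """Combine the payload bits of one UTF-8-shaped sequence.

    Illustration only: no overlong, surrogate, or range checks are made,
    which is exactly what made overlong forms a security problem.
    """
    lead = seq[0]
    if lead < 0x80:
        return lead
    if lead >> 5 == 0b110:        # 110xxxxx: one trail byte expected
        value, trail = lead & 0x1F, 1
    elif lead >> 4 == 0b1110:     # 1110xxxx: two trail bytes expected
        value, trail = lead & 0x0F, 2
    elif lead >> 3 == 0b11110:    # 11110xxx: three trail bytes expected
        value, trail = lead & 0x07, 3
    else:
        raise ValueError("not a lead byte")
    for b in seq[1:1 + trail]:
        value = (value << 6) | (b & 0x3F)
    return value


def strictly_valid(seq: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for seq in (b"\x00", b"\xc0\x80", b"\xe0\x80\x80"):
        print(seq.hex(" "), naive_decode(seq), strictly_valid(seq))
```

All three inputs yield the value 0 under the naive arithmetic, but only the single byte 00 passes the strict check.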

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Doug Ewell via Unicode
Hans Åberg wrote: > It would be useful, for use with filesystems, to have Unicode > codepoint markers that indicate how UTF-8, including non-valid > sequences, is translated into UTF-32 in a way that the original > octet sequence can be restored. I have always argued strongly against this idea,
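As an illustration of the kind of round trip being discussed, and not the mechanism Hans Åberg proposed, Python's standard `surrogateescape` error handler (PEP 383) maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF and restores the original octets on encoding. The function names and the sample file name below are made up for the example.

```python
def to_text(raw: bytes) -> str:
    """Decode UTF-8, smuggling each invalid byte through as U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: escaped surrogates become the original bytes again."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    raw = b"report-\xe0\x80\x80.txt"   # hypothetical file name with ill-formed UTF-8
    text = to_text(raw)
    print([hex(ord(c)) for c in text[7:10]])
    print(to_bytes(text) == raw)
```

The resulting string is not a valid sequence of Unicode scalar values, so this only works inside a process that tolerates lone surrogates.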

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Alastair Houghton via Unicode
> On 16 May 2017, at 20:43, Richard Wordingham via Unicode > wrote: > > On Tue, 16 May 2017 11:36:39 -0700 > Markus Scherer via Unicode wrote: > >> Why do we care how we carve up an illegal sequence into subsequences? >> Only for debugging and visual

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:36 PM, Markus Scherer wrote: > Let me try to address some of the issues raised here. Thank you. > The proposal changes a recommendation, not a requirement. This is a very bad reason in favor of the change. If anything, this should be a reason why

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
Another alternative for your API is to not return simple integer values, but return (read-only) instances of a Char32 class whose "scalar" property would normally be a valid codepoint with scalar value, or whose "string" property will be the actual character; but with another static property

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Faster ok, provided this does not break other uses, notably for random > access within strings… Either way, this is a “recommendation”. I don’t see how that can provide for not-“breaking other uses.” If it’s internal, you can do what you will, so if you need the 1:1 seeming parity, then

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 20:50 GMT+02:00 Shawn Steele : > But why change a recommendation just because it “feels like”. As you > said, it’s just a recommendation, so if that really annoyed someone, they > could do something else (eg: they could use a single FFFD). > > > > If the

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 11:36:39 -0700 Markus Scherer via Unicode wrote: > Why do we care how we carve up an illegal sequence into subsequences? > Only for debugging and visual inspection. Maybe some process is using > illegal, overlong sequences to encode something special (à

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
To: Alastair Houghton <alast...@alastairs-place.net> Cc: Philippe Verdy <verd...@wanadoo.fr>; Henri Sivonen <hsivo...@hsivonen.fi>; unicode Unicode Discussion <unicode@unicode.org>; Hans Åberg <haber...@telia.com> Subject: Re: Feedback on the proposal to change U+FFFD generation

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 19:36, Markus Scherer wrote: > > Let me try to address some of the issues raised here. Thanks for jumping in. The one thing I wanted to ask about was the “without ever restricting trail bytes to less than 80..BF”. I think that could be misinterpreted;

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Markus Scherer via Unicode
Let me try to address some of the issues raised here. The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 16 May 2017, at 20:01, Philippe Verdy wrote: > > On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random > sequences of 16-bit code units are not permitted. There's visibly a > validation step that returns an error if you attempt to create files with

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode : > C) The data was corrupted by some other means. Perhaps bad > concatenations, lost blocks during read/transmission, etc. If we lost 2 > 512 byte blocks, then maybe we should have a thousand FFFDs (but how would > we

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Asmus Freytag via Unicode
On 5/16/2017 10:30 AM, Shawn Steele via Unicode wrote: Would you advocate replacing e0 80 80 with U+FFFD U+FFFD U+FFFD (1)

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
to:unicode-boun...@unicode.org] On Behalf Of Richard Wordingham via Unicode Sent: Tuesday, May 16, 2017 10:58 AM To: unicode@unicode.org Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8 On Tue, 16 May 2017 17:30:01 + Shawn Steele via Unicod

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random sequences of 16-bit code units are not permitted. There's visibly a validation step that returns an error if you attempt to create files with invalid sequences (including other restrictions such as forbidding U+ and some

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 17:30:01 + Shawn Steele via Unicode wrote: > > Would you advocate replacing > > > e0 80 80 > > > with > > > U+FFFD U+FFFD U+FFFD (1) > > > rather than > > > U+FFFD (2) > > > It’s pretty clear what the

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Would you advocate replacing > e0 80 80 > with > U+FFFD U+FFFD U+FFFD (1) > rather than > U+FFFD (2) > It’s pretty clear what the intent of the encoder was there, I’d say, and > while we certainly don’t > want to decode it as a NUL (that was the source of
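The two outcomes being compared can be reproduced with a short sketch. Python's built-in decoder with `errors="replace"` gives outcome (1) for e0 80 80. The function `decode_one_fffd_per_structural_sequence` is a hypothetical, simplified reading of outcome (2), where a lead byte plus the trail bytes it structurally calls for is replaced as one unit; it is not the wording of L2/17-168 nor any library's actual implementation.

```python
REPLACEMENT = "\ufffd"


def decode_one_fffd_per_structural_sequence(data: bytes) -> str:
    """Sketch of outcome (2): one U+FFFD per lead byte plus its 80..BF trail bytes."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII passes through
            out.append(chr(lead))
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF7:
            need = 3
        else:                                # stray trail byte or F8..FF
            out.append(REPLACEMENT)
            i += 1
            continue
        # Take up to `need` trail bytes, checking only that each is in 80..BF.
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            j += 1
        chunk = data[i:j]
        try:
            out.append(chunk.decode("utf-8"))   # well-formed: keep the character
        except UnicodeDecodeError:
            out.append(REPLACEMENT)             # ill-formed: one U+FFFD for the chunk
        i = j
    return "".join(out)


if __name__ == "__main__":
    bad = b"\xe0\x80\x80"
    print(len(bad.decode("utf-8", errors="replace")))         # outcome (1)
    print(len(decode_one_fffd_per_structural_sequence(bad)))  # outcome (2)
```

Both approaches decode well-formed input identically; they differ only in how many U+FFFD characters an ill-formed run produces.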

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 16 May 2017, at 18:38, Alastair Houghton > wrote: > > On 16 May 2017, at 17:23, Hans Åberg wrote: >> >> HFS implements case insensitivity in a layer above the filesystem raw >> functions. So it is perfectly possible to have files that

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 17:23, Hans Åberg wrote: > > HFS implements case insensitivity in a layer above the filesystem raw > functions. So it is perfectly possible to have files that differ by case only > in the same directory by using low level function calls. The Tenon MachTen
