Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 16 May 2017, at 15:00, Philippe Verdy wrote: > > 2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode : > > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > > wrote: > ... > > I think Unicode should not adopt the proposed

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 16 May 2017, at 17:30, Alastair Houghton via Unicode > wrote: > > On 16 May 2017, at 14:23, Hans Åberg via Unicode wrote: >> >> You don't. You have a filename, which is an octet sequence of unknown >> encoding, and want to deal with it.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 15:23 GMT+02:00 Hans Åberg : > All current filesystems, as far as experts could recall, use octet > sequences at the lowest level; whatever encoding is used is built in a > layer above > Not NTFS (on Windows) which uses sequences of 16-bit units. Same about

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 14:23, Hans Åberg via Unicode wrote: > > You don't. You have a filename, which is an octet sequence of unknown > encoding, and want to deal with it. Therefore, valid Unicode transformations > of the filename may result in it no longer being reachable. >
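
To illustrate the reachability concern, here is a minimal Python sketch (not taken from the thread). It assumes a filename arrives as raw bytes of unknown encoding and compares two decoding strategies: replacing ill-formed bytes with U+FFFD, which loses the original octets, and Python's `surrogateescape` error handler, which preserves them. The helper names `lossy` and `roundtrip` are hypothetical.

```python
# Sketch: why replacing ill-formed bytes with U+FFFD can make a file unreachable.
# The helper names are hypothetical; the codec error handlers are standard Python.

def lossy(raw: bytes) -> bytes:
    """Decode with U+FFFD substitution, then re-encode as UTF-8."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")

def roundtrip(raw: bytes) -> bytes:
    """Decode with surrogateescape, then re-encode with the same handler."""
    name = raw.decode("utf-8", errors="surrogateescape")
    return name.encode("utf-8", errors="surrogateescape")

if __name__ == "__main__":
    raw_name = b"caf\xe9.txt"  # Latin-1 bytes, ill-formed as UTF-8
    print(lossy(raw_name) == raw_name)      # the original octets are lost
    print(roundtrip(raw_name) == raw_name)  # the original octets survive
```

After the lossy path the program no longer holds the octet sequence the filesystem stored, so it cannot open the file by name; the escaping path keeps the name usable at the cost of a string that is not valid Unicode text.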

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 16:44, Hans Åberg wrote: > > On 16 May 2017, at 17:30, Alastair Houghton via Unicode > wrote: >> >> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on >> UCS-2/UTF-16. ... > > The filesystem directory is

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode : > > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to have Unicode

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 20:08:52 +0900 "Martin J. Dürst via Unicode" wrote: > I agree with others that ICU should not be considered to have a > special status, it should be just one implementation among others. > [The next point is a side issue, please don't spend too much time

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 16 May 2017, at 17:52, Alastair Houghton > wrote: > > On 16 May 2017, at 16:44, Hans Åberg wrote: >> >> On 16 May 2017, at 17:30, Alastair Houghton via Unicode >> wrote: >>> >>> HFS(+), NTFS and VFAT long

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 17:07, Hans Åberg wrote: > HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... >>> >>> The filesystem directory is using octet sequences and does not bother >>> passing over an encoding, I am told.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 16 May 2017, at 18:13, Alastair Houghton > wrote: > > On 16 May 2017, at 17:07, Hans Åberg wrote: >> > HFS(+), NTFS and VFAT long filenames are all encoded in some variation on > UCS-2/UTF-16. ... The filesystem

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 17:23, Hans Åberg wrote: > > HFS implements case insensitivity in a layer above the filesystem raw > functions. So it is perfectly possible to have files that differ by case only > in the same directory by using low level function calls. The Tenon MachTen

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Would you advocate replacing > e0 80 80 > with > U+FFFD U+FFFD U+FFFD (1) > rather than > U+FFFD (2) > It’s pretty clear what the intent of the encoder was there, I’d say, and > while we certainly don’t > want to decode it as a NUL (that was the source of
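
For readers who want to see which of the two options a given decoder picks, a small Python sketch follows (an illustration, not part of the original message). It counts the U+FFFD characters produced when decoding with `errors="replace"`. The function name `count_replacements` is hypothetical, and the result reflects only the decoder it runs on; to my understanding CPython 3 follows option (1) for `e0 80 80`.

```python
# Sketch: count the U+FFFD characters one decoder emits for ill-formed UTF-8.
# count_replacements is a hypothetical helper name used for illustration.

REPLACEMENT = "\ufffd"

def count_replacements(data: bytes) -> int:
    """Decode with errors='replace' and count the U+FFFD characters produced."""
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)

if __name__ == "__main__":
    for sample in (b"\xe0\x80\x80", b"\x80\x80\x80", b"a\xe0\x80\x80b"):
        print(sample.hex(), "->", count_replacements(sample))
```

Running the same inputs through other decoders (ICU, browsers, other language runtimes) is a quick way to compare how each one carves up an ill-formed sequence.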

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode : > C) The data was corrupted by some other means. Perhaps bad > concatenations, lost blocks during read/transmission, etc. If we lost 2 > 512 byte blocks, then maybe we should have a thousand FFFDs (but how would > we

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Asmus Freytag via Unicode
On 5/16/2017 10:30 AM, Shawn Steele via Unicode wrote: Would you advocate replacing e0 80 80 with U+FFFD U+FFFD U+FFFD (1)

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 16 May 2017, at 18:38, Alastair Houghton > wrote: > > On 16 May 2017, at 17:23, Hans Åberg wrote: >> >> HFS implements case insensitivity in a layer above the filesystem raw >> functions. So it is perfectly possible to have files that

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 17:30:01 + Shawn Steele via Unicode wrote: > > Would you advocate replacing > > > e0 80 80 > > > with > > > U+FFFD U+FFFD U+FFFD (1) > > > rather than > > > U+FFFD (2) > > > It’s pretty clear what the

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
Regardless, it's not legal and hasn't been legal for quite some time. Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to anything depending on that fake-null, so one or three isn't really going to matter.

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random sequences of 16-bit code units are not permitted. There's visibly a validation step that returns an error if you attempt to create files with invalid sequences (including other restrictions such as forbidding U+ and some

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Martin J. Dürst via Unicode
Hello everybody, [using this mail to in effect reply to different mails in the thread] On 2017/05/16 17:31, Henri Sivonen via Unicode wrote: On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote: Under what circumstance would it matter how many U+FFFDs you see?

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode : > > One additional note: the standard codifies this behaviour as a > *recommendation*, not a requirement. > > This is an odd argument in favor of changing it. If the argument is > that it's just a recommendation that you

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote: > but I think the way he raises this point is needlessly antagonistic. I apologize. My level of dismay at the proposal's ICU-centricity overcame me. On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
> On 16 May 2017, at 09:18, David Starner wrote: > > On Tue, May 16, 2017 at 12:42 AM Alastair Houghton > wrote: >> If you’re about to mutter something about security, consider this: security >> code *should* refuse to compare strings that

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote: > I’m not sure how the discussion of “which is better” relates to the > discussion of ill-formed UTF-8 at all. Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 6:23 AM, Karl Williamson wrote: > On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote: >> >> In reference to: >> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf >> >> I think Unicode should not adopt the proposed change. >> >> The

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton wrote: > On 16 May 2017, at 09:31, Henri Sivonen via Unicode > wrote: >> >> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton >> wrote: >>> That would be true

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
> > The proposal actually does cover things that aren’t structurally valid, > like your e0 e0 e0 example, which it suggests should be a single U+FFFD > because the initial e0 denotes a three byte sequence, and your 80 80 80 > example, which it proposes should constitute three illegal subsequences

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread David Starner via Unicode
On Tue, May 16, 2017 at 1:45 AM Alastair Houghton < alast...@alastairs-place.net> wrote: > That’s true anyway; imagine the database holds raw bytes, that just happen > to decode to U+FFFD. There might seem to be *two* names that both contain > U+FFFD in the same place. How do you distinguish

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
> On 16 May 2017, at 10:29, David Starner wrote: > > On Tue, May 16, 2017 at 1:45 AM Alastair Houghton > wrote: > That’s true anyway; imagine the database holds raw bytes, that just happen to > decode to U+FFFD. There might seem to be

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote: > > On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton > wrote: >> That would be true if the in-memory representation had any effect on what >> we’re talking about, but it really

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Henri Sivonen via Unicode
On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen wrote: > Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick > test with three major browsers that use UTF-16 internally and have > independent (of each other) implementations of UTF-8 decoding > (Firefox, Edge

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Asmus Freytag via Unicode
On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote: On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote: I’m not sure how the discussion of “which is better” relates to the discussion of ill-formed UTF-8 at all. Clearly, the "which is better" issue is

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 08:22, Asmus Freytag via Unicode wrote: > I therefore think that Henri has a point when he's concerned about tacit > assumptions favoring one memory representation over another, but I think the > way he raises this point is needlessly antagonistic. That

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread J Decker via Unicode
On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode < unicode@unicode.org> wrote: > On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode > wrote: > > I’m not sure how the discussion of “which is better” relates to the > > discussion of ill-formed UTF-8 at all. >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread David Starner via Unicode
On Tue, May 16, 2017 at 12:42 AM Alastair Houghton < alast...@alastairs-place.net> wrote: > If you’re about to mutter something about security, consider this: > security code *should* refuse to compare strings that contain U+FFFD (or at > least should never treat them as equal, even to
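
The comparison rule quoted here can be sketched in a few lines of Python. This is only an illustration under the assumption that both inputs have already been decoded to strings; the function name `safe_equal` is hypothetical and does not come from the thread.

```python
# Sketch: a comparison that never treats strings containing U+FFFD as equal.
# safe_equal is a hypothetical name; it illustrates the rule quoted above.

REPLACEMENT = "\ufffd"

def safe_equal(a: str, b: str) -> bool:
    """Return True only if both strings are free of U+FFFD and are identical."""
    if REPLACEMENT in a or REPLACEMENT in b:
        # A replacement character means some original data was lost,
        # so two such strings may stand for different byte sequences.
        return False
    return a == b

if __name__ == "__main__":
    print(safe_equal("admin", "admin"))
    print(safe_equal("adm\ufffdn", "adm\ufffdn"))
```

With a rule like this in place, the number of U+FFFD characters a decoder emits for a given ill-formed sequence cannot change the outcome of the comparison.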

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 15 May 2017, at 23:16, Shawn Steele via Unicode wrote: > > I’m not sure how the discussion of “which is better” relates to the > discussion of ill-formed UTF-8 at all. It doesn’t, which is a point I made in my original reply to Henri. The only reason I answered his

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 15 May 2017, at 23:43, Richard Wordingham via Unicode wrote: > > The problem with surrogates is inadequate testing. They're sufficiently > rare for many users that it may be a long time before an error is > discovered. It's not always obvious that code is designed for

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 10:01:03 +0300 Henri Sivonen via Unicode wrote: > Even so, I think even changing a recommendation of "best practice" > needs way better rationale than "feels right" or "ICU already does it" > when a) major browsers (which operate in the most prominent >

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 14:44:44 +0200 Hans Åberg via Unicode wrote: > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > > wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Markus Scherer via Unicode
Let me try to address some of the issues raised here. The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
But why change a recommendation just because it “feels like”. As you said, it’s just a recommendation, so if that really annoyed someone, they could do something else (eg: they could use a single FFFD). If the recommendation is truly that meaningless or arbitrary, then we just get into silly

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 16 May 2017, at 20:01, Philippe Verdy wrote: > > On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random > sequences of 16-bit code units are not permitted. There's visibly a > validation step that returns an error if you attempt to create files with

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Richard Wordingham via Unicode
On Tue, 16 May 2017 11:36:39 -0700 Markus Scherer via Unicode wrote: > Why do we care how we carve up an illegal sequence into subsequences? > Only for debugging and visual inspection. Maybe some process is using > illegal, overlong sequences to encode something special (à

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Alastair Houghton via Unicode
On 16 May 2017, at 19:36, Markus Scherer wrote: > > Let me try to address some of the issues raised here. Thanks for jumping in. The one thing I wanted to ask about was the “without ever restricting trail bytes to less than 80..BF”. I think that could be misinterpreted;

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode
> On 15 May 2017, at 12:21, Henri Sivonen via Unicode > wrote: ... > I think Unicode should not adopt the proposed change. It would be useful, for use with filesystems, to have Unicode codepoint markers that indicate how UTF-8, including non-valid sequences, is translated

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
Another alternative for your API is to not return simple integer values, but return (read-only) instances of a Char32 class whose "scalar" property would normally be a valid codepoint with scalar value, or whose "string" property will be the actual character; but with another static property

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Philippe Verdy via Unicode
2017-05-16 20:50 GMT+02:00 Shawn Steele : > But why change a recommendation just because it “feels like”. As you > said, it’s just a recommendation, so if that really annoyed someone, they > could do something else (eg: they could use a single FFFD). > > > > If the

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Shawn Steele via Unicode
> Faster ok, provided this does not break other uses, notably for random > access within strings… Either way, this is a “recommendation”. I don’t see how that can provide for not-“breaking other uses.” If it’s internal, you can do what you will, so if you need the 1:1 seeming parity, then