Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Another alternative for your API is to not return simple integer values, but (read-only) instances of a Char32 class: its "scalar" property would normally be a valid code point with a scalar value, its "string" property would be the actual character, and another property "isValidScalar" would return true. For ill-formed sequences, "isValidScalar" would be false, the scalar value would be the initial code unit from the input (decoded from the internal representation in the backing store), and the "string" property would be empty. You may also add a special static "Char32" instance representing end-of-file/end-of-string, whose "isEOF" property would be true, whose scalar would typically be -1, whose "isValidScalar" would be false, and whose "string" property would be the empty string. All this is possible independently of the internal representation used in the backing store for its own code units (where it may use any extension of the standard UTFs or any data compression scheme without exposing it).

2017-05-16 23:08 GMT+02:00 Philippe Verdy:

> 2017-05-16 20:50 GMT+02:00 Shawn Steele :
>
>> But why change a recommendation just because it “feels like”. As you
>> said, it’s just a recommendation, so if that really annoyed someone, they
>> could do something else (e.g. they could use a single FFFD).
>>
>> If the recommendation is truly that meaningless or arbitrary, then we
>> just get into silly discussions of “better” that nobody can really answer.
>>
>> Alternatively, how about “one or more FFFDs?” for the recommendation?
>>
>> To me it feels very odd to perhaps require writing extra code to detect
>> an illegal case. The “best practice” here should maybe be “one or more
>> FFFDs, whatever makes your code faster”.
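[Editor's note: the Char32 wrapper proposed at the top of this message might be sketched as below. The class and property names (scalar, string, isValidScalar/is_valid_scalar, isEOF/is_eof) come from the email itself and are hypothetical, not any real library's API.]

```python
class Char32:
    """Read-only wrapper for one decoded code unit, per the proposal above."""
    __slots__ = ("scalar", "is_valid_scalar", "is_eof")

    def __init__(self, scalar, valid, eof=False):
        self.scalar = scalar            # scalar value, raw code unit, or -1 for EOF
        self.is_valid_scalar = valid
        self.is_eof = eof

    @property
    def string(self):
        # The actual character for valid scalars; empty otherwise.
        return chr(self.scalar) if self.is_valid_scalar else ""

    @classmethod
    def from_scalar(cls, cp):
        # Valid scalar values: 0..0x10FFFF excluding the surrogate range.
        valid = 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
        return cls(cp, valid)

# Special static instance for end-of-file/end-of-string, as suggested.
EOF = Char32(-1, valid=False, eof=True)
```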
> Faster, OK, provided this does not break other uses, notably random
> access within strings, where UTF-8 is designed to allow searching backward
> over a limited number of bytes (maximum 3) in order to find the leading
> byte, and then check its value:
>
> - if it's not found, return to the initial position and make the next
> access return U+FFFD to signal the positioning error: this trailing byte is
> part of an ill-formed sequence, and for coherence, any further trailing
> bytes found after it will **also** return U+FFFD (because these other
> trailing bytes may also be reached by random access).
> - if the leading byte is found backward but does not match the expected
> number of trailing bytes after it, return to the initial random position,
> where you'll also return U+FFFD. This means that the initial leading byte
> (part of the ill-formed sequence) must also return a separate U+FFFD,
> given that each following trailing byte will return U+FFFD in isolation
> when accessed.
>
> If we want decoding that is coherent with text-handling primitives
> allowing random access into encoded sequences, there's no other choice
> than treating EACH byte of the ill-formed sequence as an individual error
> mapped to the same replacement code point (U+FFFD if that is what is
> chosen, but these APIs could as well specify another replacement
> character, or could even return a non-code point if the API return value
> is not restricted to valid code points: for example, the replacement could
> be a negative value whose absolute value matches the invalid code unit, or
> some other invalid code unit outside the valid range for code points with
> scalar values. Isolated surrogates in UTF-16, for example, could be
> returned as-is, or made negative either by returning their opposite or by
> setting (or'ing) the most significant bit of the return value).
> The problem will arise when you need to store the replacement values if
> the internal backing store is limited to 16-bit or 8-bit code units: this
> internal backing store may use its own internal extension of the standard
> UTFs, including the possibility of encoding NULs as C0 80 (like what Java
> does with its "modified UTF-8" internal encoding used in its compiled
> binary classes and serializations), or internally using isolated trailing
> surrogates to store ill-formed UTF-8 input by or'ing these bytes with
> 0xDC00, which will then be returned as code points with no valid scalar
> value. For internally representing ill-formed UTF-16 sequences, there's no
> need to change anything. For internally representing ill-formed UTF-32
> sequences (in fact limited to one 32-bit code unit), with a 16-bit
> internal backing store you may need to store three 16-bit values (three
> isolated trailing surrogates). For internally representing ill-formed
> UTF-32 in an 8-bit backing store, you could use 0xC1 followed by five
> trailing bytes (each one storing 7 bits of the initial ill-formed code
> unit from the UTF-32 input).
>
> What you'll do in the internal backing store will not be exposed by your
> API, which will just return either valid code points with valid scalar
> values, or values outside the two valid subranges.
RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> Faster, OK, provided this does not break other uses, notably random
> access within strings…

Either way, this is a “recommendation”. I don’t see how that can provide for not-“breaking other uses.” If it’s internal, you can do what you will, so if you need the 1:1 seeming parity, then you can do that internally. But if you’re depending on other APIs/libraries/data sources/whatever, it would seem like you couldn’t count on that. (And probably shouldn’t, even if it was a requirement rather than a recommendation.)

I’m wary of the idea of attempting random access on a stream that is also being manipulated at the same time (decoding, apparently). The U+FFFD emitted by this decoding could also require a different # of bytes to re-encode, which might disrupt the presumed parity, depending on how the data access was being handled.

-Shawn
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-16 20:50 GMT+02:00 Shawn Steele:

> But why change a recommendation just because it “feels like”. As you
> said, it’s just a recommendation, so if that really annoyed someone, they
> could do something else (e.g. they could use a single FFFD).
>
> If the recommendation is truly that meaningless or arbitrary, then we just
> get into silly discussions of “better” that nobody can really answer.
>
> Alternatively, how about “one or more FFFDs?” for the recommendation?
>
> To me it feels very odd to perhaps require writing extra code to detect an
> illegal case. The “best practice” here should maybe be “one or more FFFDs,
> whatever makes your code faster”.

Faster, OK, provided this does not break other uses, notably random access within strings, where UTF-8 is designed to allow searching backward over a limited number of bytes (maximum 3) in order to find the leading byte, and then check its value:

- if it's not found, return to the initial position and make the next access return U+FFFD to signal the positioning error: this trailing byte is part of an ill-formed sequence, and for coherence, any further trailing bytes found after it will **also** return U+FFFD (because these other trailing bytes may also be reached by random access).
- if the leading byte is found backward but does not match the expected number of trailing bytes after it, return to the initial random position, where you'll also return U+FFFD. This means that the initial leading byte (part of the ill-formed sequence) must also return a separate U+FFFD, given that each following trailing byte will return U+FFFD in isolation when accessed.
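[Editor's note: the backward-scan scheme described above can be sketched as follows. This is an illustrative sketch, not quoted from any implementation; every name here (`scalar_at`, `_seq_len`) is hypothetical. Each byte of an ill-formed sequence reads back as U+FFFD, as the email argues.]

```python
REPLACEMENT = 0xFFFD

def _seq_len(lead: int) -> int:
    """Expected sequence length for a lead byte, or 0 if not a valid lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # trail bytes 80..BF and invalid leads C0/C1, F5..FF

def scalar_at(buf: bytes, i: int) -> int:
    """Code point whose encoding covers byte offset i, or U+FFFD if ill-formed."""
    # Scan backward at most 3 bytes for a candidate lead byte.
    start = i
    while start > 0 and i - start < 3 and 0x80 <= buf[start] <= 0xBF:
        start -= 1
    n = _seq_len(buf[start])
    if n == 0 or start + n <= i or start + n > len(buf):
        return REPLACEMENT  # no lead found, or its sequence doesn't cover byte i
    try:
        return ord(buf[start:start + n].decode("utf-8"))
    except UnicodeDecodeError:
        return REPLACEMENT  # overlong, surrogate, or bad trail bytes
```

Note how every byte of an ill-formed sequence like E0 80 80 yields its own U+FFFD, so random access is coherent no matter which byte you land on.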
If we want decoding that is coherent with text-handling primitives allowing random access into encoded sequences, there's no other choice than treating EACH byte of the ill-formed sequence as an individual error mapped to the same replacement code point (U+FFFD if that is what is chosen, but these APIs could as well specify another replacement character, or could even return a non-code point if the API return value is not restricted to valid code points: for example, the replacement could be a negative value whose absolute value matches the invalid code unit, or some other invalid code unit outside the valid range for code points with scalar values. Isolated surrogates in UTF-16, for example, could be returned as-is, or made negative either by returning their opposite or by setting (or'ing) the most significant bit of the return value).

The problem will arise when you need to store the replacement values if the internal backing store is limited to 16-bit or 8-bit code units: this internal backing store may use its own internal extension of the standard UTFs, including the possibility of encoding NULs as C0 80 (like what Java does with its "modified UTF-8" internal encoding used in its compiled binary classes and serializations), or internally using isolated trailing surrogates to store ill-formed UTF-8 input by or'ing these bytes with 0xDC00, which will then be returned as code points with no valid scalar value. For internally representing ill-formed UTF-16 sequences, there's no need to change anything. For internally representing ill-formed UTF-32 sequences (in fact limited to one 32-bit code unit), with a 16-bit internal backing store you may need to store three 16-bit values (three isolated trailing surrogates). For internally representing ill-formed UTF-32 in an 8-bit backing store, you could use 0xC1 followed by five trailing bytes (each one storing 7 bits of the initial ill-formed code unit from the UTF-32 input).
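[Editor's note: the "or with 0xDC00" scheme described above is not hypothetical; Python's built-in "surrogateescape" error handler (PEP 383) implements essentially this: each ill-formed input byte is stored as a lone trailing surrogate, and the original bytes are recovered on re-encode.]

```python
raw = b"abc\xff\xfe"                      # 0xFF and 0xFE can never appear in UTF-8
text = raw.decode("utf-8", errors="surrogateescape")

# Each bad byte b becomes the lone trailing surrogate 0xDC00 | b.
assert text == "abc\udcff\udcfe"

# The round trip is lossless: re-encoding restores the original bytes.
assert text.encode("utf-8", errors="surrogateescape") == raw
```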
What you'll do in the internal backing store will not be exposed by your API, which will just return either valid code points with valid scalar values, or values outside the two valid subranges (so possibly negative values, or isolated trailing surrogates). That backing store can also substitute some valid input that causes problems (such as NULs) using 0xC0 plus another byte, that sequence being unexposed by your API, which will still be able to return the expected code points (with the minor caveat that the total number of returned code points will not match the actual size allocated for the internal backing store, which applications using the API won't even need to know about). In other words: any private extensions are possible internally, but they can be isolated behind a black-box API which is still free to choose how to represent the input text (it may as well use a zlib-compressed backing store, or some stateless Huffman compression based on a static statistics table configured and stored elsewhere, initialized when you first instantiate the API).
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, 16 May 2017 11:36:39 -0700, Markus Scherer via Unicode wrote:

> Why do we care how we carve up an illegal sequence into subsequences?
> Only for debugging and visual inspection. Maybe some process is using
> illegal, overlong sequences to encode something special (à la Java
> string serialization, "modified UTF-8"), and for that it might be
> convenient too to treat overlong sequences as single errors.

I think that's not quite true. If we are moving back and forth through a buffer containing corrupt text, we need to make sure that moving three characters forward and then three characters back leaves us where we started. That requires internal consistency.

One possible issue is with text input methods that access an application's backing store. They can issue updates in the form of 'delete 3 characters and insert ...'. However, if the input method is accessing characters it hasn't written, it's probably misbehaving anyway. Such commands do rather heavily assume that any relevant normalisation by the application will be taken into account by the input method.

I once had a go at fixing an application that was misinterpreting 'delete x characters' as 'delete x UTF-16 code units'. It was a horrible mess, as the application's interface layer couldn't peek at the string being edited.

Richard.
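[Editor's note: the consistency requirement described above can be made concrete. With a per-byte U+FFFD convention, "next" and "previous" agree, so stepping three characters forward and three back returns to the starting byte offset even inside corrupt text. This is an illustrative sketch; the helper names are hypothetical.]

```python
def next_pos(buf: bytes, i: int) -> int:
    """Byte offset just past the character starting at offset i."""
    b = buf[i]
    if b < 0x80:
        n = 1
    elif 0xC2 <= b <= 0xDF:
        n = 2
    elif 0xE0 <= b <= 0xEF:
        n = 3
    elif 0xF0 <= b <= 0xF4:
        n = 4
    else:
        return i + 1                   # invalid lead or trail byte: one U+FFFD each
    try:
        buf[i:i + n].decode("utf-8")
        return i + n                   # well-formed sequence of n bytes
    except UnicodeDecodeError:
        return i + 1                   # ill-formed: consume a single byte

def prev_pos(buf: bytes, i: int) -> int:
    """Byte offset of the character before offset i (scan back, re-check forward)."""
    for start in range(max(0, i - 4), i):
        if next_pos(buf, start) == i:
            return start
    return i - 1                       # no sequence ends exactly at i

# 'A', then three ill-formed bytes (each one U+FFFD), then a valid 'é'.
buf = b"A\xe0\x80\x80\xc3\xa9"
p = 1
for _ in range(3):
    p = next_pos(buf, p)               # forward three characters
for _ in range(3):
    p = prev_pos(buf, p)               # and back three
assert p == 1                          # we end exactly where we started
```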
RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
But why change a recommendation just because it “feels like”. As you said, it’s just a recommendation, so if that really annoyed someone, they could do something else (e.g. they could use a single FFFD).

If the recommendation is truly that meaningless or arbitrary, then we just get into silly discussions of “better” that nobody can really answer.

Alternatively, how about “one or more FFFDs?” for the recommendation?

To me it feels very odd to perhaps require writing extra code to detect an illegal case. The “best practice” here should maybe be “one or more FFFDs, whatever makes your code faster”. Best practices may not be requirements, but people will still take time to file bugs that something isn’t following a “best practice”.

-Shawn

From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Markus Scherer via Unicode
Sent: Tuesday, May 16, 2017 11:37 AM
To: Alastair Houghton
Cc: Philippe Verdy; Henri Sivonen; unicode Unicode Discussion; Hans Åberg
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the section "Constraints on Conversion Processes". Otherwise, what you do with illegal sequences is a matter of what you think makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU team. At the time, I believe the ISO UTF-8 definition was not yet limited to U+10FFFF, and decoding overlong sequences and those yielding surrogate code points was regarded as a misdemeanor. The spec has been tightened up, but I am pretty sure that most people familiar with how UTF-8 came about would recognize such overlong sequences as single sequences.
I believe that the discussion of how to handle illegal sequences came out of security issues a few years ago, from some implementations including valid single and lead bytes with preceding illegal sequences. Beyond the "Constraints on Conversion Processes", there was evidently also a desire to recommend how to handle illegal sequences.

I think that the current recommendation was an extrapolation of common practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, but "it feels like" (yes, that's the level of argument for stuff that doesn't really matter) not treating such sequences as single units is "weird".

Why do we care how we carve up an illegal sequence into subsequences? Only for debugging and visual inspection. Maybe some process is using illegal, overlong sequences to encode something special (à la Java string serialization, "modified UTF-8"), and for that it might be convenient, too, to treat overlong sequences as single errors.

If you don't like some recommendation, then do something else. It does not matter. If you don't reject the whole input but instead choose to replace illegal sequences with something, then make sure the something is not nothing -- replacing with an empty string can cause security issues. Otherwise, what the something is, or how many of them you put in, is not very relevant. One or more U+FFFDs is customary.

When the current recommendation came in, I thought it was reasonable but didn't like the edge cases. At the time, I didn't think it was important to twiddle with the text in the standard, and I didn't care that ICU didn't exactly implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence with a space, because it's easier than writing a U+FFFD for each byte or for some subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long illegal subsequence; that's fine, too.
Karl Williamson sent feedback to the UTC: "In short, I believe the best practices are wrong." I think "wrong" is far too strong, but I got an action item to propose a change in the text. I proposed a modified recommendation. Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" that was "right".

None of this is motivated by which UTF is used internally. It is true that it takes a tiny bit more thought and work to recognize a wider set of sequences, but a capable implementer will optimize successfully for valid sequences, and maybe even for a subset of those for what might be expected high-frequency code point ranges. Error handling can go into a slow path. In a true state-table implementation, it will require more states but should not affect the performance of valid sequences. Many years ago, I decided for ICU to add a small amount of slow-path error-handling code for more human-friendly illegal-sequence reporting.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 19:36, Markus Scherer wrote:

> Let me try to address some of the issues raised here.

Thanks for jumping in. The one thing I wanted to ask about was the “without ever restricting trail bytes to less than 80..BF”. I think that could be misinterpreted; having thought about it some more, I think you mean “considering any trailing byte in the range 80..BF as valid”. The “less than” threw me the first few times I read it, and I started thinking you meant allowing any byte as a trailing byte, which is clearly not right.

Otherwise, I’m happy :-)

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Let me try to address some of the issues raised here.

The proposal changes a recommendation, not a requirement. Conformance applies to finding and interpreting valid sequences properly. This includes not consuming parts of valid sequences when dealing with illegal ones, as explained in the section "Constraints on Conversion Processes". Otherwise, what you do with illegal sequences is a matter of what you think makes sense -- a matter of opinion and convenience. Nothing more.

I wrote my first UTF-8 handling code some 18 years ago, before joining the ICU team. At the time, I believe the ISO UTF-8 definition was not yet limited to U+10FFFF, and decoding overlong sequences and those yielding surrogate code points was regarded as a misdemeanor. The spec has been tightened up, but I am pretty sure that most people familiar with how UTF-8 came about would recognize such overlong sequences as single sequences.

I believe that the discussion of how to handle illegal sequences came out of security issues a few years ago, from some implementations including valid single and lead bytes with preceding illegal sequences. Beyond the "Constraints on Conversion Processes", there was evidently also a desire to recommend how to handle illegal sequences. I think that the current recommendation was an extrapolation of common practice for non-UTF encodings, such as Shift-JIS or GB 18030. It's ok for UTF-8, too, but "it feels like" (yes, that's the level of argument for stuff that doesn't really matter) not treating such sequences as single units is "weird".

Why do we care how we carve up an illegal sequence into subsequences? Only for debugging and visual inspection. Maybe some process is using illegal, overlong sequences to encode something special (à la Java string serialization, "modified UTF-8"), and for that it might be convenient, too, to treat overlong sequences as single errors.

If you don't like some recommendation, then do something else. It does not matter.
If you don't reject the whole input but instead choose to replace illegal sequences with something, then make sure the something is not nothing -- replacing with an empty string can cause security issues. Otherwise, what the something is, or how many of them you put in, is not very relevant. One or more U+FFFDs is customary.

When the current recommendation came in, I thought it was reasonable but didn't like the edge cases. At the time, I didn't think it was important to twiddle with the text in the standard, and I didn't care that ICU didn't exactly implement that particular recommendation.

I have seen implementations that clobber every byte in an illegal sequence with a space, because it's easier than writing a U+FFFD for each byte or for some subsequences. Fine. Someone might write a single U+FFFD for an arbitrarily long illegal subsequence; that's fine, too.

Karl Williamson sent feedback to the UTC: "In short, I believe the best practices are wrong." I think "wrong" is far too strong, but I got an action item to propose a change in the text. I proposed a modified recommendation. Nothing gets elevated to "right" that wasn't, nothing gets demoted to "wrong" that was "right".

None of this is motivated by which UTF is used internally. It is true that it takes a tiny bit more thought and work to recognize a wider set of sequences, but a capable implementer will optimize successfully for valid sequences, and maybe even for a subset of those for what might be expected high-frequency code point ranges. Error handling can go into a slow path. In a true state-table implementation, it will require more states but should not affect the performance of valid sequences. Many years ago, I decided for ICU to add a small amount of slow-path error-handling code for more human-friendly illegal-sequence reporting. In other words, this was not done out of convenience; it was an inconvenience that seemed justified by nicer error reporting. If you don't like to do so, then don't.
Which UTF is better? It depends. They all have advantages and problems. It's all Unicode, so it's all good. ICU largely uses UTF-16 but also UTF-8. It has data structures and code for charset conversion, property lookup, sets of characters (UnicodeSet), and collation that are co-optimized for both UTF-16 and UTF-8. It has a slowly growing set of APIs working directly with UTF-8. So, please take a deep breath. No conformance requirement is being touched, no one is forced to do something they don't like, no special consideration is given for one UTF over another. Best regards, markus
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 20:01, Philippe Verdy wrote:
>
> On Windows NTFS (and the LFN extension of FAT32 and exFAT) at least,
> random sequences of 16-bit code units are not permitted. There's visibly
> a validation step that returns an error if you attempt to create files
> with invalid sequences (including other restrictions such as forbidding
> U+0000 and some other problematic controls).

For it to work the way I suggested, there would be low-level routines that handle the names raw, and then, on top of those, interface routines doing what you describe. On the Austin Group list, they mentioned a filesystem doing it directly in UTF-16, and it could have been the one you describe.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-16 19:30 GMT+02:00 Shawn Steele via Unicode:

> C) The data was corrupted by some other means. Perhaps bad
> concatenations, lost blocks during read/transmission, etc. If we lost 2
> 512-byte blocks, then maybe we should have a thousand FFFDs (but how would
> we know?)

Thousands of U+FFFDs are not a problem (independently of the internal UTF encoding used): yes, the two 512-byte blocks could then become 3 times larger (if using UTF-8 internal encoding) or 2 times larger (if using UTF-16 internal encoding), but every application should be prepared to support this size expansion, which has a completely known maximum factor and could occur as well with any valid CJK-only text. So the size to allocate for the internal storage is predictable from the size of the input; this is an important feature of all standard UTFs.

Being able to handle the worst case of allowed expansion argues largely for adopting UTF-16 as the internal encoding instead of UTF-8 (where you'll need to allocate more space before decoding the input if you want to avoid successive memory reallocations, which would impact the performance of your decoder): it's simple to accept input from 512-byte (or 1 KB) buffers and allocate a 1 KB (or 2 KB) buffer for storing the intermediate results in the generic decoder, and simpler on the outer level to preallocate buffers with reasonable sizes that will be reallocated once, if needed, to the maximum size, and then reduced to the effective size (if needed) at the end of successful decoding (some implementations can use pools of preallocated buffers with small static sizes, allocating new buffers outside the pool only for the rare cases where more space is needed).
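[Editor's note: the worst-case expansion discussed above can be checked directly, assuming a decoder that emits one U+FFFD per ill-formed byte: 512 bytes of bare trail bytes become 512 replacement characters, i.e. 3x the input size re-encoded as UTF-8 and 2x as UTF-16 -- the same bound that applies to valid CJK-heavy text.]

```python
corrupt = b"\x80" * 512                    # every byte is its own ill-formed sequence
decoded = corrupt.decode("utf-8", errors="replace")
assert decoded == "\ufffd" * 512           # one U+FFFD per bad byte

assert len(decoded.encode("utf-8")) == 3 * 512     # U+FFFD is 3 bytes in UTF-8
assert len(decoded.encode("utf-16-le")) == 2 * 512  # one 16-bit unit in UTF-16
```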
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 5/16/2017 10:30 AM, Shawn Steele via Unicode wrote:

>> Would you advocate replacing
>> e0 80 80
>> with
>> U+FFFD U+FFFD U+FFFD (1)
>> rather than
>> U+FFFD (2)
>> It’s pretty clear what the intent of the encoder was there, I’d say, and
>> while we certainly don’t want to decode it as a NUL (that was the source
>> of previous security bugs, as I recall), I also don’t see the logic in
>> insisting that it must be decoded to *three* code points when it clearly
>> only represented one in the input.
>
> It is not at all clear what the intent of the encoder was - or even if
> it's not just a problem with the data stream. E0 80 80 is not permitted,
> it's garbage. An encoder can't "intend" it.
>
> Either A) the "encoder" was attempting to be malicious, in which case the
> whole thing is suspect and garbage, and so the # of FFFDs doesn't matter,
> or B) the "encoder" is completely broken, in which case all bets are off
> and, again, specifying the # of FFFDs is irrelevant, or C) the data was
> corrupted by some other means. Perhaps bad concatenations, lost blocks
> during read/transmission, etc. If we lost 2 512-byte blocks, then maybe we
> should have a thousand FFFDs (but how would we know?)
>
> -Shawn

Clearly, for the receiver, nothing reliable can be deduced about the raw byte stream once an FFFD has been inserted. For the receiver, there's a fourth case that might have been:

D) the raw UTF-8 stream contained a valid U+FFFD
RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Regardless, it's not legal and hasn't been legal for quite some time. Replacing a hacked embedded "null" with FFFD is going to be pretty breaking to anything depending on that fake null, so one or three isn't really going to matter.

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard Wordingham via Unicode
Sent: Tuesday, May 16, 2017 10:58 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

On Tue, 16 May 2017 17:30:01 +0000, Shawn Steele via Unicode wrote:

> > Would you advocate replacing
> > e0 80 80
> > with
> > U+FFFD U+FFFD U+FFFD (1)
> > rather than
> > U+FFFD (2)
> > It’s pretty clear what the intent of the encoder was there, I’d say,
> > and while we certainly don’t want to decode it as a NUL (that was
> > the source of previous security bugs, as I recall), I also don’t see
> > the logic in insisting that it must be decoded to *three* code
> > points when it clearly only represented one in the input.
>
> It is not at all clear what the intent of the encoder was - or even if
> it's not just a problem with the data stream. E0 80 80 is not
> permitted, it's garbage. An encoder can't "intend" it.

It was once a legal way of encoding NUL, just like C0 80, which is still in use and seems to be the best way of storing NUL as character content in a *C string*. (Strictly speaking, one can't do it.) It could be lurking in old text or come from an old program that somehow doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8.

Remember, the conformance test for the Unicode Collation Algorithm has contained lone surrogates in the past, and the UAX on Unicode Regular Expressions used to require the ability to search for lone surrogates.

Richard.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Windows NTFS (and the LFN extension of FAT32 and exFAT) at least, random sequences of 16-bit code units are not permitted. There's visibly a validation step that returns an error if you attempt to create files with invalid sequences (including other restrictions such as forbidding U+0000 and some other problematic controls). This occurs because the NTFS and FAT drivers will also attempt to normalize the string in order to create compatibility 8.3 filenames using the system's native locale (not the current user locale, which is used when searching files/enumerating directories or opening files - this could generate errors when the encodings for distinct locales do not match, but should not cause errors when filenames are **first** searched using the UTF-16 encoding specified by applications; applications that still need to access files using their short names are deprecated anyway).

The kind of normalization performed when creating short 8.3 filenames uses OS-specific conversion tables built into the filesystem drivers. This generation has a cost due to the uniqueness constraints (requiring the first part of the 8.3 name to be abbreviated in order to add "~numbered" suffixes before the extension, whose value is unpredictable if other existing "*~1.*" files are present: it requires the driver to retry with another number, looping if necessary). It also has a (very modest) storage cost, but that is less critical than the enumeration step and the fact that these shortened names cannot be predicted by applications. This canonicalization is also required because the filesystem is case-insensitive (and it's technically not possible to store all the multiple case variants of filenames as assigned aliases/physical links).

In classic filesystems for Unix/Linux, the only restrictions are forbidding null bytes and assigning "/" a role in hierarchical filesystems (unusable anywhere in a directory entry name), plus the reservation of the "." and ".." entries in directories, meaning that only 8-bit encodings based on 7-bit ASCII are possible. So Linux/Unix do not completely treat these filenames as pure binary bags of bytes (however, this is not always checked, and such random names may occur; they will be difficult to handle with classic tools and shells). Some other filesystems for Linux/Unix still enforce restrictions (and there even exist versions that support case insensitivity, in addition to the emulated FAT12/FAT16/FAT32/exFAT/NTFS filesystems: this also exists as an option in the NFS driver, in drivers for legacy filesystems originally coming from mainframes, in filesystem drivers based on FTP, and even in the filesystem driver that allows mounting the Windows registry, which is also case-insensitive).

Technically, in the core kernel of Linux/Unix there's no restriction on the effective encoding (except "/" and null); the actual restrictions are implemented within filesystem drivers, configured only when volumes are mounted: each mounted filesystem can then have its own internal encoding, and there will be different behaviors when using a driver for a MacOS filesystem. Linux can perfectly well work with NTFS filesystems, except that most of the time short filenames will be completely ignored and not generated on the fly. This generation of short filenames in a legacy (unspecified) 8-bit codepage is not a requirement of NTFS, and it can also be disabled in Windows.
But FAT12/FAT16/FAT32 still require these legacy short names to be generated even when only the LFN will be used, with the short 8.3 name left completely null in the main directory entry; legacy FAT drivers will choke on these null entries if they are not tagged by a custom attribute bit as "ignorable but not empty", or if the 8+3 characters do not use specific unique patterns such as "\" followed by 7 pseudo-random characters in the main part, plus 3 other pseudo-random characters in the extension (these 10 characters may use any non-null value: they provide nearly 80 bits, or more exactly 250^10 identifiers, if we exclude the 6 reserved characters "/", "\", ".", ":", NULL and SPACE; they could be generated almost predictably simply by hashing the original unabbreviated name, with 79 bits taken from SHA-128, or faster with simple MD5 hashing, and very rare remaining collisions to handle).

Some FAT repair tools will attempt to repair legacy short filenames that are not unique or cannot be derived from the UTF-16 encoded LFN (this happens when "repairing" a FAT volume initially created on another system that used a different 8-bit OEM codepage), but these "CheckDisk" tools should have an option not to "repair" them, given that modern applications normally do not need these filenames if an LFN is present (even the Windows Explorer will not display these short names, because they are hidden by default whenever there's an LFN overriding them). We must add, however, that on FAT filesystems, an LFN will not always be stored if the Unicode name already has
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, 16 May 2017 17:30:01 + Shawn Steele via Unicode wrote: > > Would you advocate replacing > > > e0 80 80 > > > with > > > U+FFFD U+FFFD U+FFFD (1) > > > rather than > > > U+FFFD (2) > > > It’s pretty clear what the intent of the encoder was there, I’d > > say, and while we certainly don’t want to decode it as a NUL (that > > was the source of previous security bugs, as I recall), I also > > don’t see the logic in insisting that it must be decoded to *three* > > code points when it clearly only represented one in the input. > > It is not at all clear what the intent of the encoder was - or even > if it's not just a problem with the data stream. E0 80 80 is not > permitted, it's garbage. An encoder can't "intend" it. It was once a legal way of encoding NUL, just like C0 80, which is still in use, and seems to be the best way of storing NUL as character content in a *C string*. (Strictly speaking, one can't do it.) It could be lurking in old text or come from an old program that somehow doesn't get used for U+0080 to U+07FF. Converting everything in UCS-2 to 3 bytes was an easily encoded way of converting UTF-16 to UTF-8. Remember the conformance test for the Unicode Collation Algorithm has contained lone surrogates in the past, and the UAX on Unicode Regular Expressions used to require the ability to search for lone surrogates. Richard.
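The overlong sequences discussed here are easy to check against a decoder that follows the current recommendation; Python 3's built-in decoder is one such:

```python
# Python 3 follows the current recommendation: the overlong <E0 80 80>
# is three one-byte errors (0xE0 requires a continuation in A0..BF),
# and the two-byte overlong NUL <C0 80> is two errors.
assert b'\xe0\x80\x80'.decode('utf-8', errors='replace') == '\ufffd\ufffd\ufffd'
assert b'\xc0\x80'.decode('utf-8', errors='replace') == '\ufffd\ufffd'
```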
RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> Would you advocate replacing > e0 80 80 > with > U+FFFD U+FFFD U+FFFD (1) > rather than > U+FFFD (2) > It’s pretty clear what the intent of the encoder was there, I’d say, and > while we certainly don’t > want to decode it as a NUL (that was the source of previous security bugs, as > I recall), I also don’t > see the logic in insisting that it must be decoded to *three* code points > when it clearly only > represented one in the input. It is not at all clear what the intent of the encoder was - or even if it's not just a problem with the data stream. E0 80 80 is not permitted, it's garbage. An encoder can't "intend" it. Either A) the "encoder" was attempting to be malicious, in which case the whole thing is suspect and garbage, and so the # of FFFD's doesn't matter, or B) the "encoder" is completely broken, in which case all bets are off, again, specifying the # of FFFD's is irrelevant. C) The data was corrupted by some other means. Perhaps bad concatenations, lost blocks during read/transmission, etc. If we lost two 512-byte blocks, then maybe we should have a thousand FFFDs (but how would we know?) -Shawn
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 18:38, Alastair Houghton wrote: > > On 16 May 2017, at 17:23, Hans Åberg wrote: >> >> HFS implements case insensitivity in a layer above the filesystem raw >> functions. So it is perfectly possible to have files that differ by case >> only in the same directory by using low level function calls. The Tenon >> MachTen did that on Mac OS 9 already. > > You keep insisting on this, but it’s not true; I’m a disk utility developer, > and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory > data (a single one for the entire disk, not one per directory either), and > that that tree is sorted by (CNID, filename) pairs. And since it’s > case-preserving *and* case-insensitive, the comparisons it does to order its > B+-Tree nodes *cannot* be raw. I should know - I’ve actually written the > code for it! > > Even for legacy HFS, which didn’t store UTF-16, but stored a specified Mac > legacy encoding (the encoding used is in the volume header), it’s case > insensitive, so the encoding matters. > > I don’t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know > how the filesystem works. One could make files that differed by case in the same directory, and Mac OS 9 did not bother. Legacy HFS tended to slow down with many files in the same directory, so that gave an impression of a tree structure. The BSD filesystem at the time, perhaps the one that Mac OS X once supported, did not store files in a tree, but flat with redundancy. The other info I got from the Austin Group List a decade ago.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 17:23, Hans Åberg wrote: > > HFS implements case insensitivity in a layer above the filesystem raw > functions. So it is perfectly possible to have files that differ by case only > in the same directory by using low level function calls. The Tenon MachTen > did that on Mac OS 9 already. You keep insisting on this, but it’s not true; I’m a disk utility developer, and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory data (a single one for the entire disk, not one per directory either), and that that tree is sorted by (CNID, filename) pairs. And since it’s case-preserving *and* case-insensitive, the comparisons it does to order its B+-Tree nodes *cannot* be raw. I should know - I’ve actually written the code for it! Even for legacy HFS, which didn’t store UTF-16, but stored a specified Mac legacy encoding (the encoding used is in the volume header), it’s case insensitive, so the encoding matters. I don’t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know how the filesystem works. Kind regards, Alastair. -- http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 18:13, Alastair Houghton wrote: > > On 16 May 2017, at 17:07, Hans Åberg wrote: >> > HFS(+), NTFS and VFAT long filenames are all encoded in some variation on > UCS-2/UTF-16. ... The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that used UTF-16 directly, but I think it may not be current. >>> >>> No, that’s not true. All three of those systems store UTF-16 on the disk >>> (give or take). >> >> I am not speaking about what they store, but how the filesystem identifies >> files. > > Well, quite clearly none of those systems treat the UTF-16 strings as binary > either - they’re case insensitive, so how could they? HFS+ even normalises > strings using a variant of a frozen version of the normalisation spec. HFS implements case insensitivity in a layer above the filesystem raw functions. So it is perfectly possible to have files that differ by case only in the same directory by using low level function calls. The Tenon MachTen did that on Mac OS 9 already.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 17:07, Hans Åberg wrote: > HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ... >>> >>> The filesystem directory is using octet sequences and does not bother >>> passing over an encoding, I am told. Someone could remember one that >>> used UTF-16 directly, but I think it may not be current. >> >> No, that’s not true. All three of those systems store UTF-16 on the disk >> (give or take). > > I am not speaking about what they store, but how the filesystem identifies > files. Well, quite clearly none of those systems treat the UTF-16 strings as binary either - they’re case insensitive, so how could they? HFS+ even normalises strings using a variant of a frozen version of the normalisation spec. Kind regards, Alastair. -- http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 17:52, Alastair Houghton wrote: > > On 16 May 2017, at 16:44, Hans Åberg wrote: >> >> On 16 May 2017, at 17:30, Alastair Houghton via Unicode >> wrote: >>> >>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on >>> UCS-2/UTF-16. ... >> >> The filesystem directory is using octet sequences and does not bother >> passing over an encoding, I am told. Someone could remember one that used >> UTF-16 directly, but I think it may not be current. > > No, that’s not true. All three of those systems store UTF-16 on the disk > (give or take). I am not speaking about what they store, but how the filesystem identifies files.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 16:44, Hans Åberg wrote: > > On 16 May 2017, at 17:30, Alastair Houghton via Unicode > wrote: >> >> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on >> UCS-2/UTF-16. ... > > The filesystem directory is using octet sequences and does not bother passing > over an encoding, I am told. Someone could remember one that used UTF-16 > directly, but I think it may not be current. No, that’s not true. All three of those systems store UTF-16 on the disk (give or take). On Windows, the “ANSI” APIs convert the filenames to or from the appropriate Windows code page, while the “Wide” API works in UTF-16, which is the native encoding for VFAT long filenames and NTFS filenames. And, as I said, on Mac OS X and iOS, the kernel expects filenames to be encoded as UTF-8 at the BSD API, regardless of what encoding you might be using in your Terminal (this is different to traditional UNIX behaviour, where how you interpret your filenames is entirely up to you - usually you’d use the same encoding you were using on your tty). Kind regards, Alastair. -- http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 17:30, Alastair Houghton via Unicode wrote: > > On 16 May 2017, at 14:23, Hans Åberg via Unicode wrote: >> >> You don't. You have a filename, which is an octet sequence of unknown >> encoding, and want to deal with it. Therefore, valid Unicode transformations >> of the filename may result in it not being reachable. >> >> It only matters that the correct octet sequence is handed back to the >> filesystem. All current filesystems, as far as experts could recall, use >> octet sequences at the lowest level; whatever encoding is used is built in a >> layer above. > > HFS(+), NTFS and VFAT long filenames are all encoded in some variation on > UCS-2/UTF-16. ... The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told. Someone could remember one that used UTF-16 directly, but I think it may not be current.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 14:23, Hans Åberg via Unicode wrote: > > You don't. You have a filename, which is an octet sequence of unknown > encoding, and want to deal with it. Therefore, valid Unicode transformations > of the filename may result in it not being reachable. > > It only matters that the correct octet sequence is handed back to the > filesystem. All current filesystems, as far as experts could recall, use octet > sequences at the lowest level; whatever encoding is used is built in a layer > above. HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. FAT 8.3 names are also encoded, but the encoding isn’t specified (more specifically, MS-DOS and Windows assume an encoding based on your locale, which could cause all kinds of fun if you swapped disks with someone from a different country, and IIRC there are some shenanigans for Japan because of the use of 0xe5 as a deleted file marker). There are some less widely used filesystems that require a particular encoding also (BeOS’ BFS used UTF-8, for instance). Also, Mac OS X and iOS use UTF-8 at the BSD layer; if a filesystem is in use whose names can’t be converted to UTF-8, the Darwin kernel uses a percent encoding scheme(!) It looks like Apple has changed its mind for APFS and is going with the “bag of bytes” approach that’s typical of other systems; at least, that’s what it appears to have done on iOS. Kind regards, Alastair. -- http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-16 15:23 GMT+02:00 Hans Åberg: > All current filesystems, as far as experts could recall, use octet > sequences at the lowest level; whatever encoding is used is built in a > layer above > Not NTFS (on Windows), which uses sequences of 16-bit units. The same goes for FAT32/exFAT with "Long File Names" (the legacy 8.3 short filenames use legacy 8-bit codepages, but these are alternate filenames used when long filenames are not found, working mostly like aliases (hard links) on Unix filesystems, as if they were separate directory entries, except that they are hidden by default when their matching LFNs are already shown)
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 15:00, Philippe Verdy wrote: > > 2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode : > > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > > wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to have Unicode codepoint > markers that indicate how UTF-8, including non-valid sequences, is translated > into UTF-32 in a way that the original octet sequence can be restored. > > Why just UTF-32 ? Synonym for codepoint numbers. It would suffice to add markers for how it is translated. For example, codepoints meaning "overlong of length n", "raw byte", or whatever is useful. > How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid > UTF-8/UTF-16/UTF-32 ? You don't. You have a filename, which is an octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in it not being reachable. It only matters that the correct octet sequence is handed back to the filesystem. All current filesystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, 16 May 2017 14:44:44 +0200 Hans Åberg via Unicode wrote: > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > > wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to have Unicode > codepoint markers that indicate how UTF-8, including non-valid > sequences, is translated into UTF-32 in a way that the original octet > sequence can be restored. Escape sequences for the inappropriate bytes is the natural technique. Your problem is smoothly transitioning so that the escape character is always escaped when it means itself. Strictly, it can't be done. Of course, some sequences of escaped characters should be prohibited. Checking could be fiddly. Richard.
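Python's "surrogateescape" error handler is one shipping example of the escaping technique Richard describes, complete with his caveat that the escape mechanism itself is not valid Unicode:

```python
# Each undecodable byte 0x80..0xFF maps to a lone surrogate
# U+DC80..U+DCFF; encoding with the same handler restores the
# original octets exactly, which is the round-trip property
# asked for above.
raw = b'caf\xe9.txt'                 # Latin-1 bytes, not valid UTF-8
name = raw.decode('utf-8', errors='surrogateescape')
assert name == 'caf\udce9.txt'
assert name.encode('utf-8', errors='surrogateescape') == raw
# Caveat: the intermediate string contains a lone surrogate, so it is
# not valid Unicode and must not leak into interchange.
```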
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, 16 May 2017 20:08:52 +0900 "Martin J. Dürst via Unicode" wrote: > I agree with others that ICU should not be considered to have a > special status, it should be just one implementation among others. > [The next point is a side issue, please don't spend too much time on > it.] I find it particularly strange that at a time when UTF-8 is > firmly defined as up to 4 bytes, never including any bytes above > 0xF4, the Unicode consortium would want to consider recommending that > <FD 81 82 83 84 85> be converted to a single U+FFFD. I note with > agreement that Markus seems to have thoughts in the same direction, > because the proposal (17168-utf-8-recommend.pdf) says "(I suppose > that lead bytes above F4 could be somewhat debatable.)". The undesirable sidetrack, I suppose, is worrying about how many planes will be required for emoji. However, it does make the point that, while some practices may be better than others, there isn't necessarily a best practice. The English of the proposal is unclear - the text would benefit from showing some maximal subsequences (poor terminology - some of us are used to non-contiguous subsequences). When he writes, "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF", I am pretty sure he means "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, with the only restriction on trailing bytes beyond the number of them being that they must be in the range 80..BF". Thus Philippe's example of "E0 E0 C3 89" would be converted with an error flagged to a sequence of scalar values FFFD FFFD C9. This may make a UTF-8 system usable if it tries to use something like non-characters as understood before CLDR was caught publishing them as an essential part of text files. Richard.
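Richard's reading ("E0 E0 C3 89" → FFFD FFFD C9) matches what decoders following the current recommendation already produce; Python 3 is one check:

```python
# <E0 E0 C3 89>: E0 cannot be followed by another E0, so each E0 is a
# one-byte maximal subsequence, and <C3 89> then decodes normally as
# U+00C9.
assert (b'\xe0\xe0\xc3\x89'.decode('utf-8', errors='replace')
        == '\ufffd\ufffd\u00c9')
```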
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode: > > > On 15 May 2017, at 12:21, Henri Sivonen via Unicode > wrote: > ... > > I think Unicode should not adopt the proposed change. > > It would be useful, for use with filesystems, to have Unicode codepoint > markers that indicate how UTF-8, including non-valid sequences, is > translated into UTF-32 in a way that the original octet sequence can be > restored. Why just UTF-32 ? How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid UTF-8/UTF-16/UTF-32 ? In all cases this would require extensions to the 3 standards (which MUST be interoperable); then you'll choke on new validation rules for these 3 standards for these extensions, and new ill-formed sequences that you won't be able to convert interoperably. Given the most restrictive condition in UTF-16 (which is still the most widely used internal representation), such extensions would be very complex to manage. There's no solution: such extensions in any one of them are undesirable and can only be used privately (without interoperating with the other 2 representations), so it's impossible to make sure the original octet sequences can be restored. Any deviation from UTF-8/16/32 will be confined to that same UTF. It cannot be part of the 3 standard UTFs, but may be part of a distinct encoding, not fully compatible with the 3 standards.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 15 May 2017, at 12:21, Henri Sivonen via Unicode wrote: ... > I think Unicode should not adopt the proposed change. It would be useful, for use with filesystems, to have Unicode codepoint markers that indicate how UTF-8, including non-valid sequences, is translated into UTF-32 in a way that the original octet sequence can be restored.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode: > > One additional note: the standard codifies this behaviour as a > *recommendation*, not a requirement. > > This is an odd argument in favor of changing it. If the argument is > that it's just a recommendation that you don't need to adhere to, > surely then the people who don't like the current recommendation > should choose not to adhere to it instead of advocating changing it. I also agree. The internet is full of RFC specifications that are also "best practices", and even in this case, changing them must be extensively documented, including discussing new compatibility/interoperability problems and new security risks. The case of random access in substrings is significant, because what was once valid UTF-8 handling could become invalid if the best-practice recommendation is not followed, and could then cause unexpected failures: uncaught exceptions causing software to suddenly fail and become subject to possible attacks due to this new failure (this is mostly a problem for implementations that do not use "safe" U+FFFD replacements but throw exceptions on ill-formed input: we should not change the cases where these exceptions may occur by adding new cases caused by a change of implementation based on a change of best practice). The considerations about trying to reduce the number of U+FFFDs are not relevant, and purely aesthetic, because some people would like to compact the decoded result in memory. What is really important is not to ignore these ill-formed sequences silently, and to properly track that there was some data loss. The number of U+FFFDs inserted (only one, or as many as there are invalid code units in the input before the first resynchronization point) is not so important. 
As well, whether implementations will use an accumulator or just a single state (where each state knows how many code units have been parsed without emitting an output code point, so that these code units can be decoded by relative indexed accesses) is not relevant; it is just a very minor optimization case. In my opinion, using an accumulator that can live in a CPU register is faster than using relative indexed accesses. All modern CPUs have enough registers to store that accumulator, plus the input and output pointers, and a finite state number is not needed when the state can be tracked by the executable instruction position: you don't necessarily need to loop for each code unit, but can easily write your decoder so that each loop iteration will process a full code point or will emit a single U+FFFD before adjusting the input pointer. UTF-8 and UTF-16 complexity is small enough that unrolling such loops to process full code points instead of single code units will be easy to implement. That code will still remain very small (fitting fully in the instruction cache), and it will be faster because it will avoid several conditional branches and because it will save one register (for the finite state number) that will not need to be slowly saved on a stack: 2 pointer registers (or 2 access function/method addresses) + 2 data registers + the PC instruction counter is enough.
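The loop Philippe describes — process a full code point per iteration, or emit a single U+FFFD and resume at the byte that broke the sequence — can be sketched in a few lines. This is an illustrative Python version (no registers or unrolling, just the control structure plus the restricted first-trail-byte ranges), not his code; it happens to implement the current "best practice" recommendation:

```python
REPLACEMENT = '\ufffd'

def decode_utf8_replace(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per maximal ill-formed
    subsequence, per the current recommendation."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                       # ASCII: one byte, done
            out.append(chr(b)); i += 1; continue
        # Lead byte -> (trail count, allowed range of the FIRST trail
        # byte). The restricted first-trail ranges reject overlongs
        # (E0 80..) and surrogates (ED A0..) structurally.
        if 0xC2 <= b <= 0xDF:   need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:         need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:         need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF: need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:         need, lo, hi = 3, 0x90, 0xBF
        elif b == 0xF4:         need, lo, hi = 3, 0x80, 0x8F
        elif 0xF1 <= b <= 0xF3: need, lo, hi = 3, 0x80, 0xBF
        else:                              # C0, C1, F5..FF, stray trail
            out.append(REPLACEMENT); i += 1; continue
        cp, j, ok = b & (0x3F >> need), i + 1, True
        for k in range(need):
            if j >= n or not ((lo if k == 0 else 0x80) <= data[j]
                              <= (hi if k == 0 else 0xBF)):
                ok = False
                break
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        # Either a full code point, or a single U+FFFD for the maximal
        # ill-formed subsequence; resume at the byte that broke it.
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j
    return ''.join(out)
```

For every input this agrees with Python 3's built-in `errors='replace'` decoding, which follows the same recommendation.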
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Hello everybody, [using this mail to in effect reply to different mails in the thread] On 2017/05/16 17:31, Henri Sivonen via Unicode wrote: On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote: Under what circumstance would it matter how many U+FFFDs you see? Maybe it doesn't, but I don't think the burden of proof should be on the person advocating keeping the spec and major implementations as they are. If anything, I think those arguing for a change of the spec in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing with the current spec should show why it's important to have a different number of U+FFFDs than the spec's "best practice" calls for now. I have just checked (the programming language) Ruby. Some background: As you might know, Ruby is (at least in theory) pretty encoding-independent, meaning you can run scripts in iso-8859-1, in Shift_JIS, in UTF-8, or in any of quite a few other encodings directly, without any conversion. However, in practice, incl. Ruby on Rails, Ruby is very much using UTF-8 internally, and is optimized to work well that way. Character encoding conversion also works with UTF-8 as the pivot encoding. As far as I understand, Ruby does the same as all of the above software, based (among other things) on the fact that we followed the recommendation in the standard. 
Here are a few examples (sorry for the linebreaks introduced by mail software):

$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD"
$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: :replace).inspect'
#=> "\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"
$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", invalid: :replace).inspect'
#=> "A\uFFFD\uFFFDA\uFFFDA"

This is based on http://www.unicode.org/review/pr-121.html as noted at https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516&view=markup (for those having a look at these tests, in Ruby's version of assert_equal, the expected value comes first (not sure whether this is called little-endian or big-endian :-), but this is a decision where the various test frameworks are virtually split 50/50 :-(. )) Even if the above examples and the tests use conversion to UTF-16 (in particular the BE variant for better readability), what happens internally is that the input is analyzed byte-by-byte. In this case, it is easiest to just stop as soon as something is found that is clearly invalid (be this a single byte or something longer). This makes a data-driven implementation (such as the Ruby transcoder) or one based on a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) more compact. In other words, because we never know whether the next byte is a valid one such as 0x41, it's easier to just handle one byte at a time if this way we can avoid lookahead (which is always a good idea when parsing). I agree with Henri and others that there is no need at all to change the recommendation in the standard that has been stable for so long (close to 9 years). 
Because the original was done on a PR (http://www.unicode.org/review/pr-121.html), I think this should at least also be handled as a PR (if it's not dropped based on the discussion here). I think changing the current definition of "maximal subsequence" is a bad idea, because it would mean that one wouldn't know what one was speaking about over the years. If necessary, new definitions should be introduced for other variants. I agree with others that ICU should not be considered to have a special status; it should be just one implementation among others. [The next point is a side issue, please don't spend too much time on it.] I find it particularly strange that at a time when UTF-8 is firmly defined as up to 4 bytes, never including any bytes above 0xF4, the Unicode consortium would want to consider recommending that <FD 81 82 83 84 85> be converted to a single U+FFFD. I note with agreement that Markus seems to have thoughts in the same direction, because the proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes above F4 could be somewhat debatable.)". Regards, Martin.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> > The proposal actually does cover things that aren’t structurally valid, > like your e0 e0 e0 example, which it suggests should be a single U+FFFD > because the initial e0 denotes a three byte sequence, and your 80 80 80 > example, which it proposes should constitute three illegal subsequences > (again, both reasonable). However, I’m not entirely certain about things > like > > e0 e0 c3 89 > > which the proposal would appear to decode as > > U+FFFD U+FFFD U+FFFD U+FFFD (3) > > instead of a perhaps more reasonable > > U+FFFD U+FFFD U+00C9 (4) > > (the key part is the “without ever restricting trail bytes to less than > 80..BF”) > I also agree with that, due to access in strings from a random position: if you access it from byte 0x89, you can assume it's a trailing byte and you'll want to look backward, and will see 0xC3 0x89, which will decode correctly as U+00C9 without any error detected. So the wrong bytes are only the initial two occurrences of 0xE0, which are individually converted to U+FFFD. In summary: when you detect any ill-formed sequence, only replace the first code unit by U+FFFD and restart scanning from the next code unit, without skipping over multiple bytes. This means that multiple occurrences of U+FFFD are not only the best practice; it also matches the intended design of UTF-8 to allow access from random positions.
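The backward scan Philippe describes (look back at most 3 bytes for a lead byte, then check whether it announces enough trail bytes to cover the starting position) can be sketched as follows; the function name and the purely structural check are illustrative assumptions:

```python
def resync_backward(buf: bytes, pos: int) -> int:
    """Return the offset of the code-point start whose sequence covers
    `pos`, or `pos` itself if that byte sits inside an ill-formed
    sequence. Purely structural: it does not apply the restricted
    first-trail ranges, which per the discussion would flag more errors."""
    for back in range(4):                  # look back at most 3 bytes
        i = pos - back
        if i < 0:
            break
        b = buf[i]
        if 0x80 <= b <= 0xBF:              # trail byte: keep scanning back
            continue
        # ASCII or lead byte found: how long a sequence does it announce?
        if b < 0x80:    length = 1
        elif b < 0xC2:  length = 0         # C0/C1: never valid
        elif b < 0xE0:  length = 2
        elif b < 0xF0:  length = 3
        elif b <= 0xF4: length = 4
        else:           length = 0         # F5..FF: never valid
        # It covers `pos` only if it announces enough trail bytes.
        return i if length > back else pos
    return pos                             # no lead byte within 3 bytes
```

On valid text this finds the enclosing code point; on Philippe's error cases (a stray trail byte, or a lead byte with too few trail bytes) it returns the queried position itself, which is where an iterator would report U+FFFD.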
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 1:09 PM, Alastair Houghton wrote: > On 16 May 2017, at 09:31, Henri Sivonen via Unicode > wrote: >> >> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton >> wrote: >>> That would be true if the in-memory representation had any effect on what >>> we’re talking about, but it really doesn’t. >> >> If the internal representation is UTF-16 (or UTF-32), it is a likely >> design that there is a variable into which the scalar value of the >> current code point is accumulated during UTF-8 decoding. > > That’s quite a likely design with a UTF-8 internal representation too; it’s > just that you’d only decode during processing, as opposed to immediately at > input. The time to generate the U+FFFDs is at the input time which is what's at issue here. The later processing, which may then involve iterating by code point and involving computing the scalar values is a different step that should be able to assume valid UTF-8 and not be concerned with invalid UTF-8. (To what extent different programming languages and frameworks allow confident maintenance of the invariant that after input all in-RAM UTF-8 can be treated as valid varies.) >> When the internal representation is UTF-8, only UTF-8 validation is >> needed, and it's natural to have a fail-fast validator, which *doesn't >> necessarily need such a scalar value accumulator at all*. > > Sure. But a state machine can still contain appropriate error states without > needing an accumulator. As I said upthread, it could, but it seems inappropriate to ask implementations to take on that extra complexity on as weak grounds as "ICU does it" or "feels right" when the current recommendation doesn't call for those extra states and the current spec is consistent with a number of prominent non-ICU implementations, including Web browsers. >>> In what sense is this “interop”? >> >> In the sense that prominent independent implementations do the same >> externally observable thing. 
> > The argument is, I think, that in this case the thing they are doing is the > *wrong* thing. It seems weird to characterize following the currently-specced "best practice" as "wrong" without showing a compelling fundamental flaw (such as a genuine security problem) in the currently-specced "best practice". With implementations of the currently-specced "best practice" already shipped, I don't think aesthetic preferences should be considered enough of a reason to proclaim behavior adhering to the currently-specced "best practice" as "wrong". > That many of them do it would only be an argument if there was some reason > that it was desirable that they did it. There doesn’t appear to be such a > reason, unless you can think of something that hasn’t been mentioned thus far? I've already given a reason: UTF-8 validation code not needing to have extra states catering to aesthetic considerations of U+FFFD consolidation. > The only reason you’ve given, to date, is that they currently do that, so > that should be the recommended behaviour (which is little different from the > argument - which nobody deployed - that ICU currently does the other thing, > so *that* should be the recommended behaviour; the only difference is that > *you* care about browsers and don’t care about ICU, whereas you yourself > suggested that some of us might be advocating this decision because we care > about ICU and not about e.g. browsers). Not just browsers. Also OpenJDK and Python 3. Do I really need to test the standard libraries of more languages/systems to more strongly make the case that the ICU behavior (according to the proposal PDF) is not the norm and what the spec currently says is? > I’ll add also that even among the implementations you cite, some of them > permit surrogates in their UTF-8 input (i.e. they’re actually processing > CESU-8, not UTF-8 anyway). 
Python, for example, certainly accepts the > sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail” > implementation that conformed literally to the recommendation, as you seem to > want, should instead replace it with *four* U+FFFDs (I think), no? I see that behavior in Python 2. Earlier, I said that Python 3 agrees with the current spec for my test case. The Python 2 behavior I see is not just against "best practice" but obviously incompliant. (For details: I tested Python 2.7.12 and 3.5.2 as shipped on Ubuntu 16.04.) > One additional note: the standard codifies this behaviour as a > *recommendation*, not a requirement. This is an odd argument in favor of changing it. If the argument is that it's just a recommendation that you don't need to adhere to, surely then the people who don't like the current recommendation should choose not to adhere to it instead of advocating changing it. -- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/
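The CESU-8 surrogate-pair example can be checked concretely: under the current recommendation <ED A0 BD ED B8 80> is six one-byte maximal subsequences (ED only admits continuations in 80..9F), not four, and Python 3 agrees:

```python
# Python 3 (unlike Python 2) rejects the CESU-8 encoding of U+1F600.
# Each ED is a one-byte maximal subsequence because its continuation
# must be in 80..9F, so the stray A0/BD/B8/80 bytes are errors too:
# six U+FFFDs in total.
assert (b'\xed\xa0\xbd\xed\xb8\x80'.decode('utf-8', errors='replace')
        == '\ufffd' * 6)
```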
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 09:31, Henri Sivonen via Unicode wrote:

> On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton wrote:
>> That would be true if the in-memory representation had any effect on what we’re talking about, but it really doesn’t.
>
> If the internal representation is UTF-16 (or UTF-32), it is a likely design that there is a variable into which the scalar value of the current code point is accumulated during UTF-8 decoding.

That’s quite a likely design with a UTF-8 internal representation too; it’s just that you’d only decode during processing, as opposed to immediately at input.

> When the internal representation is UTF-8, only UTF-8 validation is needed, and it's natural to have a fail-fast validator, which *doesn't necessarily need such a scalar value accumulator at all*.

Sure. But a state machine can still contain appropriate error states without needing an accumulator. That the ones you care about currently don’t is readily apparent, but there’s nothing stopping them from doing so.

I don’t see this as an argument about implementations, since it really makes very little difference to the implementation which approach is taken; in both internal representations, the question is whether you generate U+FFFD immediately on detection of the first incorrect *byte*, or whether you do so after reading a complete sequence. UTF-8 sequences are bounded anyway, so it isn’t as if failing early gives you any significant performance benefit.

>> In what sense is this “interop”?
>
> In the sense that prominent independent implementations do the same externally observable thing.

The argument is, I think, that in this case the thing they are doing is the *wrong* thing. That many of them do it would only be an argument if there was some reason that it was desirable that they did it. There doesn’t appear to be such a reason, unless you can think of something that hasn’t been mentioned thus far?
The only reason you’ve given, to date, is that they currently do that, so that should be the recommended behaviour (which is little different from the argument - which nobody deployed - that ICU currently does the other thing, so *that* should be the recommended behaviour; the only difference is that *you* care about browsers and don’t care about ICU, whereas you yourself suggested that some of us might be advocating this decision because we care about ICU and not about e.g. browsers). I’ll add also that even among the implementations you cite, some of them permit surrogates in their UTF-8 input (i.e. they’re actually processing CESU-8, not UTF-8 anyway). Python, for example, certainly accepts the sequence [ed a0 bd ed b8 80] and decodes it as U+1F600; a true “fast fail” implementation that conformed literally to the recommendation, as you seem to want, should instead replace it with *four* U+FFFDs (I think), no? One additional note: the standard codifies this behaviour as a *recommendation*, not a requirement. Kind regards, Alastair. -- http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 10:29, David Starner wrote:
>
> On Tue, May 16, 2017 at 1:45 AM Alastair Houghton wrote:
> That’s true anyway; imagine the database holds raw bytes, that just happen to decode to U+FFFD. There might seem to be *two* names that both contain U+FFFD in the same place. How do you distinguish between them?
>
>> If the database holds raw bytes, then the name is a byte string, not a Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule to make and enforce that a string in a database is a validly formatted string; I would hope that most SQL servers do in fact reject malformed UTF-8 strings. On the other hand, I'd expect that an SQL server would accept U+FFFD in a Unicode string.

Databases typically separate the encoding in which strings are stored from the encoding in which an application connected to the database is operating. A database might well hold data in (say) ISO Latin 1, EUC-JP, or indeed any other character set, while presenting it to a client application as UTF-8 or UTF-16. Hence my comment - application software could very well see two names that are apparently identical and that include U+FFFDs in the same places, even though the database back-end actually has different strings. As I said, this is a problem we already have.

> I don’t see a problem; the point is that where a structurally valid UTF-8 encoding has been used, albeit in an invalid manner (e.g. encoding a number that is not a valid code point, or encoding a valid code point as an over-long sequence), a single U+FFFD is appropriate. That seems a perfectly sensible rule to adopt.
>
>> It seems like a perfectly arbitrary rule to adopt; I'd like to assume that the only source of such UTF-8 data is willful attempts to break security, and in that case, how is this a win? Nonattack sources of broken data are much more likely to be the result of mixing UTF-8 with other character encodings or raw binary data.
I’d say there are three sources of UTF-8 data of that ilk:

(a) bugs,
(b) “Modified UTF-8” and “CESU-8” implementations,
(c) wilful attacks.

(b) in particular is quite common, and the result of the presently recommended approach doesn’t make much sense there ([c0 80] will get replaced with *two* U+FFFDs, while [ed a0 bd ed b8 80] will be replaced by *four* U+FFFDs - surrogates aren’t supposed to be valid in UTF-8, right?)

Kind regards,

Alastair.

--
http://alastairs-place.net
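These counts can be checked against a shipping decoder that follows the currently-recommended best practice (CPython 3.x, per the test results elsewhere in the thread); it yields two U+FFFDs for the overlong NUL, but six rather than four for the surrogate pair:

```python
# CPython 3.x, whose UTF-8 decoder follows the currently-recommended
# best practice: the "Modified UTF-8" overlong NUL [c0 80] draws two
# U+FFFDs, and the CESU-8 surrogate pair [ed a0 bd ed b8 80] draws six
# (0xED with an out-of-range trail byte fails on its own, and each
# stray trail byte then fails separately).
assert b'\xc0\x80'.decode('utf-8', 'replace') == '\ufffd\ufffd'
assert b'\xed\xa0\xbd\xed\xb8\x80'.decode('utf-8', 'replace') == '\ufffd' * 6
```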
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 1:45 AM Alastair Houghton < alast...@alastairs-place.net> wrote: > That’s true anyway; imagine the database holds raw bytes, that just happen > to decode to U+FFFD. There might seem to be *two* names that both contain > U+FFFD in the same place. How do you distinguish between them? > If the database holds raw bytes, then the name is a byte string, not a Unicode string, and can't contain U+FFFD at all. It's a relatively easy rule to make and enforce that a string in a database is a validly formatted string; I would hope that most SQL servers do in fact reject malformed UTF-8 strings. On the other hand, I'd expect that an SQL server would accept U+FFFD in a Unicode string. > I don’t see a problem; the point is that where a structurally valid UTF-8 > encoding has been used, albeit in an invalid manner (e.g. encoding a number > that is not a valid code point, or encoding a valid code point as an > over-long sequence), a single U+FFFD is appropriate. That seems a > perfectly sensible rule to adopt. > It seems like a perfectly arbitrary rule to adopt; I'd like to assume that the only source of such UTF-8 data is willful attempts to break security, and in that case, how is this a win? Nonattack sources of broken data are much more likely to be the result of mixing UTF-8 with other character encodings or raw binary data. >
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
> On 16 May 2017, at 09:18, David Starner wrote:
>
> On Tue, May 16, 2017 at 12:42 AM Alastair Houghton wrote:
>> If you’re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents.
>
> Which causes various other security problems; if an object (file, database element, etc.) gets a name with a FFFD in it, it becomes impossible to reference. That an IEEE 754 float may not equal itself is a perpetual source of confusion for programmers.

That’s true anyway; imagine the database holds raw bytes, that just happen to decode to U+FFFD. There might seem to be *two* names that both contain U+FFFD in the same place. How do you distinguish between them?

Clearly if you are holding Unicode code points that you know are validly encoded somehow, you may want to be able to match U+FFFDs, but that’s a special case where you have extra knowledge.

> In this case, it's pretty clear, but I don't see it as a general rule. Any rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake or random binary data.

I don’t see a problem; the point is that where a structurally valid UTF-8 encoding has been used, albeit in an invalid manner (e.g. encoding a number that is not a valid code point, or encoding a valid code point as an over-long sequence), a single U+FFFD is appropriate. That seems a perfectly sensible rule to adopt.

The proposal actually does cover things that aren’t structurally valid, like your e0 e0 e0 example, which it suggests should be a single U+FFFD because the initial e0 denotes a three byte sequence, and your 80 80 80 example, which it proposes should constitute three illegal subsequences (again, both reasonable).
However, I’m not entirely certain about things like e0 e0 c3 89 which the proposal would appear to decode as U+FFFD U+FFFD U+FFFD U+FFFD (3) instead of a perhaps more reasonable U+FFFD U+FFFD U+00C9 (4) (the key part is the “without ever restricting trail bytes to less than 80..BF”) and if Markus or others could explain why they chose (3) over (4) I’d be quite interested to hear the explanation. Kind regards, Alastair. -- http://alastairs-place.net
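For reference, running Alastair's [e0 e0 c3 89] example through CPython 3.x (whose decoder follows the currently-specced best practice, per the test results elsewhere in the thread) yields outcome (4):

```python
# CPython 3.x (currently-specced best practice): each E0 lead whose
# next byte falls outside the permitted A0..BF trail range draws a
# single U+FFFD, and the well-formed [c3 89] tail still decodes to
# U+00C9 -- i.e. outcome (4), not the proposal's apparent outcome (3).
assert b'\xe0\xe0\xc3\x89'.decode('utf-8', 'replace') == '\ufffd\ufffd\u00c9'
```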
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag wrote:
> but I think the way he raises this point is needlessly antagonistic.

I apologize. My level of dismay at the proposal's ICU-centricity overcame me.

On Tue, May 16, 2017 at 10:42 AM, Alastair Houghton wrote:
> That would be true if the in-memory representation had any effect on what we’re talking about, but it really doesn’t.

If the internal representation is UTF-16 (or UTF-32), it is a likely design that there is a variable into which the scalar value of the current code point is accumulated during UTF-8 decoding. In such a scenario, it can be argued as "natural" to first operate according to the general structure of UTF-8 and then inspect what you got in the accumulation variable (ruling out non-shortest forms, values above the Unicode range and surrogate values after the fact).

When the internal representation is UTF-8, only UTF-8 validation is needed, and it's natural to have a fail-fast validator, which *doesn't necessarily need such a scalar value accumulator at all*. The construction at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ when used as a UTF-8 validator is the best illustration of a UTF-8 validator not necessarily looking like a "natural" UTF-8 to UTF-16 converter at all.

>>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right”.
>
> In what sense is this “interop”?

In the sense that prominent independent implementations do the same externally observable thing.
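The kind of fail-fast validator Henri describes can be sketched concretely. The following is a hypothetical illustration (the function name and structure are this sketch's, not any shipping implementation): it tracks only how many trail bytes remain and the permitted range of the next byte, both derived directly from the lead byte (which is where non-shortest forms and surrogates get rejected), with no scalar-value accumulator, and it emits one U+FFFD per maximal rejected prefix, reproducing the currently-specced best practice:

```python
def decode_fail_fast(data: bytes) -> str:
    """Sketch of a fail-fast UTF-8 validator/decoder with no scalar
    accumulator: one U+FFFD per maximal ill-formed subsequence."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:  # ASCII
            out.append(chr(b))
            i += 1
            continue
        # Sequence length and the valid range of the *second* byte come
        # straight from the lead byte; overlongs (E0/F0), surrogates
        # (ED) and values above U+10FFFF (F4) are rejected here.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F
        else:  # stray trail byte, C0/C1, or F5..FF: invalid lead
            out.append('\uFFFD')
            i += 1
            continue
        start = i
        i += 1
        ok = True
        for k in range(need):
            if i < n and (lo if k == 0 else 0x80) <= data[i] <= (hi if k == 0 else 0xBF):
                i += 1
            else:
                ok = False  # fail fast; do not consume the bad byte
                break
        if ok:
            out.append(data[start:i].decode('utf-8'))
        else:
            out.append('\uFFFD')  # one U+FFFD for the rejected prefix
    return ''.join(out)

# Matches the browser behavior Henri describes:
assert decode_fail_fast(b'\xe0\x80\x80') == '\ufffd\ufffd\ufffd'
assert decode_fail_fast(b'\xed\xa0\xbd\xed\xb8\x80') == '\ufffd' * 6
```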
> Under what circumstance would it matter how many U+FFFDs you see?

Maybe it doesn't, but I don't think the burden of proof should be on the person advocating keeping the spec and major implementations as they are. If anything, I think those arguing for a change of the spec in the face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing with the current spec should show why it's important to have a different number of U+FFFDs than the spec's "best practice" calls for now.

> If you’re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents.

In practice, e.g. the Web Platform doesn't allow for stopping operating on input that contains a U+FFFD, so the focus is mainly on making sure that U+FFFDs are placed well enough to prevent bad stuff under normal operations. At least typically, the number of U+FFFDs doesn't matter for that purpose, but when browsers agree on the number of U+FFFDs, changing that number should have an overwhelmingly strong rationale. A security reason could be a strong reason, but such a security motivation for fewer U+FFFDs has not been shown, to my knowledge.

> Would you advocate replacing
>
> e0 80 80
>
> with
>
> U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
> U+FFFD (2)

I advocate (1), most simply because that's what Firefox, Edge and Chrome do *in accordance with the currently-recommended best practice* and, less simply, because it makes sense in the presence of a fail-fast UTF-8 validator. I think the burden of proof to show an overwhelmingly good reason to change should, at this point, be on whoever proposes doing it differently than what the current widely-implemented spec says.
> It’s pretty clear what the intent of the encoder was there, I’d say, and > while we certainly don’t want to decode it as a NUL (that was the source of > previous security bugs, as I recall), I also don’t see the logic in insisting > that it must be decoded to *three* code points when it clearly only > represented one in the input. As noted previously, the logic is that you generate a U+FFFD whenever a fail-fast validator fails. > This isn’t just a matter of “feels nicer”. (1) is simply illogical > behaviour, and since behaviours (1) and (2) are both clearly out there today, > it makes sense to pick the more logical alternative as the official > recommendation. Again, the current best practice makes perfect logical sense in the context of a fail-fast UTF-8 validator. Moreover, it doesn't look like both are "out there" equally when major browsers, OpenJDK and Python 3 agree. (I expect I could find more prominent implementations that implement the currently-stated best practice, but I feel I shouldn't have to.) From my experience from working on Web standards and implementing them, I think it's
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <alast...@alastairs-place.net> wrote:

> If you’re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents.

Which causes various other security problems; if an object (file, database element, etc.) gets a name with a FFFD in it, it becomes impossible to reference. That an IEEE 754 float may not equal itself is a perpetual source of confusion for programmers.

> Would you advocate replacing
>
> e0 80 80
>
> with
>
> U+FFFD U+FFFD U+FFFD (1)
>
> rather than
>
> U+FFFD (2)
>
> It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input.

In this case, it's pretty clear, but I don't see it as a general rule. Any rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake or random binary data. 88 A0 8B D4 is UTF-16 Chinese, but I'm not going to insist that it get replaced with U+FFFD U+FFFD because it's clear (to me) it was meant as two characters.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, 16 May 2017 10:01:03 +0300 Henri Sivonen via Unicode wrote:

> Even so, I think even changing a recommendation of "best practice" needs way better rationale than "feels right" or "ICU already does it" when a) major browsers (which operate in the most prominent environment of broken and hostile UTF-8) agree with the currently-recommended best practice and b) the currently-recommended best practice makes more sense for implementations where "UTF-8 decoding" is actually mere "UTF-8 validation".

There was originally an attempt to prescribe rather than to recommend the interpretation of ill-formed 8-bit Unicode strings. It may even briefly have been an issued prescription, until common sense prevailed. I do remember a sinking feeling when I thought I would have to change my own handling of bogus UTF-8, only to be relieved later when it became mere best practice. However, it is not uncommon for coding standards to prescribe 'best practice'.

Richard.
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Mon, May 15, 2017 at 11:50 PM, Henri Sivonen via Unicode <unicode@unicode.org> wrote:

> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote:
> > I’m not sure how the discussion of “which is better” relates to the discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I meant on that point and then move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the dominant design. However, UTF-8 as the internal memory representation is *such a good design* (when legacy constraints permit) that, *despite it not being the current dominant design*, I think the Unicode Consortium should be fully supportive of UTF-8 as the internal memory representation and not treat UTF-16 as the internal representation as the one true way of doing things that gets considered when speccing stuff.
>
> I.e. I wasn't arguing against UTF-16 as the internal memory representation (for the purposes of this thread) but trying to motivate why the Consortium should consider "UTF-8 internally" equally despite it not being the dominant design.
>
> So: When a decision could go either way from the "UTF-16 internally" perspective, but one way clearly makes more sense from the "UTF-8 internally" perspective, the "UTF-8 internally" perspective should be decisive in *such a case*. (I think the matter at hand is such a case.)
>
> At the very least a proposal should discuss the impact on the "UTF-8 internally" case, which the proposal at hand doesn't do.
>
> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of implementations) issue to be decided but a proposed change to a standard that has many widely-deployed implementations.
> Even when observing only "UTF-16 internally" implementations, I think it would be appropriate for the proposal to include a review of what existing implementations, beyond ICU, do.
>
> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome)

Something I've learned through working with Node (the V8 JavaScript engine from Chrome): V8 stores strings either as UTF-16 or UTF-8 interchangeably, not one or the other... https://groups.google.com/forum/#!topic/v8-users/wmXgQOdrwfY

And I wouldn't really assume UTF-16 is a 'majority'; Go is UTF-8, for instance.

> shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right".
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 16 May 2017, at 08:22, Asmus Freytag via Unicode wrote:

> I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic.

That would be true if the in-memory representation had any effect on what we’re talking about, but it really doesn’t.

(The only time I can think of that the in-memory representation has a significant effect is where you’re talking about default binary ordering of string data, in which case, in the presence of non-BMP characters, UTF-8 and UCS-4 sort the same way, but because the surrogates are “in the wrong place”, UTF-16 doesn’t. I think everyone is well aware of that, no?)

>> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right”.

In what sense is this “interop”? Under what circumstance would it matter how many U+FFFDs you see?

If you’re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents.
Would you advocate replacing

e0 80 80

with

U+FFFD U+FFFD U+FFFD (1)

rather than

U+FFFD (2)?

It’s pretty clear what the intent of the encoder was there, I’d say, and while we certainly don’t want to decode it as a NUL (that was the source of previous security bugs, as I recall), I also don’t see the logic in insisting that it must be decoded to *three* code points when it clearly only represented one in the input.

This isn’t just a matter of “feels nicer”. (1) is simply illogical behaviour, and since behaviours (1) and (2) are both clearly out there today, it makes sense to pick the more logical alternative as the official recommendation.

Kind regards,

Alastair.

--
http://alastairs-place.net
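For the record, decoders following the current recommendation (e.g. CPython 3.x, matching the browser behaviour Henri reports elsewhere in the thread) produce behaviour (1) for this input:

```python
# CPython 3.x: the overlong sequence [e0 80 80] is replaced by three
# U+FFFDs -- behaviour (1) above. The E0 lead fails because 0x80 is not
# in its valid A0..BF trail range, and each stray 0x80 then fails as an
# isolated continuation byte.
assert b'\xe0\x80\x80'.decode('utf-8', 'replace') == '\ufffd' * 3
```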
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 15 May 2017, at 23:43, Richard Wordingham via Unicode wrote:

> The problem with surrogates is inadequate testing. They're sufficiently rare for many users that it may be a long time before an error is discovered. It's not always obvious that code is designed for UCS-2 rather than UTF-16.

While I don’t think we should spend too long debating the relative merits of UTF-8 versus UTF-16, I’ll note that that argument applies equally to both combining characters and indeed the underlying UTF-8 encoding in the first place, and that mistakes in handling both are not exactly uncommon. There are advantages to UTF-8 and advantages to UTF-16.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 9:50 AM, Henri Sivonen wrote:

> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right".

Testing with that file, Python 3 and OpenJDK 8 agree with the currently-specced best practice, too. I expect there to be other well-known implementations that comply with the currently-specced best practice, so the rationale to change the stated best practice would have to be very strong (as in: a security problem with the currently-stated best practice) for a change to be appropriate.

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 5/15/2017 11:50 PM, Henri Sivonen via Unicode wrote:

> On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote:
>> I’m not sure how the discussion of “which is better” relates to the discussion of ill-formed UTF-8 at all.
>
> Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I meant on that point and then move on:
>
> I acknowledge that UTF-16 as the internal memory representation is the dominant design. However, UTF-8 as the internal memory representation is *such a good design* (when legacy constraints permit) that, *despite it not being the current dominant design*, I think the Unicode Consortium should be fully supportive of UTF-8 as the internal memory representation and not treat UTF-16 as the internal representation as the one true way of doing things that gets considered when speccing stuff.

There are cases where it is prohibitive to transcode external data from UTF-8 to any other format, as a precondition to doing any work. In these situations processing has to be done in UTF-8, effectively making that the in-memory representation. I've encountered this issue on separate occasions, both for my own code as well as code I reviewed for clients. I therefore think that Henri has a point when he's concerned about tacit assumptions favoring one memory representation over another, but I think the way he raises this point is needlessly antagonistic.

> At the very least a proposal should discuss the impact on the "UTF-8 internally" case, which the proposal at hand doesn't do.

This is a key point. It may not be directly relevant to any other modifications to the standard, but the larger point is to not make assumptions about how people implement the standard (or any of the algorithms).

> (Moving on to a different point.)
>
> The matter at hand isn't, however, a new green-field (in terms of implementations) issue to be decided but a proposed change to a standard that has many widely-deployed implementations.
> Even when observing only "UTF-16 internally" implementations, I think it would be appropriate for the proposal to include a review of what existing implementations, beyond ICU, do.

I would like to second this as well. The level of documented review of existing implementation practices tends to be thin (at least thinner than should be required for changing long-established edge cases or recommendations, let alone core conformance requirements).

> Consider https://hsivonen.com/test/moz/broken-utf-8.html . A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right".

It would be good if the UTC could work out some minimal requirements for evaluating proposals for changes to properties and algorithms, much like the criteria for encoding new code points.

A./
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 15 May 2017, at 23:16, Shawn Steele via Unicode wrote:

> I’m not sure how the discussion of “which is better” relates to the discussion of ill-formed UTF-8 at all.

It doesn’t, which is a point I made in my original reply to Henri. The only reason I answered his anti-UTF-16 rant at all was to point out that some of us don’t think UTF-16 is a mistake, and in fact can see various benefits (*particularly* as an in-memory representation).

> And to the last, saying “you cannot process UTF-16 without handling surrogates” seems to me to be the equivalent of saying “you cannot process UTF-8 without handling lead & trail bytes”. That’s how the respective encodings work.

Quite.

Kind regards,

Alastair.

--
http://alastairs-place.net
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 6:23 AM, Karl Williamson wrote:

> On 05/15/2017 04:21 AM, Henri Sivonen via Unicode wrote:
>>
>> In reference to:
>> http://www.unicode.org/L2/L2017/17168-utf-8-recommend.pdf
>>
>> I think Unicode should not adopt the proposed change.
>>
>> The proposal is to make ICU's spec violation conforming. I think there is both a technical and a political reason why the proposal is a bad idea.
>
> Henri's claim that "The proposal is to make ICU's spec violation conforming" is a false statement, and hence all further commentary based on this false premise is irrelevant.
>
> I believe that ICU is actually currently conforming to TUS.

Do you mean that ICU's behavior differs from what the PDF claims (I didn't test and took the assertion in the PDF about behavior at face value) or do you mean that despite deviating from the currently-recommended best practice the behavior is conforming, because the relevant part of the spec is mere best practice and not a requirement?

> TUS has certain requirements for UTF-8 handling, and it has certain other "Best Practices" as detailed in 3.9. The proposal involves changing those recommendations. It does not involve changing any requirements.

Even so, I think even changing a recommendation of "best practice" needs way better rationale than "feels right" or "ICU already does it" when a) major browsers (which operate in the most prominent environment of broken and hostile UTF-8) agree with the currently-recommended best practice and b) the currently-recommended best practice makes more sense for implementations where "UTF-8 decoding" is actually mere "UTF-8 validation".

--
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On Tue, May 16, 2017 at 1:16 AM, Shawn Steele via Unicode wrote:
> I’m not sure how the discussion of “which is better” relates to the discussion of ill-formed UTF-8 at all.

Clearly, the "which is better" issue is distracting from the underlying issue. I'll clarify what I meant on that point and then move on:

I acknowledge that UTF-16 as the internal memory representation is the dominant design. However, UTF-8 as the internal memory representation is *such a good design* (when legacy constraints permit) that, *despite it not being the current dominant design*, I think the Unicode Consortium should be fully supportive of UTF-8 as the internal memory representation and not treat UTF-16 as the internal representation as the one true way of doing things that gets considered when speccing stuff.

I.e. I wasn't arguing against UTF-16 as the internal memory representation (for the purposes of this thread) but trying to motivate why the Consortium should consider "UTF-8 internally" equally despite it not being the dominant design.

So: When a decision could go either way from the "UTF-16 internally" perspective, but one way clearly makes more sense from the "UTF-8 internally" perspective, the "UTF-8 internally" perspective should be decisive in *such a case*. (I think the matter at hand is such a case.)

At the very least a proposal should discuss the impact on the "UTF-8 internally" case, which the proposal at hand doesn't do.

(Moving on to a different point.)

The matter at hand isn't, however, a new green-field (in terms of implementations) issue to be decided but a proposed change to a standard that has many widely-deployed implementations. Even when observing only "UTF-16 internally" implementations, I think it would be appropriate for the proposal to include a review of what existing implementations, beyond ICU, do.

Consider https://hsivonen.com/test/moz/broken-utf-8.html .
A quick test with three major browsers that use UTF-16 internally and have independent (of each other) implementations of UTF-8 decoding (Firefox, Edge and Chrome) shows agreement on the current spec: there is one REPLACEMENT CHARACTER per bogus byte (i.e. 2 on the first line, 6 on the second, 4 on the third and 6 on the last line). Changing the Unicode standard away from that kind of interop needs *way* better rationale than "feels right". -- Henri Sivonen hsivo...@hsivonen.fi https://hsivonen.fi/