Re: Corrigendum #9
On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham <richard.wording...@ntlworld.com> wrote:
> Much as I don't like their uninvited use, it is possible to pass them and other undesirables through most applications by a slight bit of recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:

What's the point? If we can use the PUA, then we don't need the noncharacters; we can just use the PUA directly. If we have to play around with remapping them, they're pointless; they're no easier to use in that case than ESC or '\' or PUA characters.

--
Kie ekzistas vivo, ekzistas espero.

___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 10:32 PM, David Starner <prosfil...@gmail.com> wrote:
> Why? It seems you're changing the rules ...

This isn't "are changing", it is "has changed". The Corrigendum was issued at the start of 2013, about 16 months ago, applicable to all relevant earlier versions. It was the result of fairly extensive debate inside the UTC; there hasn't been a single issue on this thread that wasn't considered during the discussions there. And as far back as 2001, the UTC made it clear that noncharacters *are* scalar values, and are to be converted by UTF converters. E.g., see http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance, one day before 9/11).

> probably trigger serious bugs in some lamebrained utility.

There were already plenty of programs that passed the noncharacters through; very few would filter them (some would delete them, which is horrible for security). Thinking that a utility would never encounter them in input text was a pipe-dream. If a utility or library is so fragile that it *breaks* on input of any valid UTF sequence, then it *is* a lamebrained utility. A good unit test for any production chain would be to check that there is no crash on any input scalar value (and, for that matter, on any ill-formed UTF text).
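Mark's suggested unit test can be sketched in a few lines of Java (a hedged illustration, not part of the original thread; the class name is invented): iterate over every Unicode scalar value and check that a UTF-8 round trip neither crashes nor alters the text, noncharacters included.

```java
import java.nio.charset.StandardCharsets;

// Sketch of the unit test suggested above: feed every Unicode scalar value
// (all code points except the surrogate range) through a UTF-8 round trip
// and verify nothing crashes or is altered -- noncharacters included.
public class ScalarValueFuzz {
    static boolean roundTripsAllScalars() {
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; // surrogates are not scalar values
            String s = new String(Character.toChars(cp));
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            if (!s.equals(new String(utf8, StandardCharsets.UTF_8))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllScalars() ? "all scalar values survive" : "FAILED");
    }
}
```

A production chain would run the same loop through its own converters rather than the JDK charset, but the shape of the test is the same.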
Re: Corrigendum #9
On 2014/06/03 07:08, Asmus Freytag wrote:
> On 6/2/2014 2:53 PM, Markus Scherer wrote:
>> On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfil...@gmail.com> wrote:
>>> I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility.
>> I don't expect handling these in web browsers and lamebrained utilities. I expect "treat like unassigned code points".
> Expecting them to be treated like unassigned code points shows that their use is a bad idea: Since when does the Unicode Consortium use unassigned code points (and the like) in plain sight? I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else.

I have to fully agree with Asmus, Richard, Shawn and others that the use of non-characters in CLDR is a very bad and dangerous example. However convenient the misuse of some of these codepoints in CLDR may be, it sets a very bad example for everybody else. Unicode itself should not just be twice as careful with the use of its own codepoints, but 10 times as careful.

I'd strongly suggest that, completely independent of when and how Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked out for how to get rid of these codepoints in CLDR data. The sooner, the better.

Regards, Martin.
Re: Corrigendum #9
On Mon, 2 Jun 2014 23:21:38 -0700 David Starner <prosfil...@gmail.com> wrote:
> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham <richard.wording...@ntlworld.com> wrote:
>> Using 99 = (3 + 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:
> What's the point? If we can use the PUA, then we don't need the noncharacters; we can just use the PUA directly. If we have to play around with remapping them, they're pointless; they're no easier to use in that case than ESC or '\' or PUA characters.

A search for the 2-character string '\n' would also find a substring of the 4-character string 'a\\n'. The PUA is in general not available for general utilities to make special use of.

Richard.
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 11:55 PM, Mark Davis ☕️ <m...@macchiato.com> wrote:
> Thinking that a utility would never encounter them in input text was a pipe-dream.

Thinking that a utility would never mangle them if encountered in input text was a pipe-dream.

> If a utility or library is so fragile that it breaks on input of any valid UTF sequence, then it is a lamebrained utility.

And? The world is filled with lamebrained utilities, and being cautious about what you take in can prevent one of those lamebrained utilities from turning into an exploit.

> A good unit test for any production chain would be to check there is no crash on any input scalar value (and for that matter, any ill-formed UTF text).

Right; and if you filter out stuff at the frontend, like ill-formed UTF text and noncharacters, you don't have to worry about what the middle end will do with them. I don't get what the goal of these changes was. It seems you've taken these characters away from programmers to use them in programs and given them to CLDR and anyone else willing to make their plain text files skirt the limits.

--
Kie ekzistas vivo, ekzistas espero.
Re: Corrigendum #9
On Tue, Jun 3, 2014 at 12:31 AM, Richard Wordingham <richard.wording...@ntlworld.com> wrote:
> On Mon, 2 Jun 2014 23:21:38 -0700 David Starner <prosfil...@gmail.com> wrote:
>> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham <richard.wording...@ntlworld.com> wrote:
>>> Using 99 = (3 + 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:
> The PUA is in general not available for general utilities to make special use of.

No, the PUA is not. Then where are you getting the 99 PUA characters you suggested using?

--
Kie ekzistas vivo, ekzistas espero.
Re: Corrigendum #9
On Tue, 3 Jun 2014 08:55:09 +0200 Mark Davis ☕️ <m...@macchiato.com> wrote:
> On Mon, Jun 2, 2014 at 10:32 PM, David Starner <prosfil...@gmail.com> wrote:
>> Why? It seems you're changing the rules ...
> This isn't "are changing", it is "has changed". The Corrigendum was issued at the start of 2013, about 16 months ago, applicable to all relevant earlier versions. It was the result of fairly extensive debate inside the UTC; there hasn't been a single issue on this thread that wasn't considered during the discussions there. And as far back as 2001, the UTC made it clear that noncharacters *are* scalar values, and are to be converted by UTF converters. E.g., see http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance, one day before 9/11).

But that says U+FDD0 is not to be externally interchanged!

Richard.
Re: Corrigendum #9
On Tue, Jun 3, 2014 at 9:41 AM, David Starner <prosfil...@gmail.com> wrote:
> Thinking that a utility would never mangle them if encountered in input text was a pipe-dream.

I didn't say "not mangle", I said "break", as in crash. I don't think this thread is going anywhere productive, so I'm signing off from it.
Re: Corrigendum #9
I think his point is that an application may want to encapsulate in a valid text any arbitrary stream of code points (including noncharacters, PUAs, or isolated surrogate code units found in 16-bit or 32-bit streams that are invalid UTF-16 or UTF-32 streams, or even arbitrary invalid bytes in 8-bit streams that are not valid UTF-8). For 8-bit streams, using ESC or '\' is generally a good choice of escape to derive a valid UTF-8 text stream. But for 16-bit and 32-bit streams, PUAs are more economical (though PUA code units found in the stream still need to be escaped).

If you think about the Java regexp \\uD800, it does not designate a code point but only a code unit, which is not valid plain text alone as it violates UTF-16 encoding rules. Trying to match it in a valid UTF-16 stream can work only if you can represent isolated code units for a specific encoding like UTF-16, even if the target stream to look for this match uses any other valid UTF (not necessarily UTF-16: decode the target text, re-encode it to UTF-16 to generate a 16-bit stream in which you'll look for isolated 16-bit code units with the regexp).

So yes, the regexp \\u (in Java source) is not used to match a single valid character.

2014-06-03 8:21 GMT+02:00 David Starner <prosfil...@gmail.com>:
> On Mon, Jun 2, 2014 at 4:33 PM, Richard Wordingham <richard.wording...@ntlworld.com> wrote:
>> Much as I don't like their uninvited use, it is possible to pass them and other undesirables through most applications by a slight bit of recoding at the application's boundaries. Using 99 = (3 + 32 + 64) PUA characters, one can ape UTF-16 surrogates and encode:
> What's the point? If we can use the PUA, then we don't need the noncharacters; we can just use the PUA directly. If we have to play around with remapping them, they're pointless; they're no easier to use in that case than ESC or '\' or PUA characters.
>
> --
> Kie ekzistas vivo, ekzistas espero.
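A minimal sketch of the boundary recoding Richard and Philippe describe (the class name, the `\u{XXXX}` escape syntax, and the choice of '\' as the escape character are all illustrative assumptions, not anything specified in the thread): noncharacters are escaped at the application boundary, so the middle of the pipeline only ever sees interchange-safe text.

```java
// Hypothetical boundary recoding (illustrative only): noncharacters are
// rewritten as "\u{XXXX}" escapes on the way in, so downstream components
// never see them; '\' itself is doubled so the mapping stays reversible.
public final class BoundaryRecoder {
    static boolean isNoncharacter(int cp) {
        // U+FDD0..U+FDEF, plus the last two code points of every plane
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    static String escape(String in) {
        StringBuilder out = new StringBuilder();
        in.codePoints().forEach(cp -> {
            if (cp == '\\') out.append("\\\\");
            else if (isNoncharacter(cp)) out.append(String.format("\\u{%04X}", cp));
            else out.appendCodePoint(cp);
        });
        return out.toString();
    }
}
```

The inverse transform at the output boundary is symmetric; as David notes above, the same trick works equally well with ESC or PUA characters as the escape, which is precisely his objection.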
Use of Unicode Symbol 26A0
Good Day - Just wondering if Unicode provides for, or anyone knows of, documentation for standard usage around the following symbol: U+26A0 WARNING SIGN (⚠). I noticed that it is used in many applications as a general warning or error symbol, but upon research it is also the symbol for personal injury, so there appears to be a conflict of meaning. Any information around standard usage of the symbol in software applications is appreciated.

Thank you!
Michelle
Re: Corrigendum #9
On 6/2/2014 3:08 PM, Asmus Freytag wrote:
> On 6/2/2014 2:53 PM, Markus Scherer wrote:
>> On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfil...@gmail.com> wrote:
>>> I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility.
>> I don't expect handling these in web browsers and lamebrained utilities. I expect "treat like unassigned code points".
> I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else.

Clarifying: I still haven't heard from anyone that this solves a general problem that is widespread. The only actual example has always been CLDR, and its decision to ship these code points in XML. Shipping these code points in files was pretty far down the list of "what not to do" when they were originally adopted. My view continues to be that this was a questionable design decision by CLDR, given what was on the record. The reaction of several outside implementers during this discussion makes clear that viewing that design as problematic is not just my personal view.

Usually, if there's a discrepancy between an implementation and Unicode, the reaction is not to retract conformance language. I think arriving at this decision was easier for the UTC because CLDR is not a random, unrelated implementation. And, as in any group, it's perhaps easier not to be as keenly aware of the impact on external implementations. So, I'd like to clarify that this is the sense in which I meant "special favor", which therefore is not the most felicitous expression to describe what I had in mind.
A./
Re: Use of Unicode Symbol 26A0
Michelle,

Unicode normally does not document all known usages of symbols. Occasionally, if a symbol is used in ways that might be unexpected from its name, the standard may add an alias or annotation. This is done in particular when there is a question of whether a given symbol is the correct choice for a given application - especially if Unicode contains multiple, similar symbols.

In this case, that does not seem to be so. The symbol is used for a variety of purposes, from warning to error to alerting readers to important information. These all seem to fit in the same general usage as suggested by the name, and the symbol is distinct enough that there is no other symbol in Unicode that might suggest itself as an alternate. The use to warn about risk of personal injury would not seem to demand additional clarification.

A./

On 6/3/2014 7:25 AM, Papendick, Michelle wrote:
> Good Day - Just wondering if Unicode provides for, or anyone knows of, documentation for standard usage around the following symbol: U+26A0 WARNING SIGN. Noticed that it is used in many applications as a general warning or error symbol, but upon research it is also the symbol for personal injury so appears to be a conflict of meaning. Any information around standard usage of the symbol in software applications is appreciated. Thank you! Michelle
Re: Corrigendum #9
Nicely put.

A./

On 6/3/2014 12:09 AM, Martin J. Dürst wrote:
> On 2014/06/03 07:08, Asmus Freytag wrote:
>> On 6/2/2014 2:53 PM, Markus Scherer wrote:
>>> On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfil...@gmail.com> wrote:
>>>> I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility.
>>> I don't expect handling these in web browsers and lamebrained utilities. I expect "treat like unassigned code points".
>> Expecting them to be treated like unassigned code points shows that their use is a bad idea: Since when does the Unicode Consortium use unassigned code points (and the like) in plain sight? I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, ...
> I have to fully agree with Asmus, Richard, Shawn and others that the use of non-characters in CLDR is a very bad and dangerous example. However convenient the misuse of some of these codepoints in CLDR may be, it sets a very bad example for everybody else. Unicode itself should not just be twice as careful with the use of its own codepoints, but 10 times as careful. I'd strongly suggest that completely independent of when and how Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked out for how to get rid of these codepoints in CLDR data. The sooner, the better.
> Regards, Martin.
Re: Use of Unicode Symbol 26A0
2014-06-03 19:13, Asmus Freytag wrote:
> Unicode normally does not document all known usages of symbols.

Not to mention unknown usages. Characters will be used in different ways, no matter what the Unicode Standard says, and it would be mostly pointless to put restrictions on it. In some cases, however, some types of usage are warned against, or better approaches are suggested.

> The symbol is used for a variety of purposes, from warning to error to alerting readers to important information. These all seem to fit in the same general usage as suggested by the name, and the symbol is distinct enough so that there is no other symbol in Unicode that might suggest itself as an alternate.

Right, but if we consider the use of WARNING SIGN as a text character, or contexts where an image resembling WARNING SIGN is used and WARNING SIGN could well be used (with the usual caveats), then it seems generally to indicate a warning message, as opposed to an error message on the one hand, and a purely informative note on the other. The use of graphic symbols similar to WARNING SIGN e.g. in traffic signs is really a different issue and external to Unicode, as it is not about characters, though it might be tangentially related.

> The use to warn about risk of personal injury would not seem to demand additional clarification.

On the practical side, it might be in order to warn against usage that relies on some particular interpretation like that. What I mean is that it is OK to use WARNING SIGN as a warning about risk of personal injury, but questionable to expect that people will generally take it that way (and not more loosely as a warning of some kind).

Yucca
UTF-16 Encoding Scheme and U+FFFE
How do I read definition D98 in TUS Version 6.3.0 Chapter 3 to prohibit a file in the UTF-16 encoding scheme from starting with U+FFFE? Or is U+FFFE actually allowed to start such a file? Is an implementation that deduces the encoding scheme of a plain text file from a leading BOM to be characterised as reckless?

Richard.
Re: Corrigendum #9
On Tue, 03 Jun 2014 16:09:27 +0900 Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:
> I'd strongly suggest that completely independent of when and how Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked out for how to get rid of these codepoints in CLDR data. The sooner, the better.

I suspect this has already been done. I know of no CLDR text files still containing them.

Richard.
RE: UTF-16 Encoding Scheme and U+FFFE
There's never been anything preventing a file from containing, and beginning with, U+FFFE. It's just not a very useful thing to do, hence not very likely.

Peter

-----Original Message-----
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard Wordingham
Sent: June 3, 2014 11:53 AM
To: unicode@unicode.org
Subject: UTF-16 Encoding Scheme and U+FFFE
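To make the BOM question concrete, here is a sketch (not from the thread; the class and method names are invented) of the usual BOM-sniffing heuristic Richard asks about. It also illustrates why such deduction is risky: UTF-16BE text that legitimately begins with U+FFFE starts with the bytes FF FE, which the heuristic misreports as UTF-16LE with a BOM.

```java
// Heuristic BOM sniffing (a sketch, not normative): returns the encoding
// scheme suggested by a leading byte-order mark, or null if none is found.
public class BomSniffer {
    static String sniff(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) return "UTF-8";
        // UTF-32 checks must come before UTF-16: FF FE 00 00 also starts with FF FE
        if (b.length >= 4 && b[0] == 0 && b[1] == 0
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) return "UTF-32BE";
        if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0 && b[3] == 0) return "UTF-32LE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
        return null; // no BOM: the scheme must come from a higher-level protocol
    }
}
```

Feeding this a UTF-16BE file whose first character is U+FFFE returns "UTF-16LE", exactly the misinterpretation the thread warns about when relying solely on initial BOMs.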
Re: Unicode Regular Expressions, Surrogate Points and UTF-8
On 06/02/2014 01:01 PM, Richard Wordingham wrote:
> On Mon, 2 Jun 2014 11:29:09 +0200 Mark Davis ☕️ <m...@macchiato.com> wrote:
>> \uD808\uDF45 specifies a sequence of two codepoints.
> That is simply incorrect. The above is in the sample notation of UTS #18 Version 17 Section 1.1. From what I can make out, the corresponding Java notation would be \x{D808}\x{DF45}. I don't *know* what \x{D808} and \x{DF45} match in Java, or whether they are even acceptable. The only thing UTS #18 RL1.7 permits them to match in Java is lone surrogates, but I don't know if Java complies.

The notation \uD808\uDF45 is interpreted as a supplementary code point and is represented internally as a pair of surrogates in String.

    Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find()  -> false
    Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find()        -> true
    Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find()           -> false
    Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find()          -> true

-Sherman

> All UTS #18 says for sure about regular expressions matching code units is that they don't satisfy RL1.1, though Section 1.7 appears to ban them when it says, "A fundamental requirement is that Unicode text be interpreted semantically by code point, not code units." Perhaps it's a fundamental requirement of something other than UTS #18. I thought matching parts of characters in terms of their canonical equivalences was awkward enough, without having the additional option of matching some of the code units!
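Sherman's four results can be checked with a small self-contained program (a sketch reproducing the behavior he reports; run with `java -ea` so the assertions fire):

```java
import java.util.regex.Pattern;

// Reproduces the four results reported above: Java's Pattern matches by
// code point, so "\uD808\uDF45" in a pattern string is one supplementary
// code point (U+12345), while \x{D808} can only match an *unpaired*
// high surrogate in the input.
public class SurrogateRegex {
    static boolean find(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).find();
    }

    public static void main(String[] args) {
        assert !find("\\x{D808}\\x{DF45}", "\ud808\udf45"); // surrogate-point patterns never match a pair
        assert  find("\uD808\uDF45", "\ud808\udf45");       // one code point matches itself
        assert !find("\\x{D808}", "\ud808\udf45");          // the high surrogate is part of a pair here
        assert  find("\\x{D808}", "\ud808_\udf45");         // ...but unpaired here, so it matches
        System.out.println("all four results confirmed");
    }
}
```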
Re: Unicode Regular Expressions, Surrogate Points and UTF-8
On Tue, 03 Jun 2014 15:06:30 -0700 Xueming Shen <xueming.s...@oracle.com> wrote:
> The notation \uD808\uDF45 is interpreted as a supplementary code point and is represented internally as a pair of surrogates in String.
>
>     Pattern.compile("\\x{D808}\\x{DF45}").matcher("\ud808\udf45").find()  -> false
>     Pattern.compile("\uD808\uDF45").matcher("\ud808\udf45").find()        -> true
>     Pattern.compile("\\x{D808}").matcher("\ud808\udf45").find()           -> false
>     Pattern.compile("\\x{D808}").matcher("\ud808_\udf45").find()          -> true

Thank you for providing examples confirming that what in the UTS #18 *sample* notation would be written \uD808\uDF45, i.e. \x{D808}\x{DF45} in Java notation, matches nothing in any 16-bit Unicode string.

Richard.
Re: UTF-16 Encoding Scheme and U+FFFE
On Tue, 3 Jun 2014 21:28:05 + Peter Constable <peter...@microsoft.com> wrote:
> There's never been anything preventing a file from containing and beginning with U+FFFE. It's just not a very useful thing to do, hence not very likely.

Well, while U+FFFE was apparently prohibited from public interchange, one could be very confident of not finding it in an external file. A file starting with it would then be an internally generated one, and much more likely to be in the UTF-16BE or UTF-16LE encoding scheme.

Richard.
RE: UTF-16 Encoding Scheme and U+FFFE
You cannot even be very confident of not finding actual ill-formed UTF-16, like unpaired surrogates, in an external file, let alone noncharacters.

As for the noncharacters, take a look at the collation test files that we distribute with each version of UCA. The test data includes test strings like the following, to verify that UCA implementations do the correct thing when faced with unusual edge cases:

FFFE 0021
FFFE 003F
FFFE 0061
FFFE 0041
FFFE 0062
1FFFE 0021
1FFFE 003F
1FFFE 0334
...

As well as test strings starting with unpaired surrogates:

D800 0021
D800 003F
D800 0061
D800 0041
D800 0062

And while it is true that the *file* CollationTest_SHIFTED.txt doesn't start with either a noncharacter or an unpaired surrogate -- because all of the test data in it is represented in ASCII hex strings instead of directly in UTF-16 -- the issue in any case isn't whether a *file* starts with a noncharacter, but whether a UTF-16 *string* starts with a noncharacter. Any one of those test strings could be trivially turned into a text file by piping out that one UTF-16 string to a file. And I could then write conformant test software that would read UTF-16 string input data from that file and run it through the UCA algorithm to construct sort keys for it.

As Peter said, the main thing that prevents running into these is that it isn't very *useful* to start off files (or strings) with U+FFFE. (And, additionally, in the case of UTF-16 text data files, it would be confusing and possibly lead to misinterpretation of byte order if you were somehow depending solely on initial BOMs -- which I wouldn't advise, anyway.)

Basically, the rules of standards (e.g., "you shouldn't try to publicly interchange noncharacters") are not like laws of physics. Just because the standard says you shouldn't do it doesn't mean it doesn't happen.
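Ken's point that any of these hex test lines can be turned into a live UTF-16 string is easy to demonstrate (a sketch of what a conformance harness might do; the class and method names are illustrative, not actual UCA tooling). Java's `StringBuilder.appendCodePoint` accepts noncharacters and even lone surrogate code points, so the edge-case strings come through intact.

```java
// Sketch: convert one collation test line of space-separated hex code
// points (e.g. "FFFE 0021" or "D800 0021") into a UTF-16 Java String.
// Noncharacters and unpaired surrogates both survive the conversion.
public class TestLineParser {
    static String parseTestLine(String line) {
        StringBuilder sb = new StringBuilder();
        for (String field : line.trim().split("\\s+")) {
            sb.appendCodePoint(Integer.parseInt(field, 16)); // accepts surrogate code points too
        }
        return sb.toString();
    }
}
```

Piping the result of `parseTestLine("FFFE 0021")` to a file produces exactly the kind of UTF-16 text file, starting with U+FFFE, that the thread is discussing.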
--Ken