Re: If X sorts before Y, then XZ sorts before YZ ... example of where that's not true?
On 07/01/2013, Costello, Roger L. coste...@mitre.org wrote: Hi Folks, In the book, Unicode Demystified (p. xxii) it says: An English-speaking programmer might assume, for example, that given the three characters X, Y, and Z, that if X sorts before Y, then XZ sorts before YZ. This works for English, but fails for many languages. Would you give an example of where character 1 sorts before character 2 but character 1, character 3 does not sort before character 2, character 3? /Roger Look at the collation for Dzongkha or Tibetan: http://developer.mimer.com/charts/dzongkha.htm http://developer.mimer.com/charts/tibetan.htm
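[A self-contained way to see the effect, without relying on any particular locale's data, is to declare the contraction explicitly in an ICU tailoring. The following is only a sketch in Java with ICU4J; the rule and the test strings are illustrative and are not taken from the Dzongkha or Tibetan tailorings. The rule makes "ch" sort as its own "letter" after "h", so "c" sorts before "d" and yet "ch" sorts after "dh":

    import com.ibm.icu.text.RuleBasedCollator;

    public class ContractionDemo {
        public static void main(String[] args) throws Exception {
            // Illustrative tailoring: the digraph "ch" sorts as a separate "letter" after "h".
            RuleBasedCollator coll = new RuleBasedCollator("&h < ch");
            System.out.println(coll.compare("c", "d"));    // negative: "c" sorts before "d"
            System.out.println(coll.compare("ch", "dh"));  // positive: "ch" sorts after "dh"
        }
    }

The same pattern, with real tailoring data, is what breaks the "XZ before YZ" assumption in the languages mentioned above.]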
Re: Why is endianness relevant when storing data on disks but not when in memory?
Doug Ewell, Sun, 6 Jan 2013 20:57:58 -0700: We are pretty much going round and round on this. The bottom line for me is, it would be nice if there were a shorthand way of saying big-endian UTF-16, and many people (including you?) feel that UTF-16BE is that way, but it is not. That term has a DIFFERENT MEANING. The following stream: FE FF 00 48 00 65 00 6C 00 6C 00 6F is valid big-endian UTF-16, but it is NOT valid UTF-16BE unless the leading U+FEFF is explicitly meant as a zero-width no-break space, which may not be stripped. I don't remember if the RFC defines one of the 3 MIME charsets as the default, but given that UTF-16 is supposed to be used whenever one doesn't know the endianness, it seems logical to assume that the above example should by default be treated as UTF-16. But apart from that, we can also say that the example is also not valid UTF-16 unless the U+FEFF is meant as a BOM … I see the 3 as 3 MIME charsets. It does anyhow seem like a definition question. -- leif h silli
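[The distinction is easy to observe in code. A minimal sketch in Java, using nothing beyond the standard charsets: decoding the byte stream above with the "UTF-16" charset consumes the BOM, while decoding it as "UTF-16BE" keeps U+FEFF as an ordinary character in the result.

    import java.nio.charset.StandardCharsets;

    public class BomDemo {
        public static void main(String[] args) {
            byte[] bytes = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x48, 0x00, 0x65,
                            0x00, 0x6C, 0x00, 0x6C, 0x00, 0x6F};
            String asUtf16   = new String(bytes, StandardCharsets.UTF_16);    // BOM detected and removed
            String asUtf16be = new String(bytes, StandardCharsets.UTF_16BE);  // FE FF decoded as U+FEFF
            System.out.println(asUtf16.length());    // 5 ("Hello")
            System.out.println(asUtf16be.length());  // 6 (U+FEFF followed by "Hello")
        }
    }
]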
Re: Why is endianness relevant when storing data on disks but not when in memory?
Doug Ewell, Sun, 6 Jan 2013 20:57:58 -0700: The bottom line for me is, it would be nice if there were a shorthand way of saying big-endian UTF-16, and many people (including you?) feel that UTF-16BE is that way, but it is not. One could say UTF-16, big-endian. Or big-endian UTF-16. That’s pretty short. That term has a DIFFERENT MEANING. The following stream: FE FF 00 48 00 65 00 6C 00 6C 00 6F is valid big-endian UTF-16, but it is NOT valid UTF-16BE unless the leading U+FEFF is explicitly meant as a zero-width no-break space, which may not be stripped. I believe I understand this reasonably well. I think we are looking for a term that is unaffected by how we label it. leif halvard silli
Re: What does it mean to not be a valid string in Unicode?
Unicode libraries commonly provide functions that take a code point and return a value, for example a property value. Such a function normally accepts the whole range 0..10FFFF (and may even return a default value for out-of-range inputs). Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. markus
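[As a sketch of the shape of such an API, in Java; the function name is made up and it simply wraps java.lang.Character:

    public class PropertyLookupDemo {
        // Accepts any int and returns a general-category value, with a default
        // for out-of-range input rather than an error (the name is hypothetical).
        static int generalCategory(int codePoint) {
            if (codePoint < 0 || codePoint > 0x10FFFF) {
                return Character.UNASSIGNED;       // default value for out-of-range input
            }
            return Character.getType(codePoint);   // a lone surrogate yields Character.SURROGATE
        }

        public static void main(String[] args) {
            System.out.println(generalCategory('A'));      // 1 = UPPERCASE_LETTER
            System.out.println(generalCategory(0xD800));   // 19 = SURROGATE: the code point is accepted
            System.out.println(generalCategory(0x110000)); // 0 = UNASSIGNED: out of range, no exception
        }
    }
]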
RE: What does it mean to not be a valid string in Unicode?
Markus Scherer markus dot icu at gmail dot com wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. But still non-conformant. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
Re: What does it mean to not be a valid string in Unicode?
On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote: Markus Scherer markus dot icu at gmail dot com wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. But still non-conformant. Not really, that's why there is a definition of a 16-bit Unicode string in the standard. markus
Re: What does it mean to not be a valid string in Unicode?
But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is non-conformant *TO*. Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). - That *is* conformant for *Unicode 16-bit strings.* - That is *not* conformant for *UTF-16*. There is an important difference. Mark https://plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene — On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote: But still non-conformant.
RE: What does it mean to not be a valid string in Unicode?
You're right, and I stand corrected. I read Markus's post too quickly. Mark Davis ☕ mark at macchiato dot com wrote: But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is non-conformant TO. Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). + That is conformant for Unicode 16-bit strings. + That is not conformant for UTF-16. There is an important difference. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
Re: What does it mean to not be a valid string in Unicode?
Well then I don't know why you need a definition of a Unicode 16-bit string. For me it just means exactly the same as 16-bit string, and the encoding in it is not relevant given that you can put anything in it without even needing to be conformant to Unicode. So a Java string is exactly the same, a 16-bit string. The same also as Windows API 16-bit strings, or wide strings in a C compiler where wide is mapped by a compiler option to 16-bit code units for wchar_t (or short, but more safely as UINT16 if you don't want to be dependent on compiler options or OS environments when compiling, when you need to manage the exact memory allocation), or the same as a U-string in Perl. Only UTF-16 (not UTF-16BE and UTF-16LE, which are encoding schemes with concrete byte orders and without any leading BOM) is relevant to Unicode, because a 16-bit string does not itself specify any encoding scheme or byte order. One confusion comes with the name UTF-16 when it is also used as an encoding scheme with a possible leading BOM and an implied default UTF-16LE determined by guesses on the first few characters: this encoding scheme (with support of a BOM and an implicit guess of byte order if it's missing) should have been given a distinct encoding name like UTF-16XE, reserving UTF-16 for what the standard discusses as a 16-bit string, except that it should still require UTF-16 conformance (no unpaired surrogates and no non-characters) plus **no** BOM supported for this level (which is still not materialized by a concrete byte order or by an implicit size in storage bits, as long as it can store distinctly the whole range of code units 0x0000..0xFFFF minus the few non-characters, enforces all surrogates to be paired, but does not enforce any character to be allocated). Note that such a relaxed version of UTF-16 would still allow an internal alternate representation of 0x0000 for interoperating with various APIs without changing the storage requirement: 0xFFFF could perfectly well be used to replace 0x0000 if that last code unit plays a special role as a string terminator. But even if this is done, a storage unit like 0xFFFF would still be perceived as if it was really the code unit 0x0000. In other words, the concept of a completely relaxed Unicode 16-bit string is unneeded, given that its single requirement is to make sure that it defines a length in terms of 16-bit code units, with code units being large enough to store any unsigned 16-bit value (internally it could still be 18-bit on systems with 6-bit or 9-bit addressable memory cells; the sizeof() property of these code units could still be 2, or 3, or other, as long as it is large enough to store the value). On some devices (not so exotic...) there are memory areas that are 4-bit addressable or even 1-bit addressable (in that latter case the sizeof() property for the code unit type would return 16, not 2). Some devices only have 16-bit or 32-bit addressable memory and sizeof() would return 1 (and the C types char and wchar_t would most likely be the same). 2013/1/7 Doug Ewell d...@ewellic.org: You're right, and I stand corrected. I read Markus's post too quickly. Mark Davis ☕ mark at macchiato dot com wrote: But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is non-conformant TO. Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). + That is conformant for Unicode 16-bit strings.
+ That is not conformant for UTF-16. There is an important difference. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
RE: What does it mean to not be a valid string in Unicode?
Philippe Verdy said: Well then I don't know why you need a definition of a Unicode 16-bit string. For me it just means exactly the same as 16-bit string, and the encoding in it is not relevant given that you can put anything in it without even needing to be conformant to Unicode. So a Java string is exactly the same, a 16-bit string. The same also as Windows API 16-bit strings, or wide strings in a C compiler where wide is mapped by a compiler option to 16-bit code units for wchar_t ... And elaborating on Mark's response a little: [0x0061,0x0062,0x4E00,0xFFFF,0x0410] is a Unicode 16-bit string. It contains a, b, a Han character, a noncharacter, and a Cyrillic character. Because it is also well-formed as UTF-16, it is also a UTF-16 string, by the definitions in the standard. [0x0061,0xD800,0x4E00,0xFFFF,0x0410] is a Unicode 16-bit string. It contains a, a high-surrogate code unit, a Han character, a noncharacter, and a Cyrillic character. Because an unpaired high-surrogate code unit is not allowed in UTF-16, this is *NOT* a UTF-16 string. On the other hand, consider: [0x0061,0x0062,0x88EA,0x8440] That is *NOT* a Unicode 16-bit string. It contains a, b, a Han character, and a Cyrillic character. How do I know? Because I know the character set context. It is a wchar_t implementation of the Shift-JIS code page 932. The difference is the declaration of the standard one uses to interpret what the 16-bit units mean. In a Unicode 16-bit string I go to the Unicode Standard to figure out how to interpret the numbers. In a wide Code Page 932 string I go to the specification of Code Page 932 to figure out how to interpret the numbers. This is no different, really, than talking about a Latin-1 string versus a KOI-8 string. --Ken
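[The same distinction can be checked mechanically. A small sketch in Java, whose String type is a Unicode 16-bit string in exactly this sense, using only the standard charset API: the string with the unpaired high surrogate is a perfectly good Java string, but a strict UTF-16 encoder rejects it. The string below follows Ken's second example, with U+FFFF standing in as the noncharacter.

    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class SixteenBitVsUtf16 {
        public static void main(String[] args) {
            String s = "\u0061\uD800\u4E00\uFFFF\u0410";  // a, lone high surrogate, Han, noncharacter, Cyrillic
            System.out.println(s.length());               // 5: any sequence of 16-bit units is a valid Java string

            CharsetEncoder enc = StandardCharsets.UTF_16BE.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                enc.encode(CharBuffer.wrap(s));
                System.out.println("well-formed UTF-16");
            } catch (CharacterCodingException e) {
                System.out.println("not valid UTF-16: unpaired surrogate");  // this branch is taken
            }
        }
    }

Note that the noncharacter U+FFFF passes through the encoder without complaint; only the unpaired surrogate is rejected, which matches Ken's follow-up below.]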
RE: What does it mean to not be a valid string in Unicode?
Philippe also said: ... Reserving UTF-16 for what the standard discusses as a 16-bit string, except that it should still require UTF-16 conformance (no unpaired surrogates and no non-characters) ... For those following along, conformance to UTF-16 does *NOT* require the absence of noncharacters. Noncharacters are perfectly valid in UTF-16. --Ken
Are there Unicode processors?
Hi Folks, An XML processor breaks up an XML document into its parts -- here's a start tag, here's element content, here's an end tag, etc. -- and then makes those parts (along with information about each part such as this part is a start tag and this part is element content) available to XML applications via an API. Are there Unicode processors? That is, are there processors that break up Unicode text into its parts -- here's a character, here's another character, here's still another character, etc. -- and then make those parts (along with information about each part such as this part is the Latin Capital Letter T and this part is the Latin Small Letter o) available to Unicode applications (such as XML processors) via an API? I did a Google search for Unicode processor and came up empty so I am guessing the answer is that there are no Unicode processors. Or perhaps they go by a different name? If there are no Unicode processors, why not? /Roger
Re: Are there Unicode processors?
On Mon, Jan 7, 2013 at 2:34 PM, Costello, Roger L. coste...@mitre.org wrote: Are there Unicode processors? That is, are there processors that break up Unicode text into its parts -- here's a character, here's another character, here's still another character, etc. -- and then make those parts (along with information about each part such as this part is the Latin Capital Letter T and this part is the Latin Small Letter o) available to Unicode applications (such as XML processors) via an API? I did a Google search for Unicode processor and came up empty so I am guessing the answer is that there are no Unicode processors. Or perhaps they go by a different name? If there are no Unicode processors, why not? I don't really think I understand what you want. K&R C had this, at least for the ASCII subset of Unicode; it has arrays of characters and you can access each character individually. If you want to know if the third character in your array s is the Latin capital letter T, you write s[2] == 'T'. If you want to know if it's a letter, you write isalpha(s[2]). Naturally speaking, Unicode support is slightly more complex, but it's still a matter of sequences of characters and functions to query the properties. It's plain text, it doesn't have XML's complex hierarchical features. -- Kie ekzistas vivo, ekzistas espero.
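[Roughly the same idiom carries over to a Unicode-aware language. A sketch of the Java equivalent, with an illustrative string and indices adjusted to it:

    public class CharQueryDemo {
        public static void main(String[] args) {
            String s = "To\u00DF";  // "Toß"
            // Analogue of s[0] == 'T' in C: is the first code unit the Latin capital letter T?
            System.out.println(s.charAt(0) == 'T');                    // true
            // Analogue of isalpha(s[2]): is the third character a letter?
            System.out.println(Character.isLetter(s.codePointAt(2)));  // true (ß is a letter)
        }
    }
]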
Re: Are there Unicode processors?
That is not the typical way that Unicode text is processed. Typically whatever OS you are using will supply mechanisms for iterating through any Unicode string, returning each of the code points. It may also offer APIs for returning information about each character (called 'property values'), or you can get libraries like ICU (http://site.icu-project.org/) that have full-featured property support (http://userguide.icu-project.org/strings/properties). Mark https://plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene — On Mon, Jan 7, 2013 at 2:34 PM, Costello, Roger L. coste...@mitre.org wrote: Hi Folks, An XML processor breaks up an XML document into its parts -- here's a start tag, here's element content, here's an end tag, etc. -- and then makes those parts (along with information about each part such as this part is a start tag and this part is element content) available to XML applications via an API. Are there Unicode processors? That is, are there processors that break up Unicode text into its parts -- here's a character, here's another character, here's still another character, etc. -- and then make those parts (along with information about each part such as this part is the Latin Capital Letter T and this part is the Latin Small Letter o) available to Unicode applications (such as XML processors) via an API? I did a Google search for Unicode processor and came up empty so I am guessing the answer is that there are no Unicode processors. Or perhaps they go by a different name? If there are no Unicode processors, why not? /Roger
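[A tiny concrete sketch of that kind of iteration, in plain Java (assuming Java 8+ so that the emoji below has a name in the JDK's Unicode data; no ICU needed; the sample string is made up):

    public class CodePointWalk {
        public static void main(String[] args) {
            String s = "To\uD83D\uDE00";  // 'T', 'o', and U+1F600 (one code point, two code units)
            s.codePoints().forEach(cp ->
                System.out.printf("U+%04X  %s  (category %d)%n",
                    cp, Character.getName(cp), Character.getType(cp)));
        }
    }

This prints one line per code point, e.g. "U+0054  LATIN CAPITAL LETTER T  (category 1)", which is essentially the "here's a character, and here's what it is" view the question asks about.]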
RE: Are there Unicode processors?
Unicode processor?? If what you're looking for is code that breaks text into grapheme clusters/words/lines/etc., that's called text segmentation and is described in: http://www.unicode.org/reports/tr29/ But you go on to talk about characters and their properties.. if you're looking for APIs that provide access to stuff like Unicode character properties, programming languages or libraries provide such capabilities (Java, perl, Python, ICU...) in various appropriate ways. See, for example: http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html Or: http://perldoc.perl.org/5.14.0/perlunicode.html#Unicode-Character-Properties Or: http://userguide.icu-project.org/strings/properties Addison Addison Phillips Globalization Architect (Lab126) Chair (W3C I18N WG) Internationalization is not a feature. It is an architecture. -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Costello, Roger L. Sent: Monday, January 07, 2013 2:35 PM To: unicode@unicode.org Subject: Are there Unicode processors? Hi Folks, An XML processor breaks up an XML document into its parts -- here's a start tag, here's element content, here's an end tag, etc. -- and then makes those parts (along with information about each part such as this part is a start tag and this part is element content) available to XML applications via an API. Are there Unicode processors? That is, are there processors that break up Unicode text into its parts -- here's a character, here's another character, here's still another character, etc. -- and then makes those parts (along with information about each part such as this part is the Latin Capital Letter T and this part is the Latin Small Letter o) available to Unicode applications (such as XML processors) via an API? I did a Google search for Unicode processor and came up empty so I am guessing the answer is that there are no Unicode processors. Or perhaps they go by a different name? If there are no Unicode processors, why not? /Roger
Re: What does it mean to not be a valid string in Unicode?
On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. Regards, Martin.
Re: What does it mean to not be a valid string in Unicode?
That's not the point (see successive messages). Mark https://plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene — On Mon, Jan 7, 2013 at 4:59 PM, Martin J. Dürst due...@it.aoyama.ac.jp wrote: On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. Regards, Martin.
Q is a Roman numeral?
This isn't directly related to Unicode, but I thought this would be a good place to ask. Specifically, I'm curious about figure 14 (Gordon 1982) from WG2 N3218 [http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3218.pdf], which says: Whereas our so-called Arabic numerals are ten in number (0–9), the Roman numerals number nine: I = 1 (one), V = 5, X = 10, L = 50, C = 100, Đ = 500 (D regularly with middle bar, the modern form being simply D), a symbol for 1,000 (see below), Q = 500,000, and a rather strange symbol for 6: ↅ. Now that Q = 500,000 bit seems a little odd to me. I've never seen that anywhere else. Does anyone know where it came from? Is there real usage of Q for 500,000? —Ben Scarborough
RE: What does it mean to not be a valid string in Unicode?
Martin, The kind of situation Markus is talking about is illustrated particularly well in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to this issue: http://www.unicode.org/reports/tr10/#Handline_Illformed When weighting Unicode 16-bit strings for collation, you can, of course, always detect an unpaired surrogate and return an error code or throw an exception, but that may not be the best strategy for an implementation. The problem derives in part from the fact that for sorting, the comparison routine is generally buried deep down as a primitive comparison function in what may be a rather complicated sorting algorithm. Those algorithms often assume that the comparison routine is analogous to strcmp(), and will always return -1/0/1 (or negative/0/positive), and that it is not going to fail because it decides that some byte value in an input string is not valid in some particular character encoding. (Of course, the calling code needs to ensure it isn't handing off null pointers or unallocated objects, but that is par for the course for any string handling.) Now if I want to adapt a particular sorting algorithm so it uses a UCA-compliant, multi-level collation algorithm for the actual string comparison, then by far the easiest way to do so is to build a function essentially comparable to strcmp() in structure, e.g. UCA_strcmp(context, string1, string2), which also always returns -1/0/1 for any two Unicode 16-bit strings. If I introduce a string validation aspect to this comparison routine, and return an error code or raise an exception, then I run the risk of marginally slowing down the most time-critical part of the sorting loop, as well as complicating the adaptation of the sorting code, to deal with extra error conditions. It is faster, more reliable and robust, and easier to adapt the code, if I simply specify for the weighting exactly what happens to any isolated surrogate in input strings, and compare accordingly. Hence the two alternative strategies suggested in Section 7.1.1 of UTS #10: either weight each maximal ill-formed subsequence as if it were U+FFFD (with a primary weight), or weight each surrogate code point with a generated implicit weight, as if it were an unassigned code point. Either strategy works. And in fact, the conformance tests in CollationTest.zip for UCA include some ill-formed strings in the test data, so that implementations can test their handling of them, if they choose. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) --Ken -Original Message- On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. Regards, Martin.
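[To make the shape of that concrete, here is a minimal sketch in Java with ICU4J, whose collator follows the UTS #10 handling of ill-formed input; the strcmp-style wrapper and its name are purely illustrative. The comparison primitive always yields a sign and never an error, even when one input contains an unpaired surrogate, so it can be dropped straight into a generic sort:

    import com.ibm.icu.text.Collator;
    import java.util.Arrays;

    public class UcaStrcmpDemo {
        // strcmp-like primitive: always returns negative/zero/positive, never throws
        // for ill-formed 16-bit input (the name is illustrative).
        static int ucaStrcmp(Collator coll, String a, String b) {
            return Integer.signum(coll.compare(a, b));
        }

        public static void main(String[] args) {
            Collator coll = Collator.getInstance();             // UCA/CLDR-based collator
            String[] data = { "zebra", "a\uD800b", "apple" };   // middle string has a lone surrogate
            Arrays.sort(data, (x, y) -> ucaStrcmp(coll, x, y)); // sorting proceeds without errors
            System.out.println(Arrays.toString(data));
        }
    }
]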
RE: What does it mean to not be a valid string in Unicode?
http://www.unicode.org/reports/tr10/#Handline_Illformed Grrr. http://www.unicode.org/reports/tr10/#Handling_Illformed I seem unable to handle ill-formed spelling today. :( --Ken
RE: Q is a Roman numeral?
I'm gonna take a wild stab here and assume that this is Q as the medieval Latin abbreviation for quingenti, which usually means 500, but also gets glossed just as a big number, as in milia quingenta, 'thousands upon thousands'. Maybe some medieval scribe substituted a Q for |V| (with an overscore on the V), which would be the more normal way to write 5,000 and then 500,000. --Ken Now that Q = 500,000 bit seems a little odd to me. I've never seen that anywhere else. Does anyone know where it came from? Is there real usage of Q for 500,000? —Ben Scarborough
Re: Are there Unicode processors?
Costello, Roger L. wrote: Are there Unicode processors? Bottom line, you need to be more specific about what level of processing you are talking about. As many have said, parsing a byte stream into UTF-{8, 16, 32} characters is everywhere. Converting between normalization forms is a bit less common. Intricate text analysis is generally the domain of specialized tools. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
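[The normalization step, for instance, is a one-liner with the standard library. A sketch in Java (java.text.Normalizer has been in the JDK since Java 6; the sample string is illustrative):

    import java.text.Normalizer;

    public class NormalizeDemo {
        public static void main(String[] args) {
            String decomposed = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT
            String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(composed.length());          // 1: the single code point U+00E9
            System.out.println(composed.equals("\u00E9"));  // true
        }
    }
]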
Re: What does it mean to not be a valid string in Unicode?
Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Wouldn't the clean way be to ensure valid strings (only) when they're built and then make sure that string algorithms (only) preserve well-formedness of input? Perhaps this is how the system grew, but it seems to me that it's yet another legacy of C pointer arithmetic and about convenience of implementation rather than a safety or performance issue. Stephan
Re: What does it mean to not be a valid string in Unicode?
In practice and by design, treating isolated surrogates the same as reserved code points in processing, and then cleaning up on conversion to UTFs, works just fine. It is a tradeoff that is up to the implementation. It has nothing to do with a legacy of C pointer arithmetic. It does represent a pragmatic choice made some time ago, but there is no need to get worked up about it. Human scripts and their representation on computers are quite complex enough; in the grand scheme of things the handling of surrogates in implementations pales in significance. Mark https://plus.google.com/114199149796022210033 — Il meglio è l’inimico del bene — On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller stephan.stil...@gmail.com wrote: Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Wouldn't the clean way be to ensure valid strings (only) when they're built and then make sure that string algorithms (only) preserve well-formedness of input? Perhaps this is how the system grew, but it seems to me that it's yet another legacy of C pointer arithmetic and about convenience of implementation rather than a safety or performance issue. Stephan