Re: If X sorts before Y, then XZ sorts before YZ ... example of where that's not true?

2013-01-07 Thread Christopher Fynn
On 07/01/2013, Costello, Roger L. coste...@mitre.org wrote: Hi Folks, In the book, Unicode Demystified (p. xxii) it says: An English-speaking programmer might assume, for example, that given the three characters X, Y, and Z, that if X sorts before Y, then XZ sorts before

Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-07 Thread Leif Halvard Silli
Doug Ewell, Sun, 6 Jan 2013 20:57:58 -0700: We are pretty much going round and round on this. The bottom line for me is, it would be nice if there were a shorthand way of saying big-endian UTF-16, and many people (including you?) feel that UTF-16BE is that way, but it is not. That term has

Re: Why is endianness relevant when storing data on disks but not when in memory?

2013-01-07 Thread Leif Halvard Silli
Doug Ewell, Sun, 6 Jan 2013 20:57:58 -0700: The bottom line for me is, it would be nice if there were a shorthand way of saying big-endian UTF-16, and many people (including you?) feel that UTF-16BE is that way, but it is not. One could say

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Markus Scherer
Unicode libraries commonly provide functions that take a code point and return a value, for example a property value. Such a function normally accepts the whole range 0..10FFFF (and may even return a default value for out-of-range inputs). Also, we commonly read code points from 16-bit Unicode

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Doug Ewell
Markus Scherer markus dot icu at gmail dot com wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Markus Scherer
On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote: Markus Scherer markus dot icu at gmail dot com wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would
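The behaviour Markus describes (reading code points from a 16-bit string and returning an unpaired surrogate as itself) can be sketched roughly as follows. This is only an illustration, not the code of ICU or any other library; the function name `code_points` is hypothetical.

```python
def code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A lead surrogate followed by a trail surrogate is combined into one
    supplementary code point. An unpaired surrogate is yielded as its own
    value instead of raising an error.
    """
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Well-formed surrogate pair: combine into one code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit or unpaired surrogate: return it unchanged.
            yield u
            i += 1


# 'A', a surrogate pair for U+1F600, a lone lead surrogate, 'B'
units = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042]
print([f"U+{cp:04X}" for cp in code_points(units)])
```

The input above is not well-formed UTF-16 because of the lone 0xD800, yet the loop still produces one value per code point, which is the situation the thread is discussing.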

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is non-conformant *TO* . Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Doug Ewell
You're right, and I stand corrected. I read Markus's post too quickly. Mark Davis ☕ mark at macchiato dot com wrote: But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Philippe Verdy
Well then I don't know why you need a definition of an Unicode 16-bit string. For me it just means exactly the same as 16-bit string, and the encoding in it is not relevant given you can put anything in it without even needing to be conformant to Unicode. So a Java string is exactly the same, a

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Philippe Verdy said: Well then I don't know why you need a definition of an Unicode 16-bit string. For me it just means exactly the same as 16-bit string, and the encoding in it is not relevant given you can put anything in it without even needing to be conformant to Unicode. So a Java string

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Philippe also said: ... Reserving UTF-16 for what the standard discusses as a 16-bit string, except that it should still require UTF-16 conformance (no unpaired surrogates and no non-characters) ... For those following along, conformance to UTF-16 does *NOT* require no non-characters.

Are there Unicode processors?

2013-01-07 Thread Costello, Roger L.
Hi Folks, An XML processor breaks up an XML document into its parts -- here's a start tag, here's element content, here's an end tag, etc. -- and then makes those parts (along with information about each part such as this part is a start tag and this part is element content) available to XML

Re: Are there Unicode processors?

2013-01-07 Thread David Starner
On Mon, Jan 7, 2013 at 2:34 PM, Costello, Roger L. coste...@mitre.org wrote: Are there Unicode processors? That is, are there processors that break up Unicode text into its parts -- here's a character, here's another character, here's still another character, etc. -- and then makes those

Re: Are there Unicode processors?

2013-01-07 Thread Mark Davis ☕
That is not the typical way that Unicode text is processed. Typically whatever OS you are using will supply mechanisms for iterating through any Unicode string, returning each of the code points. It may also offer APIs for returning information about each character (called 'property values', or
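As a rough illustration of that style of processing, here is a sketch that uses Python's standard `unicodedata` module as a stand-in for the OS or library facilities mentioned above. The helper name `describe` is made up for this example.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, name) for each code point."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.category(ch),
         # name() takes a default for code points that have no name.
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for row in describe("a\u00e9"):
    print(row)
```

Iterating over a Python `str` already yields code points, so no separate "processor" step is needed before the properties can be looked up.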

RE: Are there Unicode processors?

2013-01-07 Thread Phillips, Addison
Unicode processor?? If what you're looking for is code that breaks text into grapheme clusters/words/lines/etc., that's called text segmentation and is described in: http://www.unicode.org/reports/tr29/ But you go on to talk about characters and their properties... if you're looking

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Martin J. Dürst
On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
That's not the point (see successive messages). On Mon, Jan 7, 2013 at 4:59 PM, Martin J. Dürst due...@it.aoyama.ac.jp wrote: On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code

Q is a Roman numeral?

2013-01-07 Thread Ben Scarborough
This isn't directly related to Unicode, but I thought this would be a good place to ask. Specifically, I'm curious about figure 14 (Gordon 1982) from WG2 N3218 [http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3218.pdf], which says: Whereas our so-called Arabic numerals are ten in number (0–9), the Roman

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Martin, The kind of situation Markus is talking about is illustrated particularly well in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to this issue: http://www.unicode.org/reports/tr10/#Handline_Illformed When weighting Unicode 16-bit strings for collation, you

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
http://www.unicode.org/reports/tr10/#Handline_Illformed Grrr. http://www.unicode.org/reports/tr10/#Handling_Illformed I seem unable to handle ill-formed spelling today. :( --Ken

RE: Q is a Roman numeral?

2013-01-07 Thread Whistler, Ken
I'm gonna take a wild stab here and assume that this is Q as the medieval Latin abbreviation for quingenti, which usually means 500, but also gets glossed just as a big number, as in milia quingenta thousands upon thousands. Maybe some medieval scribe substituted a Q for |V| (with an overscore

Re: Are there Unicode processors?

2013-01-07 Thread Doug Ewell
Costello, Roger L. wrote: Are there Unicode processors? Bottom line, you need to be more specific about what level of processing you are talking about. As many have said, parsing a byte stream into UTF-{8, 16, 32} characters is everywhere. Converting between normalization forms is a bit
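The two levels mentioned there (decoding a byte stream, then converting between normalization forms) might look like this in Python. It is a minimal sketch using only the standard library; the function name `decode_and_normalize` is an assumption for the example.

```python
import unicodedata


def decode_and_normalize(raw, encoding="utf-8", form="NFC"):
    """Decode bytes to text, then convert to the given normalization form."""
    text = raw.decode(encoding)
    return unicodedata.normalize(form, text)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT, encoded as UTF-8
raw = "e\u0301".encode("utf-8")

nfc = decode_and_normalize(raw, form="NFC")
nfd = decode_and_normalize(raw, form="NFD")
print([f"U+{ord(c):04X}" for c in nfc])
print([f"U+{ord(c):04X}" for c in nfd])
```

Decoding is a built-in string operation, while normalization needs the character database behind `unicodedata`, which is one way to see why the second step is less universally available.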

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Stephan Stiller
Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Wouldn't the clean way be to ensure valid strings (only) when they're

Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis ☕
In practice and by design, treating isolated surrogates the same as reserved code points in processing, and then cleaning up on conversion to UTFs works just fine. It is a tradeoff that is up to the implementation. It has nothing to do with a legacy of C pointer arithmetic. It does represent a
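One possible reading of "cleaning up on conversion to UTFs" is sketched below: isolated surrogates are tolerated in the in-memory string and replaced with U+FFFD only when UTF-8 is produced. The replacement policy and the function name `to_utf8` are assumptions made for illustration, not something the thread specifies.

```python
def to_utf8(text):
    """Encode text as UTF-8, replacing any surrogate code point with U+FFFD.

    A Python str stores code points, so a surrogate that appears in it is
    by definition not part of a combined pair.
    """
    cleaned = "".join(
        "\uFFFD" if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )
    return cleaned.encode("utf-8")


print(to_utf8("a\ud800b"))
```

Internal processing can carry the stray surrogate around like any other code point; only the boundary where a UTF is emitted has to decide what to do with it.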