On 07/01/2013, Costello, Roger L. coste...@mitre.org wrote:
Hi Folks,
In the book, Unicode Demystified (p. xxii) it says:
An English-speaking programmer might assume,
for example, that given the three characters X, Y,
and Z, that if X sorts before Y, then XZ sorts before YZ.
Doug Ewell, Sun, 6 Jan 2013 20:57:58 -0700:
We are pretty much going round and round on this. The bottom line for
me is, it would be nice if there were a shorthand way of saying
big-endian UTF-16, and many people (including you?) feel that
UTF-16BE is that way, but it is not. That term has
Doug Ewell, Sun, 6 Jan 2013 20:57:58 -0700:
The bottom line for me is, it would be nice if there were a
shorthand way of saying big-endian UTF-16, and many people
(including you?) feel that UTF-16BE is that way, but it is not.
One could say
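Doug's distinction shows up concretely in most codec libraries. A minimal Python sketch (the codec names here are Python's, not terminology from the thread): the "utf-16" codec writes a BOM, while "utf-16-be" means big-endian code units with no BOM:

```python
# "utf-16" prepends a BOM; "utf-16-be" is the BOM-less,
# explicitly big-endian serialization. Codec names are Python's.
text = "A"
with_bom = text.encode("utf-16")       # BOM followed by code units
big_endian = text.encode("utf-16-be")  # no BOM, big-endian

print(big_endian)                                  # b'\x00A'
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))  # True: starts with a BOM
```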
Unicode libraries commonly provide functions that take a code point and
return a value, for example a property value. Such a function normally
accepts the whole range 0..10FFFF (and may even return a default value for
out-of-range inputs).
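A minimal sketch of such a function in Python, using the stdlib unicodedata module (the function name and the default value are illustrative, not an API anyone in the thread named):

```python
import unicodedata

def general_category(cp, default="Cn"):
    # Accept the whole code point range 0..0x10FFFF; anything outside
    # gets the default instead of an exception. Illustrative sketch only.
    if 0 <= cp <= 0x10FFFF:
        return unicodedata.category(chr(cp))
    return default

print(general_category(0x41))      # 'Lu' (Letter, uppercase)
print(general_category(0xD800))    # 'Cs' (a surrogate still gets an answer)
print(general_category(0x110000))  # 'Cn' (out of range -> default)
```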
Also, we commonly read code points from 16-bit Unicode
Markus Scherer markus dot icu at gmail dot com wrote:
Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned as themselves and treated as such
(e.g., in collation). That would not be well-formed UTF-16, but it's
generally harmless in text
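The lenient decoding Markus describes can be sketched as follows (a hand-rolled illustration, not ICU's actual code): surrogate pairs combine into one code point, and an unpaired surrogate comes back as itself:

```python
def code_points_from_utf16_units(units):
    # Yield code points from a sequence of 16-bit code units.
    # Surrogate pairs are combined; an unpaired surrogate is yielded
    # as itself, matching the lenient behavior described above.
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u  # BMP code point or unpaired surrogate
            i += 1

# U+0041, U+1F600 (as a surrogate pair), then an unpaired lead surrogate:
units = [0x0041, 0xD83D, 0xDE00, 0xD800]
print([hex(cp) for cp in code_points_from_utf16_units(units)])
# ['0x41', '0x1f600', '0xd800']
```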
On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote:
Markus Scherer markus dot icu at gmail dot com wrote:
Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned as themselves and treated as such
(e.g., in collation). That would
But still non-conformant.
That's incorrect.
The point I was making above is that in order to say that something is
non-conformant, you have to be very clear what it is non-conformant *TO*
.
Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned
You're right, and I stand corrected. I read Markus's post too quickly.
Mark Davis ☕ mark at macchiato dot com wrote:
But still non-conformant.
That's incorrect.
The point I was making above is that in order to say that something is
non-conformant, you have to be very clear what it is
Well then I don't know why you need a definition of a Unicode 16-bit
string. For me it just means exactly the same as 16-bit string, and
the encoding in it is not relevant given you can put anything in it
without even needing to be conformant to Unicode. So a Java string is
exactly the same, a
Philippe Verdy said:
Well then I don't know why you need a definition of a Unicode 16-bit
string. For me it just means exactly the same as 16-bit string, and
the encoding in it is not relevant given you can put anything in it
without even needing to be conformant to Unicode. So a Java string
Philippe also said:
... Reserving UTF-16 for what the standard discusses as a
16-bit string, except that it should still require UTF-16
conformance (no unpaired surrogates and no non-characters) ...
For those following along, conformance to UTF-16 does *NOT* require no
non-characters.
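Ken's correction is easy to demonstrate: UTF-16 well-formedness rejects unpaired surrogates but says nothing against noncharacters. A quick Python check (the codec enforces well-formedness on encode):

```python
# U+FFFE is a noncharacter but still encodable as well-formed UTF-16.
print("\ufffe".encode("utf-16-be"))  # b'\xff\xfe'

# A lone surrogate, by contrast, is ill-formed and rejected.
try:
    "\ud800".encode("utf-16-be")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```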
Hi Folks,
An XML processor breaks up an XML document into its parts -- here's a start
tag, here's element content, here's an end tag, etc. -- and then makes those
parts (along with information about each part such as this part is a start
tag and this part is element content) available to XML
On Mon, Jan 7, 2013 at 2:34 PM, Costello, Roger L. coste...@mitre.org wrote:
Are there Unicode processors?
That is, are there processors that break up Unicode text into its parts --
here's a character, here's another character, here's still another character,
etc. -- and then makes those
That is not the typical way that Unicode text is processed.
Typically whatever OS you are using will supply mechanisms for iterating
through any Unicode string, returning each of the code points. It may also
offer APIs for returning information about each character (called 'property
values', or
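In Python terms (one concrete instance of the kind of API Mark means; other platforms differ), iterating a str yields whole code points, and the stdlib unicodedata module supplies property values:

```python
import unicodedata

for ch in "aé🙂":
    # ord() gives the code point; category() is one property value.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")
```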
Unicode processor??
If what you're looking for is code that breaks text into grapheme
clusters/words/lines/etc., that's called text segmentation and is described
in:
http://www.unicode.org/reports/tr29/
But you go on to talk about characters and their properties... if you're
looking
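As a taste of what UAX #29 segmentation does, here is a deliberately oversimplified sketch in Python: it only merges combining marks into the preceding cluster, whereas the real rules also handle ZWJ, Hangul jamo, emoji sequences, and more:

```python
import unicodedata

def simple_grapheme_clusters(s):
    # Oversimplified segmentation sketch: attach combining marks
    # (categories Mn/Mc/Me) to the preceding cluster. Real UAX #29
    # segmentation has many more rules.
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' + U+0301 COMBINING ACUTE ACCENT forms one cluster, then 'a':
print(simple_grapheme_clusters("e\u0301a"))
```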
On 2013/01/08 3:27, Markus Scherer wrote:
Also, we commonly read code points from 16-bit Unicode strings, and
unpaired surrogates are returned as themselves and treated as such (e.g.,
in collation). That would not be well-formed UTF-16, but it's generally
harmless in text processing.
Things
That's not the point (see successive messages).
Mark https://plus.google.com/114199149796022210033
— Il meglio è l’inimico del bene — (The best is the enemy of the good)
On Mon, Jan 7, 2013 at 4:59 PM, Martin J. Dürst due...@it.aoyama.ac.jpwrote:
On 2013/01/08 3:27, Markus Scherer wrote:
Also, we commonly read code
This isn't directly related to Unicode, but I thought this would be a
good place to ask.
Specifically, I'm curious about figure 14 (Gordon 1982) from WG2 N3218
[http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3218.pdf], which says:
Whereas our so-called Arabic numerals
are ten in number (0–9), the Roman
Martin,
The kind of situation Markus is talking about is illustrated particularly well
in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to
this issue:
http://www.unicode.org/reports/tr10/#Handline_Illformed
When weighting Unicode 16-bit strings for collation, you
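For reference, the implicit primary weight that UTS #10 assigns to code points with no explicit collation element (the treatment the section above extends to unpaired surrogates) can be sketched as follows. Note that Han code points use different base constants (0xFB40, 0xFB80); this shows only the catch-all case:

```python
def implicit_primary_weight(cp):
    # Catch-all implicit weight from UTS #10: two 16-bit primary words.
    # (Han ranges use other base constants; omitted here.)
    aaaa = 0xFBC0 + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb

# An unpaired lead surrogate weighted like an unassigned code point:
print(tuple(hex(w) for w in implicit_primary_weight(0xD800)))
# ('0xfbc1', '0xd800')
```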
http://www.unicode.org/reports/tr10/#Handline_Illformed
Grrr.
http://www.unicode.org/reports/tr10/#Handling_Illformed
I seem unable to handle ill-formed spelling today. :(
--Ken
I'm gonna take a wild stab here and assume that this is Q as the medieval
Latin abbreviation for quingenti, which usually means 500, but also gets
glossed just as a big number, as in "milia quingenta" (thousands upon
thousands). Maybe some medieval scribe substituted a Q for |V| (with an
overscore
Costello, Roger L. wrote:
Are there Unicode processors?
Bottom line, you need to be more specific about what level of
processing you are talking about. As many have said, parsing a byte
stream into UTF-{8, 16, 32} characters is everywhere. Converting between
normalization forms is a bit
Things like this are called garbage in, garbage-out (GIGO). It may be
harmless, or it may hurt you later.
So in this kind of a case, what we are actually dealing with is: garbage in,
principled, correct results out. ;-)
Wouldn't the clean way be to ensure valid strings (only) when they're
In practice and by design, treating isolated surrogates the same as
reserved code points in processing, and then cleaning up on conversion to
UTFs works just fine. It is a tradeoff that is up to the implementation.
It has nothing to do with a legacy of C pointer arithmetic. It does
represent a
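The "clean up on conversion to UTFs" step can be sketched like this in Python (where a str can carry lone surrogates, e.g. via the surrogateescape error handler, and any surrogate code point in a str is by definition unpaired):

```python
def sanitize_for_utf(s):
    # Replace any surrogate code point with U+FFFD before conversion
    # to a UTF, per the "clean up on conversion" approach above.
    return "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in s
    )

dirty = "ok" + "\ud800" + "!"
print(sanitize_for_utf(dirty).encode("utf-8"))  # b'ok\xef\xbf\xbd!'
```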