RE: Terms for rotations

2014-11-10 Thread Whistler, Ken
http://www.unicode.org/L2/L2012/12321-n4342-signwriting.pdf That should give you some ideas about possible alternative approaches for the material you are dealing with. --Ken Could the characters SWR2 to SWR8 be applied to chess symbols or should new rotation modifiers be created

RE: Terms for rotations

2014-11-10 Thread Whistler, Ken
Look at this picture: http://www.permisecole.com/code-route/priorites/faux-carrefour-a-sens-giratoire.jpg Imagine you sit in this car and you want to turn RIGHT. What will you do? Will you turn the driving wheel clockwise or counterclockwise? And now imagine that you are motoring in a 1904

RE: Terms for rotations

2014-11-10 Thread Whistler, Ken
WIDDERSHINS is shorter than COUNTERCLOCKWISE, but is not exactly a common term, especially in technical English. Aye, but laddie, then we'd have to use DEASIL for CLOCKWISE! And we'd have wiccans after us to spell it DEOSIL instead. ;-) --Ken

RE: Terms for rotations

2014-11-07 Thread Whistler, Ken
Garth Wallace asked: I'm currently working towards a proposal to encode a set of symbols used in fairy chess and chess variants, and I have a question about naming conventions. Several of the symbols are rotations of already encoded symbols. ... It's even more unclear when it comes to

RE: Code charts and code points

2014-10-24 Thread Whistler, Ken
I think it is imaginable that someone wants to copy a block of characters from the code charts, as a handy way of getting them for inspection, e.g. for testing how some particular software renders them using some particular font(s). I would expect some confusion then if you had partly got all

RE: fonts for U7.0 scripts

2014-10-24 Thread Whistler, Ken
Tom Gewecke wondered: it seems that you would need permission to copy the glyph. I wonder if that is necessary. To follow on from Peter Constable's response, it comes down to the actual scenario at hand and precisely what one means by copy the glyph. Scenario 1 I want to use an example

RE: Question about a Normalization test

2014-10-23 Thread Whistler, Ken
Aaron Cannon asked: Hi all, from the latest version of the standard, on line 16977 of the normalization tests, I am a bit confused by the NFC form. It appears incorrect to me. Here's the line, sans comment: 0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305
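The test line quoted above can be checked mechanically. What follows is only a minimal sketch, assuming Python's standard unicodedata module (its data version depends on the interpreter); the helper name hex_points is made up for illustration.

```python
import unicodedata

# Source column of the quoted test line, written as code points.
source = "\u0061\u0305\u0315\u0300\u05AE\u0062"
# NFC column of the same line (the second field).
expected_nfc = "\u0061\u05AE\u0305\u0300\u0315\u0062"


def hex_points(text):
    """Render a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


nfc = unicodedata.normalize("NFC", source)

# The canonical combining class of each character drives the reordering.
for ch in source:
    print(f"{ord(ch):04X} ccc={unicodedata.combining(ch)}")
print(hex_points(nfc))
```

The marks are only reordered by combining class; nothing composes with the base letter, which is why the NFC column still has six code points.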

RE: Limits in UBA

2014-10-22 Thread Whistler, Ken
Eli, Embeddings are common in generated text. The guiding principle, is seemingly, when in doubt wrap the string in an embedding. At the UTC, we heard, that this can lead to very deep stacks - but I've never actually seen one with more than 63 levels - but that is not my topic here. I'd

RE: Limits in UBA

2014-10-22 Thread Whistler, Ken
Eli, I think you are correct that the BidiCharacterTest.txt data currently does not go beyond 3 nesting levels for testing the BPA part of UBA. I agree with Andrew that that is a reasonable guide to the normal limit of meaningful bracket embeddings one might find in text. However, I don't think it

RE: Bidi Parenthesis Algorithm and BidiCharacterTest.txt

2014-10-15 Thread Whistler, Ken
I disagree that this makes N0 a recursive rule. It is a rule with repeatedly applicable subparts. And like nearly all the rules in the UBA (except ones which explicitly state that they apply to *original* Bidi_Class values, which thus have to be stored across the life of the processing

RE: Bidi Parenthesis Algorithm and BidiCharacterTest.txt

2014-10-14 Thread Whistler, Ken
Eli asked in response to Andrew: · Since 2-17 is now R and not neutral, the resolution of 3-9 is R because the check for context finds the opening parenthesis at 2 (now R) before the a at 1. Therefore 2-17 is R under N0c2. But there's nothing about this in the UAX#9 language!

RE: What happened to...?

2014-09-19 Thread Whistler, Ken
Michael, “Declines to take action” is pretty thin. A proposal which is declined by the UTC doesn't automatically create an obligation to write an extended dissertation explaining the rationale and putting that rationale on record. It might be one thing if there were a lot of controversy

RE: Request for Information

2014-07-24 Thread Whistler, Ken
Fantasai asked: I would like to request that Unicode include, for each writing system it encodes, some information on how it might justify. Following up on the comment and examples provided by Richard Wordingham, I'd like to emphasize a relevant point: Scripts may be used for *multiple*

RE: Noto adds CJK, plus new user-facing website

2014-07-16 Thread Whistler, Ken
Andrew, Everybody recognizes the potential risks of getting out too far over one's skis in implementations, but this particular one seems a relatively small risk. Seldom (if ever?) has a NB objected in ballot to these small repertoire additions that have periodically been tacked on at the end of

RE: Apparent discrepancy between FAQ and Age.txt

2014-06-10 Thread Whistler, Ken
Karl Williamson noted: The FAQ http://www.unicode.org/faq/private_use.html#sentinels says that the last 2 code points on the planes except BMP were made noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these. The *concept* of noncharacter was not invented until Unicode 3.1, so it
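For readers who want the set spelled out: a small sketch of a noncharacter test, covering the last two code points of every plane (the ones discussed above) plus the range U+FDD0..U+FDEF. The function name is_noncharacter is an illustrative choice, not something taken from the FAQ or from DerivedAge.txt.

```python
def is_noncharacter(cp):
    """Return True if the code point cp is one of the noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, or the last two code points of any plane (xxFFFE, xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))
```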

RE: Swift

2014-06-05 Thread Whistler, Ken
Hmmm. Any programming language project that derives from someone who describes himself as a “polyhistor”, which claims to be polymorphic and pasigraphic and multi-lingual and orthogonal and polysynthetic, which draws its inspiration from the theory of “Natural Language Metasemantics”, and which

RE: UTF-16 Encoding Scheme and U+FFFE

2014-06-03 Thread Whistler, Ken
You cannot even be very confident of not finding actual ill-formed UTF-16, like unpaired surrogates, in an external file, let alone noncharacters. As for the noncharacters, take a look at the collation test files that we distribute with each version of UCA. The test data includes test strings

RE: Indic Syllabic Categories

2014-05-12 Thread Whistler, Ken
Richard Wordingham asked: Is the provisional property 'Indic_Syllabic_Category' defined by anything deeper than the UCD file IndicSyllabicCategory itself? Basically, no. It simply gathers together information scattered about in the core spec and elsewhere about claims regarding what all the

RE: Do `Grapheme_Extend` characters only apply to `Grapheme_Base`?

2014-04-24 Thread Whistler, Ken
On 23 Apr 2014, at 22:16, Mathias Bynens math...@qiwi.be wrote: Let’s say I’m writing a program that strips combining characters and grapheme extenders from an input string. For combining marks, I’m looking for any non-combining marks (e.g. `a`) followed by one or more combining marks
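A rough sketch of the kind of program described in the question, assuming Python's standard unicodedata module. It removes only characters whose General_Category is a mark (Mn, Mc, Me); the standard library does not expose Grapheme_Extend, so extenders outside those categories (U+200C ZERO WIDTH NON-JOINER, for one) are not handled here. The name strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Decompose canonically, then drop every character of category M*."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(strip_marks("a\u0301b"))
```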

Bidi Brackets for Dummies

2014-04-24 Thread Whistler, Ken
Given the incredible level of interest shown on this list during the last week, I am glad that I can finally announce the publication of Bidi Brackets for Dummies: http://www.unicode.org/notes/tr39/ I had wanted to publish that several weeks ago, but unfortunately, publication was held up for

RE: ID_Start, ID_Continue, and stability extensions

2014-04-23 Thread Whistler, Ken
Mathias, What are the “stability extensions” this document refers to? Here are the code points that match the respective property according to `DerivedCoreProperties.txt`, yet don’t match these properties if you’re adding/removing the categories manually based on the property definition in

RE: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-21 Thread Whistler, Ken
Ilya noted: [Below, I completely ignore BIDI part of the specification, and concentrate ONLY on the parens match. I do not understand why this question is interlaced with BIDI determination; I trust that it is.] Actually, it is, because the bracket-matching is really only

RE: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector?

2014-04-02 Thread Whistler, Ken
Ilya, U+23AF is *definitely* not a variation selector at all. It is part of a set of bracket pieces (and other graphic pieces) in the range U+239B..U+23B1. See discussion of the topic at: http://www.unicode.org/forum/viewtopic.php?f=35&t=206 See also Section 2.13 of UTR #25:

RE: 23AF HORIZONTAL LINE EXTENSION: glyph or variation selector?

2014-04-02 Thread Whistler, Ken
Yucca noted: These glyphic pieces of symbols are only relevant and useful in the context of mathematical typesetting programs like TeX. I’m not sure whether TeX uses such characters at all. TeX is oriented towards typesetting glyphs, often not caring that much about abstract characters.

RE: Bidi reordering of soft hyphen

2014-04-01 Thread Whistler, Ken
I don’t think the answer is directly deduced from UAX #9, because it involves deciding where to insert a visible hyphen for display. However, I think the correct answer here is your number two guess, i.e. (in a RTL paragraph context): -car SI TORRAC A way to think about this, rather than

RE: Bidi reordering of soft hyphen

2014-04-01 Thread Whistler, Ken
Richard Wordingham noted: As U+2010 HYPHEN would result in text like 'car-', in an English influenced context I would also go with 'car-'. That's always a possibility, I suppose, but I'm not sure what English influenced context means here. The examples I just gave were for a RTL

RE: Bidi reordering of soft hyphen

2014-04-01 Thread Whistler, Ken
Is it legitimate to truncate the context to a single line? The BiDi algorithm is attempting to interpret unlabelled text as embedded text (it's not an arbitrary dance), and in just one line there is no indicator of whether the hyphen is part of the LTR text embedded in RTL text. For

RE: Editing Sinhala and Similar Scripts

2014-03-19 Thread Whistler, Ken
And I think you need to distinguish between *proximate* behavior in an editor and editing behavior in general. Once a user enters editing mode, the expectation that we (the software community writing text editors) have built, in interaction with users, is that within reason, something that you

RE: Romanized Singhala got great reception in Sri Lanka

2014-03-17 Thread Whistler, Ken
Well, I actually don’t see. I took a look at the Sinhala you inserted in this email. I cannot tell what you did at your input end (about “inserted all joiners”), but there are no actual joiners in the text itself. It displayed just fine in my email (including the correct conditional formatting of

RE: Names for control characters (Was: (in 6429) in allkeys.txt)

2014-03-12 Thread Whistler, Ken
Please be very careful here. Having a non-empty value in field 1 of UnicodeData.txt is *not* the same as having a Unicode name. See: http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207 for the gory details. The Unicode name is formally defined in terms of the Name property, which
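The distinction shows up directly in libraries that expose the Name property. A minimal sketch, assuming Python's standard unicodedata module; the wrapper name name_or_none is made up here.

```python
import unicodedata


def name_or_none(ch):
    """Return the Name property value of ch, or None if it has no name."""
    return unicodedata.name(ch, None)


# U+0000 carries "<control>" in field 1 of UnicodeData.txt,
# but that label is not a value of the Name property.
print(name_or_none("\x00"))
print(name_or_none("A"))
```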

RE: Names for control characters

2014-03-12 Thread Whistler, Ken
Per continued: I know it's not a name. My question was *why* control characters don't *have* names like CONTROL CHARACTER NULL CONTROL CHARACTER START OF HEADING CONTROL CHARACTER START OF TEXT etc. It would be so obvious to have it like that, so I assume there is some

RE: (in 6429) in allkeys.txt

2014-03-11 Thread Whistler, Ken
Per asked: In the DUCET file allkeys.txt, http://www.unicode.org/Public/UCA/latest/allkeys.txt , there is (in 6429) as a comment for some characters. I first didn't understand why, but then I realized those are control characters that are part of ISO/IEC 6429. Why is that pointed out

RE: (in 6429) in allkeys.txt

2014-03-11 Thread Whistler, Ken
I agree that a clarification in the text would be better than a comment in allkeys.txt. But I also think just changing (in 6429) to (in ISO 6429) would be enough. (Strange as it might seem for list regulars not everyone immediately makes the right association from this four-digit number.

RE: Transforming BidiTest.txt to the format of BidiCharacterTest.txt

2014-02-12 Thread Whistler, Ken
Eric, The C version of the bidiref code does that, in part. See the function br_ParseFileFormatB in brinput.c. http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/ It doesn't actually *transform* the BidiTest.txt file to output the other format, but it parses the input and then

RE: Engmagate?

2013-12-13 Thread Whistler, Ken
Well, inconceivable? No. Inadvisable? yes. First of all, such “comments” are not actually “comments”—they are the result of a fairly cumbersome and drawn-out process of adding *normative* standardized variation sequences to the standard. Second – although this is a nit – FE0E and FE0F would

RE: Code point vs. scalar value

2013-09-19 Thread Whistler, Ken
Stephan Stiller seems unconvinced by the various attempts to explain the situation. Perhaps an authoritative explanation of the textual history might assist. Stephan demands an answer: I want to know why the Glossary claims that surrogate code points are [r]eserved for use by UTF-16. Reason
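As a reminder of what the two terms cover, here is a short sketch: every integer in 0..10FFFF is a code point, and the scalar values are the code points that are not surrogates. The function names are illustrative only.

```python
def is_surrogate(cp):
    """True for surrogate code points, U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_scalar_value(cp):
    """True for code points that are Unicode scalar values."""
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp)


code_point_count = 0x110000
scalar_count = sum(1 for cp in range(code_point_count) if is_scalar_value(cp))
print(code_point_count, scalar_count)
```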

Origin of Ellipsis (was: RE: Empty set)

2013-09-13 Thread Whistler, Ken
Stephan Stiller noted: Maybe ... and the origin of the single-glyph ellipsis remains a mystery to me. As Philippe surmised, it is a compatibility character, originally included in the Unicode 1.0 repertoire for cross-mapping to existing legacy encodings: Code Page 932: 0x81 0x64 Code Page
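The compatibility status of U+2026 is visible in its decomposition mapping. A small sketch, assuming Python's standard unicodedata module:

```python
import unicodedata

ellipsis = "\u2026"  # HORIZONTAL ELLIPSIS

# A "<compat>" tag marks a compatibility (not canonical) decomposition.
decomposition = unicodedata.decomposition(ellipsis)
nfc = unicodedata.normalize("NFC", ellipsis)    # canonical forms keep it
nfkc = unicodedata.normalize("NFKC", ellipsis)  # compatibility forms expand it

print(decomposition)
print(nfc == ellipsis, nfkc)
```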

RE: Origin of Ellipsis (was: RE: Empty set)

2013-09-13 Thread Whistler, Ken
I wrote: As Philippe surmised, it is a compatibility character, originally included in the Unicode 1.0 repertoire for cross-mapping to existing legacy encodings: Code Page 932: 0x81 0x64 Code Page 949: 0xA1 0xA6 Asmus responded: which just pushes that question forward in time...

Case Table Compression Assumptions (was: RE: Posting Links to Ballots (was: RE: Why blackletter letters?))

2013-09-13 Thread Whistler, Ken
Steffen, FYI, Unicode 7.0, when it comes out, will have another entire bicameral (casing) script added to it: Warang Citi. And when Old Hungarian is finally published, at some point after Unicode 7.0, that will be *another* bicameral script added. It is unlikely that those two will be the last.

RE: Why blackletter letters?

2013-09-10 Thread Whistler, Ken
Yucca asked: As far as I can see, the document summarizes an agreement in an ad hoc meeting. So it’s not late at all to raise objections, is it? It is way, way, waaay too late to raise objections for these two. Those characters are *published* in ISO/IEC 10646:2011 Amendment 1. They were

RE: ASCII control codes in sequences of multibyte character sets

2013-08-30 Thread Whistler, Ken
Steffen, Sure. You encounter this problem for any multi-byte EBCDIC-based character encoding. In fact for any single-byte EBCDIC-based character encoding, as well. The EBCDIC control that corresponds to a line feed is either 0x15 or 0x25, depending on revisions. But you wouldn't ordinarily run

RE: What to backup after corruption of code units?

2013-08-29 Thread Whistler, Ken
The text in question is not exactly new to Unicode 6.2, probably goes back to around the time UTF-8 and UTF-16 were added over a decade ago. Getting a single question on this passage after all these years would seem to indicate that confusion isn't exactly rampant. Just to address the

RE: Just an observation

2013-08-06 Thread Whistler, Ken
Steffen Daode Nurpmeso continued: Hmm. To me, this raises the question why these constraints were introduced at all. Imho either one adds constraints due to solid considerations, and enforces them after some period of backward compatibility, or there simply should be no constraints. What

RE: polytonic Greek: diacritics above long vowels ᾱ, ῑ, ῡ

2013-08-05 Thread Whistler, Ken
Poring back over this voluminous thread to Stephan Stiller's original question: If one wants to indicate vowel length for the length-ambiguous vowels α, ι, υ in Ancient Greek, one writes ᾱ, ῑ, ῡ. Is there a reason for why there are no diacritic-precomposed characters? I guess it's because

RE: _Unicode_code_page_and_?.net

2013-08-05 Thread Whistler, Ken
On 7/30/2013 3:27 PM, Asmus Freytag wrote: architectures that depended on swapping character sets (code pages) in mid stream I thought systems were usually married to a particular code page. I'm wondering where (historically) you'd actually change to a different code page

RE: Just an observation

2013-08-05 Thread Whistler, Ken
Steffen Daode Nurpmeso observed: Hello, in UAX #44 i read Simple_Titlecase_Mapping ... Note: If this field is null, then the Simple_Titlecase_Mapping is the same as the Simple_Uppercase_Mapping for this character. So a parser has to be aware of this, automatically falling back
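The fallback described in that UAX #44 note takes only a few lines in a parser. This is a sketch, not a complete UnicodeData.txt reader: it assumes the semicolon-separated layout with the simple uppercase, lowercase and titlecase mappings in fields 12, 13 and 14 (counting the code point as field 0). The function name and the second sample line, which has field 14 blanked on purpose, are made up for illustration.

```python
def simple_case_mappings(line):
    """Parse one UnicodeData.txt line into (cp, upper, lower, title)."""
    fields = line.rstrip("\n").split(";")

    def parse(field):
        # An empty field means "no mapping recorded".
        return int(field, 16) if field else None

    code_point = int(fields[0], 16)
    upper, lower, title = (parse(fields[i]) for i in (12, 13, 14))
    if title is None:
        # Null titlecase field: fall back to the simple uppercase mapping.
        title = upper
    return code_point, upper, lower, title


real_line = ("01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;"
             "<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5")
# Hypothetical line: titlecase field left empty to exercise the fallback.
blank_title_line = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;"

print(simple_case_mappings(real_line))
print(simple_case_mappings(blank_title_line))
```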

RE: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Whistler, Ken
Suppose that these hex bytes: C3 83 C2 B1 show up in a message and the message contains no hint what its encoding is. Perhaps it is 8859-1, in which case the message consists of four 1-byte characters: C3 = Ã 83 = the “no break here” character C2 = Â B1 = ± Perhaps it
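The competing readings of those four bytes are easy to reproduce. A minimal sketch in Python; the last step (undoing a possible double encoding) is one further reading offered here as an assumption, since the quoted text is cut off before it lists its own alternatives.

```python
raw = bytes.fromhex("C383C2B1")

# Reading 1: ISO 8859-1, four 1-byte characters.
as_latin1 = raw.decode("latin-1")

# Reading 2: UTF-8, two 2-byte sequences.
as_utf8 = raw.decode("utf-8")

# Reading 3 (assumption): UTF-8 bytes that were mistaken for Latin-1
# and then encoded to UTF-8 a second time.
undone = as_utf8.encode("latin-1").decode("utf-8")

for label, text in (("latin-1", as_latin1), ("utf-8", as_utf8), ("double", undone)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

All three decodes succeed, which is the point: the bytes alone do not settle the question.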

Scalability of ScriptExtensions (was: RE: Borrowed Thai Punctuation in Tai Tham Text)

2013-07-08 Thread Whistler, Ken
Richard Wordingham asked: How many examples do I need to collect to add Tai Tham to the script extensions property for ... ? IMO, a couple well-documented examples ought to suffice. But, this query raises a couple further questions for me regarding the scalability and maintenance of

RE: Suggestion for new dingbats/symbols

2013-05-30 Thread Whistler, Ken
How to write a mail like this: When you arrive at Madrid airport, follow the sign that looks like this: [?] Even if the font library supports all needed symbols, it will be easier to send a photo than to choose the sign from a huge Unicode symbols list. Yep. This discussion about signs is

RE: UTC Document Register Now Public

2013-04-19 Thread Whistler, Ken
William J.G. Overington asked: Suppose that a member of the public sends a document that seeks discussion by the Unicode Technical Committee about whether the scope of what Unicode encodes should be extended in some particular regard, with the member of the public writing about why he or she

Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-19 Thread Whistler, Ken
However, now that I've got your hopes up on procedural grounds... Getting on to the particulars: I do have two particular reasons for asking. 2. My research. There is a document entitled locse027_four_simulations.pdf available from the following forum post.

RE: Processing Digit Variants

2013-03-18 Thread Whistler, Ken
Richard Wordingham wrote: European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be used with variation selectors. As their primary purpose is for use with U+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail to recognise strings of digits with variation selectors as

RE: Size of Weights in Unicode Collation Algorithm

2013-03-14 Thread Whistler, Ken
Richard Wordingham wrote: Actually, there is a subtle and nasty difference, but probably one that will very rarely strike practical use. Its most obvious manifestation is in the application of the UCA parametric tailoring topVariable=u2FD5. U+2FD5 KANGXI RADICAL FLUTE is the last symbol in

RE: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Whistler, Ken
Richard Wordingham wrote: One of the changes from Version 6.1.0 to 6.2.0 of the UCA (UTS#10) was to change weights from being 16 bits to just being general non-negative integers. Was this just to accommodate the 4th weight in DUCET (scheduled for deletion in Version 6.3.0), or is it

RE: Size of Weights in Unicode Collation Algorithm

2013-03-13 Thread Whistler, Ken
Richard Wordingham wrote: It loosened up the spec, so that the spec itself didn't seem to be requiring that each of the first 3 levels had to be expressed with a full 16 bits in any collation element table. I don't read it that way. But it did allow the 4th weight to go up to 10!

RE: Does HYPHEN BULLET have synonyms?

2013-02-22 Thread Whistler, Ken
Jukka said: The comments at the start of NamesList.txt say that it is “semi-automatically derived from UnicodeData.txt”, but the information you are referring to has actually been picked up from the code charts. They contain both informative alias names and cross references. The

RE: New Canonical Decompositions to Non-Starters

2013-02-18 Thread Whistler, Ken
Well, it isn't prohibited, so I guess you will need to be forever vigilant in view of the possibility that somebody might get it in their head to encode some combining mark that isn't already accounted for in Tibetan *and* that they would simultaneously insist that a precomposed form of that

RE: FCD and Collation

2013-02-11 Thread Whistler, Ken
Does anyone feel up to rigorously justifying revisions to the concepts and algorithms of FCD and canonical closure? Occasionally one will encounter cases where the canonical closure is infinite - in these cases, normalisation will be necessary regardless of the outcome of the FCD check.

FW: Why are the low surrogates numerically larger than the high surrogates?

2013-01-23 Thread Whistler, Ken
-Original Message- From: ken.whist...@sap.com Sent: Wednesday, January 23, 2013 10:48 AM To: 'Costello, Roger L.' Subject: RE: Why are the low surrogates numerically larger than the high surrogates? Why are the low surrogates numerically larger than the high surrogates? That is,
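For reference, the arithmetic behind the two ranges. A sketch of the standard UTF-16 split of a supplementary code point into a high surrogate (D800..DBFF) and a low surrogate (DC00..DFFF); the function name is illustrative.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points have surrogate pairs")
    offset = cp - 0x10000             # 20 significant bits
    high = 0xD800 + (offset >> 10)    # top 10 bits, comes first in the stream
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits, comes second
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
```

"High" and "low" refer to which bits of the offset each unit carries, not to the numeric order of the two ranges.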

RE: RLI and bdi, and how to get an update of changes

2013-01-15 Thread Whistler, Ken
what does that different from the RLI U+2067/PDF U+2068? if it is the same, can we use U+2066 in HTML replacing bdi? Code points 2066, 2067, and 2068 are unassigned. I presume you mean U+202B RIGHT-TO-LEFT EMBEDDING (RLE) and U+202C POP DIRECTIONAL FORMATTING. No, actually, I think

RE: What does it mean to not be a valid string in Unicode?

2013-01-08 Thread Whistler, Ken
Sorry, but I have to disagree here. If a list of strings contains items with lone surrogates (garbage), then sorting them doesn't make the garbage go away, even if the items may be sorted in correct order according to some criterion. Well, yeah, I wasn't claiming that the principled, correct

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Philippe Verdy said: Well then I don't know why you need a definition of an Unicode 16-bit string. For me it just means exactly the same as 16-bit string, and the encoding in it is not relevant given you can put anything in it without even needing to be conformant to Unicode. So a Java string

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Philippe also said: ... Reserving UTF-16 for what the standard discusses as a 16-bit string, except that it should still require UTF-16 conformance (no unpaired surrogates and no non-characters) ... For those following along, conformance to UTF-16 does *NOT* require no non-characters.
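The difference is easy to observe with a strict encoder. A small sketch using Python's built-in utf-16-le codec; it shows what that one codec does and is not a conformance test. The helper name is made up.

```python
def encodes_as_utf16(text):
    """True if Python's strict UTF-16 encoder accepts the string."""
    try:
        text.encode("utf-16-le")
    except UnicodeEncodeError:
        return False
    return True


print(encodes_as_utf16("a\uFFFFb"))  # noncharacter: well-formed UTF-16
print(encodes_as_utf16("a\uD800b"))  # unpaired surrogate: ill-formed
```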

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
Martin, The kind of situation Markus is talking about is illustrated particularly well in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to this issue: http://www.unicode.org/reports/tr10/#Handline_Illformed When weighting Unicode 16-bit strings for collation, you

RE: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Whistler, Ken
http://www.unicode.org/reports/tr10/#Handline_Illformed Grrr. http://www.unicode.org/reports/tr10/#Handling_Illformed I seem unable to handle ill-formed spelling today. :( --Ken

RE: Q is a Roman numeral?

2013-01-07 Thread Whistler, Ken
I'm gonna take a wild stab here and assume that this is Q as the medieval Latin abbreviation for quingenti, which usually means 500, but also gets glossed just as a big number, as in milia quingenta thousands upon thousands. Maybe some medieval scribe substituted a Q for |V| (with an overscore

RE: holes (unassigned code points) in the code charts

2013-01-04 Thread Whistler, Ken
Stephan Stiller continued: Occasionally the question is asked how many characters Unicode has. This question has an answer in section D.1 of the Unicode Standard. I suspect, however, that once in a while the motivation for asking this question is to find out how much of Unicode has been used

RE: holes (unassigned code points) in the code charts

2013-01-04 Thread Whistler, Ken
Whoops! http://www.unicode.org/alloc/CurrentAllocation.html --Ken The editors maintain some statistical information relevant to this fun question at: http://www.unicode.org/alloc/CurrentAllocaiton.html

RE: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Whistler, Ken
Yannis' use of the terminology "not ... a valid string in Unicode" is a little confusing there. A Unicode string with the sequence, say, U+0300, U+0061 (a combining grave mark, followed by a), is valid Unicode in the sense that it just consists of two Unicode characters in a sequence. It is

RE: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Whistler, Ken
One of the reasons why the Unicode Standard avoids the term “valid string” is that it immediately begs the question, valid *for what*? The Unicode string U+0061, U+, U+0062 is just a sequence of 3 Unicode characters. It is valid *for* use in internal processing, because for my own

RE: Jamo_Short_Name

2013-01-02 Thread Whistler, Ken
André Schappo asked: Been looking at http://www.unicode.org/Public/UNIDATA/Jamo.txt There appears to be 2 different romanizations at play in the file? One for the short name and another for the full name eg 1100; G # HANGUL CHOSEONG KIYEOK I have searched unicode.org but cannot find

RE: locale-aware string comparisons

2012-12-31 Thread Whistler, Ken
Well, in answering the question which was actually posed here: 1. ISO/IEC 10646 has absolutely nothing to say about this issue, because 10646 does not define case mapping at all. 2. The Unicode Standard *does* define case mapping, of course, as well as case folding. The relevant details are in
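To make the mapping/folding distinction concrete, a sketch using Python's built-in str methods, which apply default (not locale-tailored) case operations; so nothing here covers language-specific behaviour such as Turkish dotless i. The helper name caseless_equal is illustrative.

```python
def caseless_equal(a, b):
    """Compare two strings using default full case folding."""
    return a.casefold() == b.casefold()


# Case mapping changes the text; case folding is meant for comparison.
print("Stra\u00DFe".upper())
print(caseless_equal("Stra\u00DFe", "STRASSE"))
```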

RE: UCA and Russian letter Ё

2012-12-26 Thread Whistler, Ken
The UCA algorithm itself has no opinion on this issue. It is simply a specification of *how* to compare strings at multiple levels, given a multi-level collation weight table. The UCA *does* have a default behavior, of course, based on the DUCET table. And the DUCET table puts all Unicode

RE: UCA and Russian letter Ё

2012-12-26 Thread Whistler, Ken
Leo asked: My question was narrower: assuming that the strings being compared are words, could it be supported without any markup? ... where it refers to conditional weighting based on the (identified) word boundary. And the answer to that is no, unless the word boundary was explicitly

RE: UCA and Russian letter Ё

2012-12-21 Thread Whistler, Ken
Leo Broukhis said: Granted, not yet, but by itself the argument is invalid. Unicode collation rules are descriptive; I'm not sure what you mean by that. UTS #10 is a *specification* of an algorithm, with various options for tailoring and parameterization which make it possible to

RE: Question about normalization tests

2012-12-10 Thread Whistler, Ken
Your misunderstanding is at the highlighted statement below. Actually 0300 *is* blocked from 0061 in this sequence, because it is preceded by a character with the same canonical combining class (i.e. U+0305, ccc=230). A blocking context is the preceding combining character either having ccc=0
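The blocking effect can be seen by comparing the sequence with and without the intervening U+0305. A minimal sketch, assuming Python's standard unicodedata module:

```python
import unicodedata

# Both marks have the same canonical combining class.
ccc_0305 = unicodedata.combining("\u0305")
ccc_0300 = unicodedata.combining("\u0300")

# No intervening mark: 0061 0300 composes to U+00E0.
composed = unicodedata.normalize("NFC", "\u0061\u0300")

# 0305 sits between them with the same ccc, so 0300 is blocked from 0061.
blocked = unicodedata.normalize("NFC", "\u0061\u0305\u0300")

print(ccc_0305, ccc_0300)
print([f"{ord(ch):04X}" for ch in composed])
print([f"{ord(ch):04X}" for ch in blocked])
```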

What is happening with hieroglyphs (was: RE: Why 17 planes?)

2012-11-28 Thread Whistler, Ken
Philippe is (apparently) referring to higher-level protocols for markup of hieroglyphic text. See, e.g., Table 14-10 and Figure 14-2, p. 489 in Section 14.18, Egyptian Hieroglyphs in TUS 6.2: http://www.unicode.org/versions/Unicode6.2.0/ch14.pdf Similar kinds of higher-level protocols are

RE: Why 17 planes? (was: Re: Why 11 planes?)

2012-11-27 Thread Whistler, Ken
There isn't an actual problem here which needs a solution, satisfactory or otherwise. The persistence of the "17 planes may not be enough" meme on this list is an interesting phenomenon in itself, but has no practical impact on any of the actual ongoing work on maintenance of the encoding

RE: StandardizedVariants.txt error?

2012-11-26 Thread Whistler, Ken
Actually, I think the omission here is the word canonical. In other words, Section 16.4 should probably read: The base character in a variation sequence is never a combining character or a *canonical* decomposable character. Note that with this addition, StandardizedVariants.txt poses no

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
The first 256 characters of the Unicode Standard *are* compatible with ISO/IEC 8859-1 (Latin-1), but you need to distinguish what happens for the graphic characters from what happens for the control codes. ISO 8859-1 defines *graphic* characters in the ranges 0x20..0x7E, 0xA0..0xFF. Those are

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
Actually, what Buck really needs is Section 16.1 Control Codes: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf That explains the situation for the *non* graphic characters in the range U+..U+00FF, which is the source of the concern for Buck's skeptical workmates, I'm sure. --Ken

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
A IANA-registered character *map* is a very different animal from a character encoding standard per se. The actual character encoding standard, ISO/IEC 8859-1:1998 does not define the C0 and C1 control codes (and never will). That was what I was quoting from. A mapping table, on the other

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
No, Unicode doesn't. But yes, it *does* follow that decoding C0/C1 control codes produces a Unicode code point of equal value. RTFM. TUS 6.2, p. 544: There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022
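In code, such a decoder amounts to an identity mapping from byte value to code point. A sketch using Python's built-in latin-1 codec, which maps all 256 byte values, control codes included:

```python
all_bytes = bytes(range(256))

# Each byte 0xNN decodes to the code point U+00NN, including C0 and C1 controls.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

# The mapping is lossless, so encoding gives the original bytes back.
round_trip = decoded.encode("latin-1")

print(code_points[0x81] == 0x81, round_trip == all_bytes)
```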

RE: latin1 decoder implementation

2012-11-16 Thread Whistler, Ken
Yep. --Ken Latin1 explicitly gives no semantics to several byte values (for example 0x81), but acknowledges that other standards will define their semantics. Unicode provides code-points with equally-undefined semantics so that these bytes can pass through without change. This allows a

RE: VS: Mayan numerals

2012-09-26 Thread Whistler, Ken
Marion Gunn wrote: -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Marion Gunn Sent: Wednesday, September 26, 2012 10:53 AM To: 'Unicode List' Subject: Re: VS: Mayan numerals ... This simple request to encode Mayan numerals has