http://www.unicode.org/L2/L2012/12321-n4342-signwriting.pdf
That should give you some ideas about possible alternative approaches
for the material you are dealing with.
--Ken
Could the characters SWR2 to SWR8 be applied to chess symbols, or should
new rotation modifiers be created?
Look at this picture:
http://www.permisecole.com/code-route/priorites/faux-carrefour-a-sens-giratoire.jpg
Imagine you sit in this car and you want to turn RIGHT. What will you
do? Will you turn the driving wheel clockwise or counterclockwise?
And now imagine that you are motoring in a 1904
WIDDERSHINS is shorter than
COUNTERCLOCKWISE, but is not exactly a common term, especially in
technical English.
Aye, but laddie, then we'd have to use DEASIL for CLOCKWISE!
And we'd have wiccans after us to spell it DEOSIL instead. ;-)
--Ken
Garth Wallace asked:
I'm currently working towards a proposal to encode a set of symbols
used in fairy chess and chess variants, and I have a question about
naming conventions. Several of the symbols are rotations of already
encoded symbols. ...
It's even more unclear when it comes to
I think it is imaginable that someone wants to copy a block of
characters from the code charts, as a handy way of getting them for
inspection, e.g. for testing how some particular software renders them
using some particular font(s). I would expect some confusion then if you
had partly got all
Tom Gewecke wondered:
it seems that you would
need permission to copy the glyph. I wonder if that is necessary.
To follow on from Peter Constable's response, it comes down to
the actual scenario at hand and precisely what one means by
copy the glyph.
Scenario 1
I want to use an example
Aaron Cannon asked:
Hi all, from the latest version of the standard, on line 16977 of the
normalization tests, I am a bit confused by the NFC form. It appears
incorrect to me. Here's the line, sans comment:
0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE
0305
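The NFC column on that line is just the canonical reordering of the combining marks by their combining classes; no composition happens because each class-230 mark is blocked. A quick check with Python's unicodedata module (a sketch; it assumes a Python build whose Unicode database covers these characters) reproduces the test file's answer:

```python
import unicodedata

# ccc values: U+0305 = 230, U+0315 = 232, U+0300 = 230, U+05AE = 228.
# Canonical ordering is a stable sort by combining class, so U+05AE
# (lowest class) moves ahead of the class-230 marks, which keep their
# relative order, and U+0315 (class 232) ends up last.
s = "\u0061\u0305\u0315\u0300\u05AE\u0062"
nfc = unicodedata.normalize("NFC", s)
print([f"{ord(c):04X}" for c in nfc])
# ['0061', '05AE', '0305', '0300', '0315', '0062']
```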
Eli,
Embeddings are common in generated text. The guiding principle, seemingly,
is: when in doubt, wrap the string in an embedding. At the UTC we heard
that this can lead to very deep stacks, though I've never actually seen
one with more than 63 levels. But that is not my topic here.
I'd
Eli,
I think you are correct that the BidiCharacterTest.txt data currently
does not go beyond 3 nesting levels for testing the BPA part of UBA.
I agree with Andrew that that is a reasonable guide to the normal limit
of meaningful bracket embeddings one might find in text. However,
I don't think it
I disagree that this makes N0 a recursive rule. It is a rule with
repeatedly
applicable subparts. And like nearly all the rules in the UBA (except ones
which explicitly state that they apply to *original* Bidi_Class values,
which thus have to be stored across the life of the processing
Eli asked in response to Andrew:
· Since 2-17 is now R and not neutral, the resolution of 3-9 is R because
the
check for context finds the opening parenthesis at 2 (now R) before the a
at 1.
Therefore 2-17 is R under N0c2.
But there's nothing about this in the UAX#9 language!
Michael,
“Declines to take action” is pretty thin.
A proposal which is declined by the UTC doesn't automatically
create an obligation to write an extended dissertation explaining
the rationale and putting that rationale on record. It might be
one thing if there were a lot of controversy
Fantasai asked:
I would like to request that Unicode include, for each writing system it
encodes, some information on how it might justify.
Following up on the comment and examples provided by Richard
Wordingham, I'd like to emphasize a relevant point:
Scripts may be used for *multiple*
Andrew,
Everybody recognizes the potential risks of getting out too
far over one's skis in implementations, but this particular one
seems a relatively small risk. Seldom (if ever?) has a NB
objected in ballot to these small repertoire additions that
have periodically been tacked on at the end of
Karl Williamson noted:
The FAQ http://www.unicode.org/faq/private_use.html#sentinels
says that the last 2 code points on the planes except BMP were made
noncharacters in TUS 3.1. DerivedAge.txt gives 2.0 for these.
The *concept* of noncharacter was not invented until Unicode 3.1,
so it
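For reference, the full set of 66 noncharacters (the contiguous range U+FDD0..U+FDEF plus the last two code points of every plane, as the FAQ describes) is easy to enumerate; a sketch:

```python
def is_noncharacter(cp: int) -> bool:
    # 66 noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF
    # on each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(count)  # 66
```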
Hmmm.
Any programming language project that derives from someone who describes
himself as a “polyhistor”, which claims to be polymorphic and pasigraphic and
multi-lingual and orthogonal and polysynthetic, which draws its inspiration
from the
theory of “Natural Language Metasemantics”, and which
You cannot even be very confident of not finding actual ill-formed
UTF-16, like unpaired surrogates, in an external file, let alone
noncharacters.
As for the noncharacters, take a look at the collation test files
that we distribute with each version of UCA. The test data includes
test strings
Richard Wordingham asked:
Is the provisional property 'Indic_Syllabic_Category' defined by
anything deeper than the UCD file IndicSyllabicCategory itself?
Basically, no. It simply gathers together information scattered
about in the core spec and elsewhere about claims regarding
what all the
On 23 Apr 2014, at 22:16, Mathias Bynens math...@qiwi.be wrote:
Let’s say I’m writing a program that strips combining characters and
grapheme extenders from an input string.
For combining marks, I’m looking for any non-combining marks (e.g. `a`)
followed by one or more combining marks
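One rough way to do the stripping in Python is the following sketch. Note the assumption it makes: it treats "combining character" as "nonzero canonical combining class", which is not the same set as the Grapheme_Extend property the question also mentions; a production implementation should read that property from DerivedCoreProperties.txt instead.

```python
import unicodedata

def strip_combining(s: str) -> str:
    # Decompose first so precomposed letters expose their marks,
    # then drop anything with a nonzero canonical combining class.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)

print(strip_combining("ma\u0300\u0301"))  # ma
```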
Given the incredible level of interest shown on this list during
the last week, I am glad that I can finally announce the publication
of Bidi Brackets for Dummies:
http://www.unicode.org/notes/tr39/
I had wanted to publish that several weeks ago, but unfortunately,
publication was held up for
Mathias,
What are the “stability extensions” this document refers to?
Here are the code points that match the respective property according to
`DerivedCoreProperties.txt`, yet don’t match these properties if you’re
adding/removing the categories manually based on the property definition in
Ilya noted:
[Below, I completely ignore BIDI part of the specification, and
concentrate ONLY on the parens match. I do not understand why this
question is interlaced with BIDI determination; I trust that it is.]
Actually, it is, because the bracket-matching is really only
Ilya,
U+23AF is *definitely* not a variation selector at all.
It is part of a set of bracket pieces (and other graphic pieces)
in the range U+239B..U+23B1.
See discussion of the topic at:
http://www.unicode.org/forum/viewtopic.php?f=35&t=206
See also Section 2.13 of UTR #25:
Yucca noted:
These glyphic pieces of symbols are only relevant and useful
in the context of mathematical typesetting programs like TeX.
I’m not sure whether TeX uses such characters at all. TeX is oriented
towards typesetting glyphs, often not caring that much about abstract
characters.
I don’t think the answer is directly deduced from UAX #9, because
it involves deciding where to insert a visible hyphen for display.
However, I think the correct answer here is your number two guess,
i.e. (in a RTL paragraph context):
-car SI TORRAC
A way to think about this, rather than
Richard Wordingham noted:
As U+2010 HYPHEN would result in text like 'car-', in an English
influenced context I would also go with 'car-'.
That's always a possibility, I suppose, but I'm not sure what
English influenced context means here.
The examples I just gave were for a RTL
Is it legitimate to truncate the context to a single line? The BiDi
algorithm is attempting to interpret unlabelled text as embedded text
(it's not an arbitrary dance), and in just one line there is no
indicator of whether the hyphen is part of the LTR text embedded in RTL
text.
For
And I think you need to distinguish between *proximate*
behavior in an editor and editing behavior in general.
Once a user enters editing mode, the expectation that we
(the software community writing text editors) have built,
in interaction with users, is that within reason, something that
you
Well, I actually don’t see. I took a look at the Sinhala you inserted in this
email. I cannot tell what you did at your input end (about “inserted all
joiners”),
but there are no actual joiners in the text itself. It displayed just fine
in my email (including the correct conditional formatting of
Please be very careful here. Having a non-empty value in field 1 of
UnicodeData.txt is *not* the same as having a Unicode name.
See:
http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207
for the gory details.
The Unicode name is formally defined in terms of the Name property,
which
Per continued:
I know it's not a name. My question was *why* control characters don't
*have* names like
CONTROL CHARACTER NULL
CONTROL CHARACTER START OF HEADING
CONTROL CHARACTER START OF TEXT
etc.
It would be so obvious to have it like that, so I assume there is some
Per asked:
In the DUCET file allkeys.txt,
http://www.unicode.org/Public/UCA/latest/allkeys.txt ,
there is (in 6429) as a comment for some characters.
I first didn't understand why, but then I realized those are control
characters that are part of ISO/IEC 6429.
Why is that pointed out
I agree that a clarification in the text would be better than
a comment in allkeys.txt. But I also think just changing (in 6429)
to (in ISO 6429) would be enough.
(Strange as it might seem to list regulars, not everyone immediately
makes the right association from this four-digit number.)
Eric,
The C version of the bidiref code does that, in part.
See the function br_ParseFileFormatB in brinput.c.
http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/
It doesn't actually *transform* the BidiTest.txt file to output the other
format, but it
parses the input and then
Well, inconceivable? No. Inadvisable? Yes.
First of all, such “comments” are not actually “comments”—they are the result
of a fairly cumbersome and drawn-out process of adding *normative* standardized
variation sequences to the standard.
Second – although this is a nit – FE0E and FE0F would
Stephan Stiller seems unconvinced by the various attempts to explain the
situation. Perhaps an authoritative explanation of the textual history might
assist.
Stephan demands an answer:
I want to know why the Glossary claims that surrogate code points are
[r]eserved for use by UTF-16.
Reason
Stephan Stiller noted:
Maybe ... and the origin of the single-glyph ellipsis remains a mystery
to me.
As Philippe surmised, it is a compatibility character, originally included
in the Unicode 1.0 repertoire for cross-mapping to existing legacy
encodings:
Code Page 932: 0x81 0x64
Code Page
I wrote:
As Philippe surmised, it is a compatibility character, originally included
in the Unicode 1.0 repertoire for cross-mapping to existing legacy
encodings:
Code Page 932: 0x81 0x64
Code Page 949: 0xA1 0xA6
Asmus responded:
which just pushes that question forward in time...
Steffen,
FYI, Unicode 7.0, when it comes out, will have another entire
bicameral (casing) script added to it: Warang Citi. And when
Old Hungarian is finally published, at some point after Unicode 7.0,
that will be *another* bicameral script added. It is unlikely that those
two will be the last.
Yucca asked:
As far as I can see, the document summarizes an agreement in an ad hoc
meeting. So it’s not late at all to raise objections, is it?
It is way, way, waaay too late to raise objections for these two.
Those characters are *published* in ISO/IEC 10646:2011 Amendment 1.
They were
Steffen,
Sure. You encounter this problem for any multi-byte EBCDIC-based
character encoding. In fact for any single-byte EBCDIC-based character
encoding, as well. The EBCDIC control that corresponds to a line feed is
either 0x15 or 0x25, depending on revisions. But you wouldn't ordinarily
run
The text in question is not exactly new to Unicode 6.2; it probably goes
back to around the time UTF-8 and UTF-16 were added, over a decade ago.
Getting a single question
on this passage after all these years would seem to indicate that
confusion isn't exactly rampant.
Just to address the
Steffen Daode Nurpmeso continued:
Hmm. To me, this raises the question why these constraints were
introduced at all. Imho either one adds constraints due to solid
considerations, and enforces them after some period of backward
compatibility, or there simply should be no constraints.
What
Poring back over this voluminous thread to Stephan Stiller's original question:
If one wants to indicate vowel length for the length-ambiguous vowels α,
ι, υ in Ancient Greek, one writes ᾱ, ῑ, ῡ. Is there a reason for why
there are no diacritic-precomposed characters? I guess it's because
On 7/30/2013 3:27 PM, Asmus Freytag wrote:
architectures that depended on swapping character sets (code
pages) in mid stream
I thought systems were usually married to a particular code page. I'm
wondering where (historically) you'd actually change to a different
code page
Steffen Daode Nurpmeso observed:
Hello, in UAX #44 i read
Simple_Titlecase_Mapping ...
Note: If this field is null, then the Simple_Titlecase_Mapping
is the same as the Simple_Uppercase_Mapping for this character.
So a parser has to be aware of this, automatically falling back
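The fallback the note describes can be sketched as a small parser fragment. This is hypothetical code, not from any Unicode-distributed tool; the field numbering follows the UnicodeData.txt layout documented in UAX #44 (field 12 = Simple_Uppercase_Mapping, field 14 = Simple_Titlecase_Mapping), and the sample record below is abbreviated:

```python
def simple_titlecase(record: str) -> str:
    # record is one semicolon-separated line from UnicodeData.txt.
    fields = record.split(";")
    # If Simple_Titlecase_Mapping (field 14) is empty, fall back to
    # Simple_Uppercase_Mapping (field 12), per the UAX #44 note.
    return fields[14] if fields[14] else fields[12]

# U+01C6 has a distinct titlecase (U+01C5) next to its uppercase (U+01C4):
record = "01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;;;01C4;;01C5"
print(simple_titlecase(record))  # 01C5
```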
Suppose that these hex bytes:
C3 83 C2 B1
show up in a message and the message contains no hint what its encoding is.
Perhaps it is 8859-1, in which case the message consists of four 1-byte
characters:
C3 = Ã
83 = the “no break here” character
C2 = Â
B1 = ±
Perhaps it
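The two readings are easy to compare concretely; a Python sketch of the ambiguity:

```python
data = bytes([0xC3, 0x83, 0xC2, 0xB1])

# Read as ISO 8859-1: four one-byte characters.
as_latin1 = data.decode("latin-1")
print(len(as_latin1))  # 4

# Read as UTF-8: two two-byte sequences.
as_utf8 = data.decode("utf-8")
print(as_utf8, len(as_utf8))  # Ã± 2
```

As it happens, the UTF-8 reading, Ã±, is itself the familiar signature of ñ (0xC3 0xB1) having been UTF-8-encoded twice.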
Richard Wordingham asked:
How many examples do I need to collect to add Tai Tham to the script
extensions property for ... ?
IMO, a couple well-documented examples ought to suffice.
But, this query raises a couple further questions for me regarding
the scalability and maintenance of
How to write a mail like this:
When you arrive at Madrid airport, follow the sign that looks like this: [?]
Even if the font library supports all needed symbols, it will be easier to
send a photo than to choose the sign from a huge Unicode symbols list.
Yep.
This discussion about signs is
William J.G. Overington asked:
Suppose that a member of the public sends a document that seeks discussion
by the Unicode Technical Committee about whether the scope of what
Unicode encodes should be extended in some particular regard, with the
member of the public writing about why he or she
However, now that I've got your hopes up on procedural grounds...
Getting on to the particulars:
I do have two particular reasons for asking.
2. My research.
There is a document entitled locse027_four_simulations.pdf available from
the following forum post.
Richard Wordingham wrote:
European digits (U+0030 to U+0039) may, since Unicode 6.1.0, be used
with variation selectors. As their primary purpose is for use with
U+20E3 COMBINING ENCLOSING KEYCAP, is it legitimate to fail to
recognise strings of digits with variation selectors as
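A recognizer that wants to be tolerant could simply strip variation selectors before numeric interpretation; a minimal sketch (the helper name is hypothetical):

```python
import re

def parse_digits(s: str) -> int:
    # Drop VS1..VS16 (U+FE00..U+FE0F) before interpreting the digits.
    return int(re.sub(r"[\uFE00-\uFE0F]", "", s))

print(parse_digits("4\uFE0F2"))  # 42
```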
Richard Wordingham wrote:
Actually, there is a subtle and nasty difference, but probably one that
will very rarely strike practical use. Its most obvious manifestation
is in the application of the UCA parametric tailoring
topVariable=u2FD5. U+2FD5 KANGXI RADICAL FLUTE is the last symbol in
Richard Wordingham wrote:
One of the changes from Version 6.1.0 to 6.2.0 of the UCA (UTS #10)
was to change weights from being 16 bits to just being general
non-negative integers. Was this just to accommodate the 4th weight in
DUCET (scheduled for deletion in Version 6.3.0), or is it
Richard Wordingham wrote:
It loosened up the spec, so that the spec itself didn't seem to be
requiring that each of the first 3 levels had to be expressed with a
full 16 bits in any collation element table.
I don't read it that way. But it did allow the 4th weight to go up to
10!
Jukka said:
The comments at the start of NamesList.txt say that it is
“semi-automatically derived from UnicodeData.txt”, but the information
you are referring to has actually been picked up from the code charts.
They contain both informative alias names and cross references.
The
Well, it isn't prohibited, so I guess you will need to be forever vigilant in
view of the possibility that somebody might get it in their head to encode some
combining mark that isn't already accounted for in Tibetan *and* that they
would simultaneously insist that a precomposed form of that
Does anyone feel up to rigorously justifying revisions to the concepts
and algorithms of FCD and canonical closure? Occasionally one will
encounter cases where the canonical closure is infinite - in these
cases, normalisation will be necessary regardless of the outcome of the
FCD check.
-Original Message-
From: ken.whist...@sap.com
Sent: Wednesday, January 23, 2013 10:48 AM
To: 'Costello, Roger L.'
Subject: RE: Why are the low surrogates numerically larger than the high
surrogates?
Why are the low surrogates numerically larger than the high surrogates?
That is,
how does that differ from the RLI
U+2067/PDF U+2068? If it is the same, can we use U+2066 in HTML,
replacing
bdi?
Code points 2066, 2067, and 2068 are unassigned. I presume you mean
U+202B RIGHT-TO-LEFT EMBEDDING (RLE) and U+202C POP DIRECTIONAL
FORMATTING.
No, actually, I think
Sorry, but I have to disagree here. If a list of strings contains items
with lone surrogates (garbage), then sorting them doesn't make the
garbage go away, even if the items may be sorted in correct order
according to some criterion.
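Python's str type makes the point easy to demonstrate, since it tolerates lone surrogates in memory but refuses to serialize them (a sketch):

```python
items = ["b", "\ud800", "a"]  # middle item is a lone high surrogate
items.sort()                  # sorting by code point succeeds...
print(items[0], items[1])     # a b

# ...but the garbage hasn't gone away: it still can't be encoded.
try:
    items[2].encode("utf-8")
except UnicodeEncodeError:
    print("still ill-formed")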
Well, yeah, I wasn't claiming that the principled, correct
Philippe Verdy said:
Well then I don't know why you need a definition of an Unicode 16-bit
string. For me it just means exactly the same as 16-bit string, and
the encoding in it is not relevant given you can put anything in it
without even needing to be conformant to Unicode. So a Java string
Philippe also said:
... Reserving UTF-16 for what the standard discusses as a
16-bit string, except that it should still require UTF-16
conformance (no unpaired surrogates and no non-characters) ...
For those following along, conformance to UTF-16 does *NOT* require no
non-characters.
Martin,
The kind of situation Markus is talking about is illustrated particularly well
in collation. And there is a section 7.1.1 in UTS #10 specifically devoted to
this issue:
http://www.unicode.org/reports/tr10/#Handline_Illformed
When weighting Unicode 16-bit strings for collation, you
http://www.unicode.org/reports/tr10/#Handline_Illformed
Grrr.
http://www.unicode.org/reports/tr10/#Handling_Illformed
I seem unable to handle ill-formed spelling today. :(
--Ken
I'm gonna take a wild stab here and assume that this is Q as the medieval
Latin abbreviation for quingenti, which usually means 500, but also gets
glossed just as a big number, as in “milia quingenta”, “thousands upon
thousands”. Maybe some medieval scribe substituted a Q for |V| (with an
overscore
Stephan Stiller continued:
Occasionally the question is asked how many characters Unicode has. This
question has an answer in section D.1 of the Unicode Standard. I
suspect, however, that once in a while the motivation for asking this
question is to find out how much of Unicode has been used
Whoops!
http://www.unicode.org/alloc/CurrentAllocation.html
--Ken
The editors maintain some statistical information relevant to this fun
question
at:
http://www.unicode.org/alloc/CurrentAllocaiton.html
Yannis' use of the terminology not ... a valid string in Unicode is a little
confusing there.
A Unicode string with the sequence, say, U+0300, U+0061 (a combining grave
mark, followed by a), is valid Unicode in the sense that it just consists
of two Unicode characters in a sequence. It is
One of the reasons why the Unicode Standard avoids the term “valid string” is
that it immediately begs the question: valid *for what*?
The Unicode string U+0061, U+, U+0062 is just a sequence of 3 Unicode
characters. It is valid *for* use in internal processing, because for my own
André Schappo asked:
Been looking at http://www.unicode.org/Public/UNIDATA/Jamo.txt
There appear to be 2 different romanizations at play in the file: one for the
short name and another for the full name.
eg 1100; G # HANGUL CHOSEONG KIYEOK
I have searched unicode.org but cannot find
Well, in answering the question which was actually posed here:
1. ISO/IEC 10646 has absolutely nothing to say about this issue, because 10646
does not define case mapping at all.
2. The Unicode Standard *does* define case mapping, of course, as well as case
folding. The relevant details are in
The UCA algorithm itself has no opinion on this issue. It is simply a
specification of *how* to compare strings at multiple levels, given a
multi-level collation weight table.
The UCA *does* have a default behavior, of course, based on the DUCET table.
And the DUCET table puts all Unicode
Leo asked:
My question was narrower: assuming that the strings being compared are
words, could it be supported without any markup?
... where it refers to conditional weighting based on the (identified) word
boundary. And the answer to that is no, unless the word boundary was explicitly
Leo Broukhis said:
Granted, not yet, but by itself the argument is invalid. Unicode
collation rules are descriptive;
I'm not sure what you mean by that. UTS #10 is a *specification* of an
algorithm, with various options for tailoring and parameterization which make
it possible to
Your misunderstanding is at the highlighted statement below. Actually 0300 *is*
blocked from 0061 in this sequence, because it is preceded by a character with
the same canonical combining class (i.e. U+0305, ccc=230). A blocking context
is the preceding combining character either having ccc=0
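The blocking behavior described here is easy to verify with Python's unicodedata module (a quick check, not the normative definition):

```python
import unicodedata

# U+0305 and U+0300 both have canonical combining class 230.
assert unicodedata.combining("\u0305") == 230
assert unicodedata.combining("\u0300") == 230

# Alone, a + U+0300 composes to U+00E0 under NFC:
print(unicodedata.normalize("NFC", "a\u0300"))  # à
# With U+0305 in between, U+0300 is blocked from the a, so no composition:
print(unicodedata.normalize("NFC", "a\u0305\u0300") == "a\u0305\u0300")  # True
```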
Philippe is (apparently) referring to higher-level protocols for markup of
hieroglyphic text. See, e.g., Table 14-10 and Figure 14-2, p. 489 in Section
14.18, Egyptian Hieroglyphs in TUS 6.2:
http://www.unicode.org/versions/Unicode6.2.0/ch14.pdf
Similar kinds of higher-level protocols are
There isn't an actual problem here which needs a solution, satisfactory or
otherwise. The persistence of the “17 planes may not be enough” meme on this
list is an interesting phenomenon in itself, but it has no practical impact on
any of the actual ongoing work on maintenance of the encoding
Actually, I think the omission here is the word canonical. In other words,
Section 16.4 should probably read:
The base character in a variation sequence is never a combining character or a
*canonical* decomposable character.
Note that with this addition, StandardizedVariants.txt poses no
The first 256 characters of the Unicode Standard *are* compatible with ISO/IEC
8859-1 (Latin-1), but you need to distinguish what happens for the graphic
characters from what happens for the control codes.
ISO 8859-1 defines *graphic* characters in the ranges 0x20..0x7E, 0xA0..0xFF.
Those are
Actually, what Buck really needs is Section 16.1 Control Codes:
http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
That explains the situation for the *non* graphic characters in the range
U+..U+00FF, which is the source of the concern for Buck's skeptical
workmates, I'm sure.
--Ken
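The one-to-one correspondence for all 256 byte values, controls included, can be checked directly (a sketch):

```python
data = bytes(range(256))
decoded = data.decode("latin-1")
# Every byte 0x00..0xFF maps to the Unicode code point of equal value,
# including the C0 (0x00..0x1F) and C1 (0x80..0x9F) control ranges.
assert [ord(c) for c in decoded] == list(range(256))
print("round-trips:", decoded.encode("latin-1") == data)  # round-trips: True
```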
A IANA-registered character *map* is a very different animal from a character
encoding standard per se.
The actual character encoding standard, ISO/IEC 8859-1:1998 does not define the
C0 and C1 control codes (and never will). That was what I was quoting from.
A mapping table, on the other
No, Unicode doesn't. But yes, it *does* follow that decoding C0/C1 control codes
produces a Unicode code point of equal value. RTFM. TUS 6.2, p. 544:
There are 65 code points set aside in the Unicode Standard for compatibility
with the C0 and C1 control codes defined in the ISO/IEC 2022
Yep.
--Ken
Latin1 explicitly gives no semantics to several byte values (for example 0x81),
but acknowledges that other standards will define their semantics.
Unicode provides code-points with equally-undefined semantics so that these
bytes can pass through without change.
This allows a
Marion Gunn wrote:
-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org]
On Behalf Of Marion Gunn
Sent: Wednesday, September 26, 2012 10:53 AM
To: 'Unicode List'
Subject: Re: VS: Mayan numerals
...
This simple request to encode Mayan numerals has