Hi all,
Is Egyptian Demotic on somebody's roadmap for Unicode?
(Egyptian Demotic is what's on the middle third of the Rosetta Stone.)
Stephan
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode
[CR:]
in now exotic styles where the letters /ĉ, ŝ, ẑ, ŋ/ were used as well
Interesting. ẑ, ĉ, ŝ (but not ŋ) have been part of most pinyin
descriptions at the end of dictionaries; ẑ, ĉ, ŝ are still listed in
Xiàndài Hànyǔ Cídiǎn's 6th edition. But de facto no one uses them, and
I'd regard them
Hi Eric,
[We met at the UTC meeting.]
I.
Is it correct that: in bopomofo, the neutral (or light) tone is
represented by U+02D9 ˙ DOT ABOVE, and in the text representation,
that character follows the bopomofo characters of the syllable (just
like all the other characters for tones)
1.
On 9/19/2013 2:35 AM, Stephan Stiller wrote:
As far as I am aware, a proper 'null consonant' has only arisen when
it actually represents a glottal stop.
There's ㅇ in hangeul (Hangul; Korean). Hebrew ע was supposedly
first pharyngeal [ʕ], though it's nowadays standardly a glottal stop
[ʔ
As far as I am aware, a proper 'null consonant' has only arisen when it
actually represents a glottal stop.
There's ㅇ in hangeul (Hangul; Korean). Hebrew ע was supposedly first
pharyngeal [ʕ], though it's nowadays standardly a glottal stop [ʔ] or
null ∅ (and you don't even need a hiatus
On 9/17/2013 10:54 PM, Asmus Freytag wrote:
On 9/17/2013 8:40 PM, Philippe Verdy wrote:
In what way does UTF-16 use surrogate code /points/? An
encoding form is a mapping. Let's look at this mapping:
* One _inputs_ scalar values (not surrogate code points).
In fact the input is
On 9/18/2013 12:02 AM, Stephan Stiller wrote:
That still doesn't mean surrogates are used by UTF-16
= 'That still doesn't mean surrogate _code points_ are used by UTF-16'
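The distinction can be made concrete with a minimal sketch (Python 3, where a str is a sequence of scalar values; the helper name is mine): applying the UTF-16 encoding form to a supplementary-plane character yields surrogate *code units* in the output, while no surrogate *code point* ever appears as input.

```python
import struct

def utf16_code_units(s):
    # Apply the UTF-16 encoding form (big-endian, no BOM) and read the
    # result back as a list of 16-bit code units.
    return [u for (u,) in struct.iter_unpack(">H", s.encode("utf-16-be"))]

# The input is a single scalar value, U+1D11E MUSICAL SYMBOL G CLEF.
s = "\U0001D11E"
print(len(s))                                  # 1: one scalar value in
print([hex(u) for u in utf16_code_units(s)])   # ['0xd834', '0xdd1e']: two
                                               # surrogate code units out
```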
On 9/18/2013 2:42 AM, Philippe Verdy wrote:
There are scalar values used in so many other unrelated domains [...]
There is no risk for confusion with vectors or complex numbers or reals
or whatnot.
On 9/18/2013 8:34 AM, Asmus Freytag wrote:
I concur. Codepoint is the accepted way of referring
Instead of selectively agreeing with Philippe's writing, it would be
good to tell us why the Glossary claims that surrogate code points are
[r]eserved for use by UTF-16 and why there are similar statements in
the Unicode book if
[AF:] [o]nce you add the UTF-prefix, you are, by force, speaking of
[AF:]
It is the wording in your posts that adds to the confusion.
My fundamental point is, has been, and continues to be that whenever
people use the more general word code point instead of the more
appropriate scalar value, that will add to the confusion. If you
make the presupposition
I have been told that Devanagari contains letters (or a letter) that
were invented merely to complete the rectangular C-V table; not sure to
what extent they (or it) were used subsequently.
Wiki
http://en.wikipedia.org/wiki/Devanagari
tells me about the letter ॡ (signifying ḹ, I assume
On 9/17/2013 5:27 PM, Asmus Freytag wrote:
On 9/17/2013 2:55 PM, Stephan Stiller wrote:
[AF:]
It is the wording in your posts that adds to the confusion.
My fundamental point is, has been, and continues to be that whenever
people use the more general word code point instead of the more
I have been told that Devanagari contains letters (or a letter) that were
invented merely to complete the rectangular C-V table; not sure to what
extent they (or it) were used subsequently.
In which reference is this mentioned?
I was referring to oral communication (above I wrote I have been
In what way does UTF-16 use surrogate code /points/? An encoding form
is a mapping. Let's look at this mapping:
* One _inputs_ scalar values (not surrogate code points).
* The encoding form will _output_ a short sequence of encoding
form–specific code units. (Various voices on this list
Twitter - Until recently, characters outside the BMP resulted in a counter
decrement of 2 and BMP characters gave a decrement of 1. Not sure when the change
happened, but now both BMP and non-BMP characters result in a decrement of 1
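The before/after behaviour can be sketched as follows (Python; the function names are hypothetical illustrations, not Twitter's actual code): the old counting matches a count of UTF-16 code units, the new one a count of code points.

```python
def old_style_count(s):
    # Pre-change behaviour: non-BMP characters decremented the counter
    # by 2, i.e. effectively a count of UTF-16 code units.
    return sum(2 if ord(c) > 0xFFFF else 1 for c in s)

def new_style_count(s):
    # Post-change behaviour: every character decrements by 1, i.e. a
    # count of code points (in Python 3, simply len of a str).
    return len(s)

tweet = "I \U0001F496 Unicode"   # U+1F496 is outside the BMP
print(old_style_count(tweet))    # 12
print(new_style_count(tweet))    # 11
```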
Yes!! How might that have happened? ;-)
And the date line of
① Twitter - [...]
② Sina Weibo - [...]
About a year ago I blogged about it
http://schappo.blogspot.co.uk/2012/10/weibo-character-count.html
And your post on Twitter is this one:
http://schappo.blogspot.co.uk/2012/10/twitter-character-count.html
Stephan
them up. The latter interpretation seemed to derive
from terminological imprecision at first, but my concern and suspicion
turned out to be spot-on with what Twitter did historically.
On 9/16/2013 7:19 AM, Philippe Verdy wrote:
2013/9/16 Stephan Stiller stephan.stil...@gmail.com
On 9/16/2013 7:48 AM, Stephan Stiller wrote:
or count code points corresponding to code units because, well, you
can match them up
= or count code points corresponding to UTF-16 code units; those
happen to be BMP code points.
Twitter has been claiming since /at least/ April 2012 that they're
On 9/14/2013 6:24 AM, Michael Everson wrote:
It facilitates comment by those who are reviewing the text.
If you add proofreaders' marks to an especially difficult manuscript,
maybe. I've barely seen annotated papers with comments that would not
have fit into the margins, and there's still the
On 9/15/2013 1:04 PM, Doug Ewell wrote:
André Schappo wrote:
U+2026 is useful for microblogs when one is looking to save characters
Not if the microblog is in UTF-8, as almost all are.
That's an astute observation, but André was talking about input limits
On 9/15/2013 3:07 PM, Phillips, Addison wrote:
Not if the limit is counted in characters and not in bytes. Twitter,
for example, counts code points in the NFC representation of a tweet.
character, code point – these are confusing words :-)
From the link it isn't entirely clear whether they
(a)
Stephan Stiller wrote:
From the link it isn't entirely clear whether they
(a) count scalar values of NFC or
(b) count code points of NFC.
Are they not the same thing, except for surrogates?
Conceptually no, but numerically yes – you are right in that regard, and
I wasn't precise in my
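The numerical equivalence is easy to check with a small sketch (Python 3; the helper name is mine): a well-formed Python str contains no surrogate code points, so after NFC normalization, counting its elements gives both the code point count and the scalar value count.

```python
import unicodedata

def nfc_count(s):
    # Normalize to NFC, then count the elements of the string. In a
    # well-formed string there are no surrogate code points, so this
    # number is the code point count and the scalar value count alike.
    return len(unicodedata.normalize("NFC", s))

print(nfc_count("e\u0301"))   # 1: e + COMBINING ACUTE composes to é
print(nfc_count("\u00E9"))    # 1: already precomposed
```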
Doug wrote me:
You're not confusing code point with code unit, are you?
Thanks for the note.
I think what you're saying is that I thought of (or meant to write about)
counting by first representing the sequence of scalar values in an encoding
form and then counting [code points typecast from] code _units_. I think
You've quoted the sentence out of its context (note the word 'then',
which indicates this context). I do not support this practice.
Philippe, within my message you quote here isn't exactly precise about
context, is it :-)
I think there's a misunderstanding. My annoyance isn't in principle with
[ME:]
Books never used it. The tradition in typing was developed to assist
typesetters to navigate the typewritten text they were setting. The
typesetters never put two spaces after a full stop.
I'm looking at what looks like a US edition/printing (1902) of the
US-American novel Moby-Dick:
On 9/14/2013 3:42 AM, Michael Everson wrote:
On 14 Sep 2013, at 02:30, Stephan Stiller stephan.stil...@gmail.com wrote:
This means that this dot will then need to be followed by two spaces when it is
used as a sentence-ending period.
This tradition is no longer current in the US. Though it's
Hi Philippe,
i.e. (...). at end of a truncated sentence or . (...) at
start of the next truncated sentence
Well, for citations in German I've generally seen [...], and for
English I've seen both [...] and ..., but not (...).
I included them in my sentence (parentheses or
I've never seen it in math proper, is what I meant, but ...
The { [ ( ) ] } hierarchy is used in chemical nomenclature. It is
specified by IUPAC (International Union of Pure and Applied Chemistry).
For example:
acetone
(/R/)-/O/-{2-[4-(α,α,α-trifluoro-/p/-tolyloxy)phenoxy]propionyl}oxime
I did not speak about inter-word spacing (this doesn't affect the
rendering of the ellipsis itself) but about inter-letter spacing.
But the context I provided was that some people ask for . . .[ .], as
ugly as it is :-) And, again, the precise ideal spacing is a matter of
typographic design; you can
Once you've increased the width of these interword spaces to their
maximum, all the characters (and these increased spaces) should be
justified using interletter spacing, and this extra interletter
spacing should be applied as well between the dots of the ellipsis
(showing that they are
Once you've increased the width of these interword spaces to
their maximum, all the characters (and these increased spaces)
should be justified using interletter spacing, and this extra
interletter spacing should be applied as well between the dots
of the
[PV:]
But then the existing ellipsis is not a good candidate because it has
the incorrect metrics where it should use the sinographic metrics.
[...] But the encoded ELLIPSIS does not fit correctly there.
But I think Chinese fonts take care of that.
Stephan
Exactly my thoughts:
In fonts commonly used for word processing and desktop publishing,
HORIZONTAL ELLIPSIS is usually not that well designed.
To me the dots appear too close in plenty of fonts.
But I think that the most common cause of the appearance of HORIZONTAL
ELLIPSIS is that Microsoft
Hi Philippe,
This means that this dot will then need to be followed by two spaces
when it is used as a sentence-ending period.
This tradition is no longer current in the US. Though it's obvious there
are still plenty of middle and high school–level teachers and
college-level writing
:-)
Lots of people still do this. I did until a year or two ago.
I also use non-standard punctuation, but I tend to know what majority
practice is, and when I deviate it's intentional. I don't know about
you, but nearly everyone who tells me that you should use two spaces
(should? says who?)
This tradition is persistent.
Persistent where?
Lots of people
Lots of people who and how many?
Go to a bookstore or library, pick 100 items randomly, and report. If
you want to make a case that it's majority or significant usage in
personal correspondence or outside of professional
This tradition is persistent.
Persistent where?
This is already replied within my message you quote here.
Lots of people
Lots of people who
Same remark.
So there are many contributors on the English Wikipedia. What does
'many' mean? I doubt double spacing of sentences is
Talking about which ...
I confess I usually type a Danish Ø for convenience when I'm using this, though
for publication I would tend to substitute the proper ∅.
Whenever I saw the empty set symbol in printed math literature in
Germany, it closely resembled Ø; I don't think I ever saw a
Regarding the empty set, the page
http://jeff560.tripod.com/set.html
rather convincingly attributes the symbol to André Weil, who says
that it was inspired by the Norwegian letter “Ø”.
Well, if one looks at earlier editions of the Éléments, the symbol is
clearly not printed as
I confess I usually type a Danish Ø for convenience when I'm
using this, though for publication I would tend to substitute
the proper ∅.
Whenever I saw the empty set symbol in printed math literature in
Germany, it closely resembled Ø; I don't think I ever saw a
The notation { } is quite correct. It just isn’t an atomic symbol for
the empty set but an expression consisting of the two characters “{”
and “}”, with a list (here, an empty list) of elements between them.
Reminds me of typographically composite stuff that has its own scalar
value (code
The situation with {} is very similar to the situation with 0̸ for
the empty set and with \ for set subtraction. Knuth's version of TeX
was designed for typesetting his books, and he (probably) did not
encounter situations where the meaning of these symbols is ambiguous.
When AMS was
Hi Philippe,
I disagree. For me your spaced-out ellipsis (. . .) is not an
ellipsis but a horizontal ruler (typically used in tables or input
forms) to facilitate the reading of tabular data.
I disagree with CMOS prescription in this case, just as you do, but the
prescription exists,
Hi Gerrit,
I have been aiming at creating a blackletter font
(http://unifraktur.sourceforge.net/maguntia.html)
Cool!
• The four “required” ligatures ch, ck, ſt and tz, which were never
separated in typesetting. These can be realised in the very same way
as antiqua ligatures.
Your page draws
On 9/11/2013 5:56 AM, Gerrit Ansmann wrote:
That’s correct, but that did not seem to stop people from using a long
s in Antiqua from time to time. There are a lot of post-1901 Antiqua
display fonts that contain a long s as well as examples from normal
text. This still happens, if very rarely, even today:
For Web formats (HTML, etc.), the answer is no.
The obvious follow-up to the list: It'd be interesting to know where the
answer is yes.
People will occasionally mention ISO/IEC 2022, which can be thought of
as a meta-encoding or encoding template or encoding constructor, but in
the normal
confusion isn't exactly rampant
I guess so.
But while we're splitting hairs:
There simply are two meanings for the word backup, which in and of
itself is nothing unusual, especially where one of them is the
ordinary sense of the term (not really a technical term).
In the IT domain, the to
On 8/28/2013 3:35 PM, Asmus Freytag wrote:
The original question was about combining UTF-8 and UTF-16 in the same
document.
/Not quite./ Hint: The original question is in the original email.
All good replies
It means the program needs to go back (a.k.a. back up)
but I'd say backtracking would make for better wording in TUS.
Stephan
On 8/5/2013 11:26 AM, Whistler, Ken wrote:
Inclusion of the precomposed characters now seen in the U+1FXX block was part
of the price of the merger. What was included was precisely the repertoire
requested by Greece, and no attempt was made to further rationalize forms
including macrons for
[from RW:]
/For metrical purposes/, we don't know whether the syllable is open or
closed until we know what comes next. [emphasis added]
About that you are right, and it was an oversight on both our parts. But
the dictionary also contains πράσσω with ᾱ in an annotation, and the
weight of
On 8/4/2013 2:59 PM, Richard Wordingham wrote:
The CLDR does not yet support Ancient Greek! [...] Vowels with plain COMBINING
BREVE and COMBINING MACRON don't make it to the list of auxiliary exemplar
characters for
Modern Greek.
This is a non sequitur; why would they for Modern Greek (Dimotiki),
Most of the polytonic precomposed vowels are in the auxiliary exemplars for
Modern Greek.
I don't know – probably because of the Katharevousa legacy and the fact
that Ancient Greek lives on in literary idioms, for which you ordinarily
don't use a macron for reasons of orthographic
Please bear in mind that polytonic vowels ARE used in the language
called Modern Greek.
/Because/ of the Ancient/Attic heritage living on via Katharevousa or
the occasional person persisting in polytonic orthography. In any case,
modern writing has traditionally not used macrons (and certain
I've seen information concerning this
we can no longer encode new precomposed characters for grapheme
clusters that are already encoded in any existing standard form
many times, though I'm not in a position to verify all of your content.
I'm also not proposing to add precomposed
/[One consequence of the string policy is that ]/we can no longer
encode new precomposed characters for grapheme clusters that are
already encoded in any existing standard form/[.]/
And you've truncated the end of my sentence
Well, I have not, unless you really want to count that
There are a number of box characters in the vicinity of U+27FB
You mean U+25FB.
U+25A1 [I think: maybe]
[and]
For diamond [...] U+25CA [I think: no]
Have you read my previous discussion and looked at UTR 25 (p. 20 and
also Ideal Sizes on p. 19)?
U+25AB [and] U+25FD
Definitely too
Hi,
If one wants to indicate vowel length for the length-ambiguous vowels α,
ι, υ in Ancient Greek, one writes ᾱ, ῑ, ῡ. Is there a reason for why
there are no diacritic-precomposed characters? I guess it's because
macron usage is rare in orthographic practice, even though vowel length
here
Characters restricted to dictionaries are generally not well
supported.
And modern textbooks in a modern world :-)
The practice in Scott and
Liddell is to reserve ᾱ, ῑ and ῡ for a note after the dictionary entry.
Liddell Scott is old, just like Lewis Short. We've moved on since
then, and
The practice in Scott and Liddell is to reserve ᾱ, ῑ and ῡ for a
note after the dictionary entry.
I'm looking at Liddell-Scott-/Jones/ here http://www.tlg.uci.edu/lsj/
and at old pdf's of Liddell Scott [only] by Google, and I cannot
easily confirm your statement. Perhaps it holds for
On 7/30/2013 3:27 PM, Asmus Freytag wrote:
architectures that depended on swapping character sets (code pages)
in mid stream
I thought systems were usually married to a particular code page. I'm
wondering where (historically) you'd actually change to a different code
page mid-stream.
I have a question regarding the supported Unicode code page.
There are no Unicode code pages.
I guess there is the question of what exactly a codepage is when you
consider complicated encodings, esp. stateful ones. But I always think of
Unicode as one giant abstract codepage, and Unicode
What is wrong with using DIAMOND OPERATOR?
wrong is strong wording and goes beyond what I suggested or implied,
but it's not clear to a user of Unicode that it's the right fit either.
There are a couple of indicators factoring in:
* The charts mention modal logic in conjunction with ◻
Why not contact the relevant publishers and find out what they are using?
Why not contact the relevant governments and find out what they're
using in order to solve /_*all*_/ encoding issues for /_*all*_/
languages and writing systems within a day? :-)
Publishers use metal type (or various
Hi Jörg,
Thanks for the info!
U+25C7 WHITE DIAMOND
is the best choice
I'm with you in that for now I'll go with
⟨◻ (U+25FB), ◇ (U+25C7)⟩
as the pair of choice, pending further decisions; see also what I'm
writing further down. Or objections from experts stating that the symbol
Hi all,
Modal logic uses a box and a diamond (this is what they're informally
called) as operators (accepting one formula and returning another) to
denote necessity and possibility, resp. Older texts might use the
letters L and M (resp). Which Unicode codepoints do modal box and
diamond
Hi Richard,
I know of standards for transcribing foreign alphabets (by /target/
locale – Are they relevant here? If so, which?) [...]
This may well depend on both source and target locale! How often
will locale have to be broken down on a non-local basis? Different
newspapers in the same
Hi Jonathan,
I definitely appreciate the partial datapoints from your links, but
Google is your friend
by itself doesn't lead us closer to a real answer, and in this case I
think that there are at least some good answers, and in any case some
answers will be better than others.
This
My impression is that US customs officials are either quite
knowledgeable or quite tolerant on such issues (or a mixture of both).
The same applies to customs officials in other countries I have
traveled to, and other people at airports and such.
Thanks. (And, I don't have the knowledge to
Hey Jonathan,
The official transliteration for Hebrew to the Latin script is obsolete
What is the latest recommended scheme?
and the situation in this country is a mess
Let me guess: it has to do with the number of spelling variants in names
of /aliyah/ immigrants? I've always been
See http://www.icao.int/publications/Documents/9303_p1_v1_cons_fr.pdf ,
especially Appendice 8 (p IV-50). The English version is available as
http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf ,
especially Appendix 8 (p IV-47).
I suppose you can't go wrong with what your own
I suppose you can't go wrong with what your own passport says
On second thought ...
* disallowed: Ä↛A , Ö↛O , Ü↛U (as are: Å↛A , Ø↛O)
... I have a Turkish friend for whom it is Ö→O, not OE. This calls into
question the general applicability of these rules. A few years ago he
also told
Hi folks,
For languages whose alphabets aren't too far apart (I'm thinking mostly
of the set of Latin-derived alphabets), what is a good place to find out
how replacements are made for letters that are missing in a different
country/locale?
For example, how will an Icelander
On Fri, Jun 14, 2013 at 10:45 AM, Michael Fayez
michaelfa...@hotmail.com mailto:michaelfa...@hotmail.com wrote:
I noticed that double small parentheses that are used in
professional printing in Arabic presses are not encoded in Unicode.
[...] http://i.imgur.com/aAgRDq1.jpg
So
Thank you, خالد and Richard.
there is only one Indic mark I can think of for which
the issue of component association arises, and that is the nukta
That is good to know, given the complexity of the Indic scripts.
Other thoughts:
* One could simply break up Arabic ligatures in need of
Hi,
How is the placement of vowel marks around ligatures handled in Arabic text?
Does anyone have good pointers on this topic?
My guess is that this does not come up often (just like the topic of
pointing for handwritten Hebrew), as vowel marks are mostly not added in
ordinary text.
Familiarity with a writing system makes the non-obvious parts
comprehensible, as can context.
The work is a thorough listing of usage instances that the authors could
encounter in the wild. My informants can't recall ever having seen many
of these characters. They wouldn't use them, and that
For me 'non-standardized' means there is not one recognized standard;
this does not mean that things are completely unstable, nor that there
are no traditions of what character is used for what word that have
been passed down for many generations.
/As I stated/, for a decent number of
The way the Cheung-Bauer list was compiled certainly hard to see how
most of the characters would be in widely known.
I'd need to look at CB again for accurate numbers, but to some extent
it's simply because some syllable-morphemes are listed with many
different attested possibilities. So
http://www.unicode.org/reports/tr38/ does a good summary of the
possibilities.
Which and where?
Trying to fold from one locale to another, which is what folding
from traditional to simplified would be, is not a good idea; best
practice is to bear in mind the locale being used, and do
The situation also tends to be complex once one steps outside of
Putonghua.
Given that the situation there is a lack of standardization (and a lack
of tables laying out variant spellings), I don't think anything other
than radical, hand-tuned folding to cover all possibilities is sensible
to
As far as general folding is concerned, performing conversion (whether
it's word-based or not and even if it's locale-tailored) and then a
strict search will let you miss out on the z-variation you find in the
wild (because of true variation or of misspellings), and a more generous
inclusion
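As a toy illustration of the point about conversion-plus-strict-search (Python; the two-entry folding table is made up and wildly incomplete, and real conversion is word-based and context-sensitive): folding *both* the query and the text to a common form is more forgiving than converting the text once and then searching strictly.

```python
# Hypothetical, tiny folding table: both 發 (U+767C, 'emit') and
# 髮 (U+9AEE, 'hair') fold to simplified 发 (U+53D1). A real table
# would be word-based and far larger.
T2S_FOLD = {"\u767C": "\u53D1", "\u9AEE": "\u53D1"}

def fold(s):
    # Character-by-character folding toward simplified forms.
    return "".join(T2S_FOLD.get(c, c) for c in s)

def folded_match(needle, haystack):
    # Fold both sides before searching, so traditional and simplified
    # spellings (and, with a richer table, z-variants) match each other.
    return fold(needle) in fold(haystack)

# Traditional 髮 found in simplified 理发店 ('barbershop'):
print(folded_match("\u9AEE", "\u7406\u53D1\u5E97"))  # True
```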
I.
Which and where?
Section 3.7.1 Simplified and Traditional Chinese Variants talks about
converting between Simplified and Traditional Chinese.
You wrote this
http://www.unicode.org/reports/tr38/ does a good summary of
the possibilities.
in response to my inquiry
better word choice:
lexical variation → orthographic variation (in my prev email)
So we both agree that Unihan is not designed to tell people how to
convert between traditional and simplified characters.
Yep.
Though there is some confusion as to what other questions are being discussed here.
I think I misused the expression folding at some point. But the
original query explicitly
Hi John,
This is one of those questions that I've been wondering about as well
... my guess would be yes that should work (and dealing with z-variants
is something you'll likely need to do anyways), but there *must* be
some published algorithm out there that specifically addresses the issue
simplified [is] better thought of as abbreviated
Part of this is a terminological argument. The historical situation is
indeed more complicated than many people know, but the truth is also
that irrespective of, e.g., people's past or present usage in handwriting
there have (in the past and esp
Excellent question and points from Albrecht Dreiheller.
[AD:]
So the _receptive vocabulary_ might be pretty big for many people.
[...]
So the _productive vocabulary_ of symbols will always be very, very small.
I was thinking a similar thing, and I'm inclined to agree.
But I know of
The Noun Project seem determined to create a pictogram for every noun,
and many short phrases:
See http://blog.thenounproject.com/
Huh.
What are the constraints on the symbols; e.g., what resolution can the
symbols be (so that we don't simply use detailed high-res pictures)? Are
there any
SignWriting has both a normative form, to be generated by computer
programs, and a handwriting form allowing more freedom. It has been
developed using signs that are not so complicated to reproduce in a
meaningful way.
Could you provide a link with signwritten sentences in the
[Charlie Ruland:]
The Unicode Consortium is prepared to encode all characters that can
be shown to be in actual use.
Are you sure there is a precedent for what is essentially markup for a
system of (alpha)numerical IDs?
Stephan
what the western world knows
as „calligraphie“, e.g., in Germany elementary school kids are
graded on the prettiness of their handwriting.
I've only ever encountered the word Kalligraphie (now preferred:
Kalligrafie) in the meaning of artistic writing in Germany. If the
word is also used
sign-writing
SignWriting is also difficult to write.
naturally evolved
I will be very curious to see the result after a bit of evolution (I
hope there will be some), with a system that can actually be written
easily by hand (or at least input quickly with the right input method)
and that
In India you could have telegrams
containing such sentences delivered in any of the major Indian
regional languages.
This was a good idea in the days of the low-bandwidth telegraph
And it was a domain-restricted application.
Stephan
SignWriting is also difficult to write.
Not necessarily more so than for those who learn to write Chinese.
Learning how to write Chinese is difficult. It only takes like 6.5
years of schooling, and when students go abroad for college, they
quickly forget how to write many characters. In fact,
SignWriting has both a normative form, to be generated by computer
programs, and a handwriting form allowing more freedom. It has been
developed using signs that are not so complicated to reproduce in a
meaningful way.
Could you provide a link with signwritten sentences in the /latest/
I am wondering whether it would be a good idea for there to be a list of
numbered preset sentences that are an international standard and then if Google
chose to front end Google Translate with precise translations of that list of
sentences made by professional linguists who are native
Not perfect, perhaps, but perfectly comprehensible. And the application will
even
do a very decent job of text to speech for you.
and
The quality of the
translation for these kinds of applications has rapidly improved in recent years
Not that the ability of MT to deal with
As regards any possible case for encoding localizable sentences *as
characters*,
in my opinion, the train long ago left the station for that one.
Indeed, people have been devising systems for representing words and
sentences via ordinary numbers that worked just fine for at least 170
This one is incredible:
https://bugzilla.redhat.com/show_bug.cgi?id=922433
This sort of failure to perform input validation and/or escaping is also
a sign of bad software engineering in general. I recall an important CGI
form at my university refusing to let me submit because I input an