Re: Is the binaryness/textness of a data format a property?

2020-03-21 Thread Julian Bradfield via Unicode
On 2020-03-21, Eli Zaretskii via Unicode  wrote:
>> Date: Sat, 21 Mar 2020 11:13:40 -0600
>> From: Doug Ewell via Unicode 
>> 
>> Adam Borowski wrote:
>> 
>> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
>> > or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
>> > its uses but is not well-formed Unicode.
>> 
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode.  GB18030 is one example of such charsets.  The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8 like sequences to represent such characters.

My own (now >10 year old) Unicode adaptation of XEmacs does the same,
even for charsets that can be mapped into Unicode. To ensure complete
backward compatibility, it distinguishes "legacy" charsets from Unicode,
and only does conversion when requested.
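
A minimal sketch of how a UTF-8-like byte sequence can carry values
outside Unicode, assuming the original 31-bit UTF-8 bit layout (lead
bytes 0xF8 and 0xFC for the 5- and 6-byte forms). The function name
`encode_extended_utf8` is hypothetical, and this is not a description of
how Emacs or XEmacs actually lay out their internal representation.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode an integer using the original UTF-8 bit layout (1 to 6 bytes).

    Unlike a conforming UTF-8 encoder, this accepts surrogates
    (U+D800..U+DFFF) and values above U+10FFFF, up to 0x7FFFFFFF.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value does not fit in the 31-bit scheme")
    if cp < 0x80:
        return bytes([cp])  # single byte, identical to ASCII

    # (exclusive upper limit, total length in bytes, lead-byte marker)
    for limit, length, lead in (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ):
        if cp < limit:
            break

    out = []
    # Emit continuation bytes (10xxxxxx), least significant 6 bits first.
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # The remaining high bits go into the lead byte.
    out.append(lead | cp)
    return bytes(reversed(out))
```

For values inside the Unicode range the output matches standard UTF-8;
anything from 0x200000 upwards needs five or six bytes, which a
conforming decoder will reject.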



Re: On the lack of a SQUARE TB glyph

2019-09-27 Thread Julian Bradfield via Unicode
On 2019-09-27, David Starner via Unicode  wrote:
> On Thu, Sep 26, 2019 at 8:57 PM Fred Brennan via Unicode
> wrote:
[snip]
>> There is no sequence of glyphs that could be logically mapped, unless you're
>> telling me to request that the sequence T  B be recommended for general
>> interchange as SQUARE TB? That's silly.
>
> Why is that silly? You've got an unbounded set of these; even the base
> prefixes EPTGMkhdmμnp (and da) crossed with bBmglWsAKNJCΩT (plus a
> bunch more), which is over 200 combinations without all the units, and
> there's some exponents encoded, so some of those will need to be
> encoded with exponents. And that's far from a complete list of what
> people might want as squares.

Wouldn't T  B 
be a better sequence? 
In fact, it would have been nice (especially for mathematicians) if
all combining marks could have been applied to character sequences, by
means of some "high precedence ZWJ" that binds more tightly than
combination.
(Playing devil's advocate here, since I don't think maths is plain
text:)

Or one could allow IDS to have leaf components that are any
characters, not just ideographic characters, and then one could have
all sorts of fun.


acute-macron hybrid?

2019-04-30 Thread Julian Bradfield via Unicode
The celebrated Bosworth-Toller dictionary of Anglo-Saxon uses a
curious diacritic to mark long vowels. It may be described as a long
shallow acute with a small down-tick at the right.
It contrasts with an acute (quite steep in this typeface) used to mark
accented short vowels.
Both can be seen in the fifth line of the scan at
http://lexicon.ff.cuni.cz/png/oe_bosworthtoller/b0002.png

What is its appropriate Unicode representation?
As a lumper, I would use a macron, but I wonder what a splitter would
say.


mildly OT from bidi - curious email

2019-02-06 Thread Julian Bradfield via Unicode
The current bidi discussion prompts me to post a curiosity I received
today.

I ordered something from a (UK) company, and the payment receipt came
via Stripe. So far, so common. The curious thing is that the (entirely
ASCII) company name was enclosed in a left-to-right direction, thus:

Subject: Your Aaa Ltd receipt [#-]

where  and  are the bidi control characters.

I don't think I've seen this before - I wonder why it happened?

Also today I got an otherwise ASCII message where every paragraph
started with BOM (or ZWNBSP as my font prefers to call it). I see from
the web that people used to do this - anybody know what the most
common software packages that do it are?
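
A minimal sketch of how one might spot such invisible characters in a
header or message body. The sample subject line is hypothetical: the
actual controls in the receipt are not recoverable from the text above,
so LRE and PDF are used purely for illustration, and the helper names
are made up.

```python
# Directional formatting characters defined by the Unicode bidi algorithm.
BIDI_CONTROLS = {
    "\u061C",                                          # ALM
    "\u200E", "\u200F",                                # LRM, RLM
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}
ZWNBSP = "\uFEFF"  # ZERO WIDTH NO-BREAK SPACE, also used as the BOM


def find_invisibles(text):
    """Return (index, 'U+XXXX') pairs for bidi controls and U+FEFF."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS or ch == ZWNBSP
    ]


def strip_invisibles(text):
    """Remove bidi controls and U+FEFF, leaving everything else intact."""
    return "".join(
        ch for ch in text if ch not in BIDI_CONTROLS and ch != ZWNBSP
    )


# Hypothetical subject line with the company name wrapped in LRE ... PDF.
subject = "Your \u202AAaa Ltd\u202C receipt"
print(find_invisibles(subject))
print(strip_invisibles(subject))
```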



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Julian Bradfield via Unicode
On 2019-01-27, Michael Everson via Unicode  wrote:
> On 27 Jan 2019, at 05:21, Richard Wordingham 
>  wrote:
>> The closing single inverted comma has a different origin to the apostrophe.
> No, it doesn’t, but you are welcome to try to prove your assertion. 

As far as I can tell from the easily accessible literature, the
apostrophe derives from an in-line manuscript mark that is a point
with a tail, while the quotation marks derive from a marginal mark
shaped like an arrowhead (like modern guillemets). What is your story
about them?

>> Is someone going to tell me there is an advantage in treating "men's" as one 
>> word but "dogs'" as two?  As I've said, the argument for encoding English 
>> apostrophes as U+2019 is that even with adequate keyboards, users cannot be 
>> relied upon to distinguish U+02BC and U+2019 - especially with no feedback. 
>> A writing system should choose one and stick with it.  User unreliability 
>> forces a compromise.
>
> Polynesian users need 02BC to be visually distinguished from 2019. 
> European users don’t need the apostrophe to be visually distinguished from 
> 2019. The edge case of “dogs’” doesn’t convince me. In all my years of 
> typesetting I have never once noticed this, much less considered it a problem 
> that needed fixing.

You have a very low opinion of Polynesian users. People (as opposed to
computers) use context to remove ambiguity. Before we had to interact
with pedantic computers, we were rarely confused by the typewriter-induced
confusion of 1 and l and 0 and O (or, indeed, the use of symmetrical
quotation marks).
Now a sensible orthographic choice for a language using comma-like
letters would be to use guillemets for quotation, and while I don't
know (there being precious few modern Polynesian materials online), I
would guess that the languages of French Polynesia do that.
If, like Hawaiian, you're stuck with English-style quotation marks for
historical reasons, an obvious typographic solution is to thin-space
them, French-style. (See previous thread!). That seems visually
preferable to relying on a small difference in size of what is already
a small letter compared to everything else on the page.





Re: Encoding italic

2019-01-21 Thread Julian Bradfield via Unicode
On 2019-01-21, James Kass via Unicode  wrote:
> Consider superscript/subscript digits as a similar styling issue. The 
> Wikipedia page for Romanization of Chinese includes information about 
> the Wade-Giles system’s tone marks, which are superscripted digits.
>
> https://en.wikipedia.org/wiki/Romanization_of_Chinese
>
> Copy/pasting an example from the page into plain-text results in “ma1, 
> ma2, ma3, ma4”, although the web page displays the letters as italic and 
> the digits as (italic) superscripts.  IMO, that’s simply wrong with 
> respect to the superscript digits and suboptimal with respect to the 
> italic letters.

Wade-Giles (which should be written with an en-dash, not a hyphen, if
we're going to be fussy - as indeed Wikipedia is) is obsolete, but one
could say the same about pinyin. However, printed pinyin with tones
almost invariably uses the combining diacritics; in email where most people
can't be bothered to write diacritics, tone numbers are written just
as you have written above, with a following ASCII digit. (With the
proviso that Chinese speakers don't usually write tones at all when
they write in pinyin.) They're often written like that even in web
pages, where superscripts would be easy - see Victor Mair's frequent
Language Log posts about Chinese writing and printing.
This seems significantly less wrong to me than writing H2SO4 for
H₂SO₄, which is also common in plain text...





Re: A last missing link for interoperable representation

2019-01-15 Thread Julian Bradfield via Unicode
On 2019-01-15, Philippe Verdy via Unicode  wrote:
> This is not for Mongolian and French wanted this space since long and it
> has a use even in English since centuries for fine typography.
> So no, NNBSP is definitely NOT "exotic whitespace". It's just that it was
> forgotten in the early stages of computing with legacy 8-bit encodings but
> it should have been in Unicode since the beginning as its existence is
> proven long before the computing age (before ASCII, or even before Baudot
> and telegraphic systems). It has always been used by typographers, it has
> centuries of tradition in publishing. And it has always been recommended
> and still today for French for all books/papers publishers.

Do you expect people to encode all the variable justification spaces
between words by combining all the (numerous) spaces already available
in Unicode?
And how about the kerning between letters? If spacing of punctuation
is to be encoded instead of left to display algorithms, shouldn't you
also encode the kerns instead of leaving them to the font display
technology?

Oh, and what about dropped initials? They have been used in both
manuscripts and typography for many centuries - surely we must encode
them?




Re: A last missing link for interoperable representation

2019-01-14 Thread Julian Bradfield via Unicode
On 2019-01-14, James Kass via Unicode  wrote:
> Julian Bradfield wrote,
> > I have never seen a Unicode math alphabet character in email
> > outside this list.
>
> It's being done though.  Check this message from 2013 which includes the 
> following, copy/pasted from the web page into Notepad:
>
> 혗혈혙혛 혖혍 헔햳햮헭.향햱햠햬햤햶햮햱햪  © ퟮퟬퟭퟯ 햠햫햤햷 햦햱햠햸  
> 헀헂헍헁헎햻.햼허헆/헺헿헮헹헲혅헴헿헮혆
>
> https://apple.stackexchange.com/questions/104159/what-are-these-characters-and-how-can-i-use-them

Which makes the point very nicely. They're not being *used* to do maths,
they're being played with for purely decorative purposes, and moreover
in a way which breaks the actual intended use as a URL.
If you introduce random stuff into Unicode, people will play with it
(or use it for phishing).
The whole thread is, as it says, "what is this weird stuff"?




Re: A last missing link for interoperable representation

2019-01-13 Thread Julian Bradfield via Unicode
On 2019-01-13, James Kass via Unicode  wrote:
> यदि आप किसी रोटरी फोन से कॉल कर रहे हैं, तो कृपया स्टार (*) दबाएं।

> What happens with Devanagari text?  Should the user community refrain 
> from interchanging data because 1980s era software isn't Unicode aware?

Devanagari is an established writing system (which also doesn't need
separate letters for different typefaces). Those who wish to exchange
information in devanagari will use either an ISCII or Unicode system
with suitable font support.
Just as those who wish to exchange English text with typographic
detail will use a suitable typographic mark-up system with font
support, which will typically not interfere with plain text searching.
Even in a PDF document, "art nouveau" will appear as "art nouveau"
whatever font it's in.

Incidentally, a large chunk of my facebook feed is Indian politics,
and of that portion of it that is in Hindi or other Indian
languages, most is still written in ASCII transcription, even though
every web browser and social media application in common use surely
has full Unicode support these days. Sometimes using your own writing
system is just too much effort!




Re: A last missing link for interoperable representation

2019-01-13 Thread Julian Bradfield via Unicode
On 2019-01-14, James Kass via Unicode  wrote:
> 퐴푟푡 푛표푢푣푒푎푢 seems a bit 푝푎푠푠é nowadays, as well.
>
> (Had to use mark-up for that “span” of a single letter in order to 
> indicate the proper letter form.  But the plain-text display looks crazy 
> with that HTML jive in it.)

Indeed. But
 _Art nouveau_ seems a bit _passé_ nowadays
looks fine and is understood even by those who have never annotated a
manuscript with proof corrections.





Re: A last missing link for interoperable representation

2019-01-13 Thread Julian Bradfield via Unicode
On 2019-01-13, Marcel Schneider via Unicode  wrote:
> As far as the information goes that was running until now on this List,
> Mathematicians are both using TeX and liking the Unicode math alphabets.

As Khaled has said, if they use them, it's because some software
designer has decided to use them to implement markup.
I have never seen a Unicode math alphabet character in email outside
this list.

> These statements make me fear that the font you are using might not support
> the NARROW NO-BREAK SPACE U+202F > <. If you see a question mark between

It displays as a space. As one would expect - I use fixed width fonts
for plain text.

> these pointy brackets, please let us know. Because then, You’re unable to
> read interoperably usable French text, too, as you’ll see double punctuation
> (eg "?!") where a single mark is intended, like here !

I see "like here !".
French text does not need narrow spacing any more than science does.
When doing typography, fifty centimetres is $50\thinspace\mathrm{cm}$;
in plain text, 50cm does just fine.
Likewise, normal French people writing email write "Quel idiot!", or
sometimes "Quel idiot !".

If you google that phrase on a few French websites, you'll see that
some (such as Larousse, whom one might expect to care about such
things) use no space before punctuation, while others (such as some
random T-shirt company) use an ASCII space.

The Académie Française, which by definition knows more about French
orthography than you do, uses full ASCII spaces before ? and ! on its
front page. Also after opening guillemets, which looks even more
stupid from an Anglophone perspective.

> Aiming at extending the subset of environments supporting correct typesetting

There are many fine programs, including TeX, for doing good
typesetting. Unicode is not about typesetting, it's about information
exchange and preservation.





Re: A last missing link for interoperable representation

2019-01-13 Thread Julian Bradfield via Unicode
On 2019-01-12, James Kass via Unicode  wrote:
> This is a math formula:
> a + b = b + a
> ... where the estimable "mathematician" used Latin letters from ASCII as 
> though they were math alphanumerics variables.

Yup, and it's immediately understandable by anyone reading on any
computer that understands ASCII.  That's why mathematicians write like
that in plain text.

> This is an italicized word:
> 푘푎푘푖푠푡표푐푟푎푐푦
> ... where the "geek" hacker used Latin italics letters from the math 
> alphanumeric range as though they were Latin italics letters.

It's a sequence of question marks unless you have an up to date
Unicode font set up (which, as it happens, I don't for the terminal in
which I read this mailing list). Since actual mathematicians don't use
the Unicode math alphabets, there's no strong incentive to get updated
fonts.
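
A minimal sketch of why such "italic" text is a different string from
the word it imitates, assuming Python's standard `unicodedata` module.
The helper `to_math_italic` is hypothetical; it builds the styled word
from code points so the example does not depend on font support.

```python
import unicodedata


def to_math_italic(word):
    """Map ASCII lowercase letters to MATHEMATICAL ITALIC SMALL letters."""
    out = []
    for ch in word:
        if ch == "h":
            # Italic h is U+210E PLANCK CONSTANT; U+1D455 is reserved.
            out.append("\u210E")
        elif "a" <= ch <= "z":
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


styled = to_math_italic("kakistocracy")

# A plain substring search does not find the ASCII word in the styled text.
print("kakistocracy" in styled)

# Compatibility normalization (NFKC) folds the math letters back to ASCII.
print(unicodedata.normalize("NFKC", styled) == "kakistocracy")
```

Software that does not apply compatibility normalization, or has no
font covering the Mathematical Alphanumeric Symbols block, sees only
twelve unrelated supplementary-plane characters.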

> Where's the harm?

You lose your audience for no reasons other than technogeekery. 





Re: A last missing link for interoperable representation

2019-01-13 Thread Julian Bradfield via Unicode
On 2019-01-12, Richard Wordingham via Unicode  wrote:
> On Sat, 12 Jan 2019 10:57:26 +0000 (GMT)
> Julian Bradfield via Unicode  wrote:
>
>> It's also fundamentally misguided. When I _italicize_ a word, I am
>> writing a word composed of (plain old) letters, and then styling the
>> word; I am not composing a new and different word ("_italicize_") that
>> is distinct from the old word ("italicize") by virtue of being made up
>> of different letters.
>
> And what happens when you capitalise a word for emphasis or to begin a
> sentence?  Is it no longer the same word?

Indeed. As has been observed up-thread, the casing idea is a dumb one!
We are, however, stuck with it because of legacy encoding transported
into Unicode. We aren't stuck with encoding fonts into Unicode.




Re: A last missing link for interoperable representation

2019-01-13 Thread Julian Bradfield via Unicode
On 2019-01-12, James Kass via Unicode  wrote:

> Sounds like you didn't try it.  VS characters are default ignorable.

By software that has a full understanding of Unicode. There is a very
large world out there of software that was written before Unicode was
dreamed of, let alone popular.

> apricot
> a︁p︁r︁i︁c︁o︁t︁
> Notepad finds them both if you type the word "apricot" into the search box.

What has Notepad to do with me?

> "But for plain text, it's crazy."
>
> Are you a member of the plain-text user community?

Certainly:)




Re: A last missing link for interoperable representation

2019-01-12 Thread Julian Bradfield via Unicode
On 2019-01-11, James Kass via Unicode  wrote:
> Exactly.  William Overington has already posted a proof-of-concept here:
> https://forum.high-logic.com/viewtopic.php?f=10=7831
> ... using a P.U.A. character /in lieu/ of a combining formatting or VS 
> character.  The concept is straightforward and works properly with 
> existing technology.

It does not work with much existing technology. Interspersing extra
codepoints into what is otherwise plain text breaks all the existing
software that has not been, and never will be updated to deal with
arbitrarily complex algorithms required to do Unicode searching.
Somebody who needs to search exotic East Asian text will know that they
need software that understands VS, but a plain ordinary language user
is unlikely to have any idea that VS exist, or that their searches
will mysteriously fail if they use this snazzy new pseudo-plain-text
italicization technique.
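
A minimal sketch of the failure mode, assuming a search that compares
code points directly (as a plain substring search does). The helper
`vs_aware_contains` is hypothetical and only handles the two standard
variation selector blocks; it is not what any particular editor does.

```python
import re

# VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF).
VARIATION_SELECTORS = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]")

plain = "apricot"
# Put VS2 (U+FE01) after every letter, as in the example quoted up-thread.
decorated = "".join(ch + "\uFE01" for ch in plain)


def vs_aware_contains(haystack, needle):
    """Substring test that ignores variation selectors on both sides."""
    strip = lambda s: VARIATION_SELECTORS.sub("", s)
    return strip(needle) in strip(haystack)


# Naive code point search: the decorated word does not contain "apricot".
print(plain in decorated)

# A search that knows variation selectors are ignorable finds it.
print(vs_aware_contains(decorated, plain))
```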

It's also fundamentally misguided. When I _italicize_ a word, I am
writing a word composed of (plain old) letters, and then styling the
word; I am not composing a new and different word ("_italicize_") that
is distinct from the old word ("italicize") by virtue of being made up
of different letters.

I think the VS or combining format character approach *would* have
been a better way to deal with the mess of mathematical alphabets,
because for mathematicians, *b* is a distinct symbol from b, and while
there may be correlated use of alphabets, there need be no connection
whatever between something notated b and something notated *b*.

But for plain text, it's crazy.




Re: A sign/abbreviation for "magister"

2018-11-02 Thread Julian Bradfield via Unicode
On 2018-11-02, James Kass via Unicode  wrote:
> Alphabetic script users write things the way they are spelled and spell 
> things the way they are written.  The abbreviation in question as 
> written consists of three recognizable symbols.  An "M", a superscript 
> "r", and an equal sign (= two lines).  It can be printed, handwritten, 

That's not true. The squiggle under the r is a squiggle - it is a
matter of interpretation (on which there was some discussion a hundred
messages up-thread or so :) whether it was intended to be = .
Just as it is a matter of interpretation whether the superscript and
squiggle were deeply meaningful to the writer, or whether they were
just a stylistic flourish for Mr.




Re: A sign/abbreviation for "magister"

2018-10-31 Thread Julian Bradfield via Unicode
On 2018-10-31, Marcel Schneider via Unicode  wrote:

> Preformatted Unicode superscript small letters are meeting the French 
> superscript 
> requirement, that is found in:
> http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux
> (in French). This brief article focuses on the spelling of the indicators, 
> without questioning the fact that they are superscript.

When one does question the Académie about the fact, this is their
reply:

 Le fait de placer en exposant ces mentions est de convention
 typographique ; il convient donc de le faire. Les seules exceptions
 sont pour Mme et Mlle.

which, if my understanding of "convient" is correct, carefully does
not quite say that it is *wrong* not to superscript, but that one should
superscript when one can because that is the convention in typography.

My original question was:

 Dans les imprimés ou dans le manuscrit on écrit "1<sup>er</sup>,
 45<sup>e</sup>" etc. (J'utilise l'indication HTML pour les lettres
 supérieures.)

 La question est: est-ce que les lettres supérieures sont
 *obligatoires*, ou sont-ils simplement une question de style? C'est à
 dire, si on écrit "1er, 45e" etc., est-ce une erreur, ou un style
 simple mais correct? 

I did not think that their Dictionary desk would understand the
concept of plain text, so I didn't ask explicitly for their opinions
on encoding :)

Which takes us back to when typography is plain text...




Re: second attempt (was: A sign/abbreviation for "magister")

2018-10-31 Thread Julian Bradfield via Unicode
On 2018-10-31, Janusz S. Bień via Unicode wrote:
> On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[ as did I in private mail ]

>> The abbreviation in the postcard, rendered in
>> plain text, is "Mr".
>
> The relevant fragment of the postcard in a loose translation is
>
> Use the following address:   ...
>  is the abbreviation of magister.
>
> I don't think your rendering
>
>Mr is the abbreviation of magister.
>
> has the same meaning.

I do, for the reasons stated by many.

If the topic were a study of the ways in which people indicate
abbreviations by typographic or manuscript styling, then it would be
important to know the exact form of the marks; but that is not plain
text. One cannot expect to discuss detailed technical questions using only
plain text, other than by using language to describe the details.

> Please note that I didn't ask *whether* to encode the abbreviation. I
> asked *how* to do it.

Doug and I have argued that the encoding is "Mr". Further detail can be
given in natural language as a note. You could use the various hacks
you've discussed, with modifier letters; but that is not "encoding",
that is "abusing Unicode to do markup". At least, that's the view I
take!

Perhaps a more challenging case is that at one time in English, it was
common to write and print "the" as "y<sup>e</sup>" (from older
"þe"). Here, there is actually a potential contrast between
the forms "y<sup>e</sup>" ("the") and "ye" (2nd plural pronoun), and
the contrast could be realized: "the/ye idle braggarts are a curse
upon England". Is the encoding of "y<sup>e</sup>" to be "ye" or "the"?
A hard-line plain-texter such as myself would probably argue for
"the".











Re: A sign/abbreviation for "magister"

2018-10-30 Thread Julian Bradfield via Unicode
On 2018-10-30, Marcel Schneider via Unicode  wrote:
> Dr Bradfield just added on 30/10/2018 at 14:21 something that I didn’t 
> know when replying to Dr Ewell on 29/10/2018 at 21:27:

>> The English abbreviation Mr was also frequently superscripted in the
>> 15th-17th centuries, and that didn't mean anything special either - it
>> was just part of a general convention of superscripting the final
>> segment of abbreviations, probably inherited from manuscript practice.
>
> So English dropped the superscript requirement for common abbreviations 

Who said anything about requirement? I didn't.
The practice of using superscripts to end abbreviations is alive and
well in manuscript - I do it myself in writing notes for myself. For
example, "condition" I will often write as "cond<sup>n</sup>", and
"equation" as "eq<sup>n</sup>".

> in the 17ᵗʰ or 18ᵗʰ century to keep it only for ordinals. Should Unicode 

What do you mean, for ordinals? If you mean 1st, 2nd etc., then there
is not now (when superscripting looks very old-fashioned) and never
has been any requirement to superscript them, as far as I know -
though since the OED doesn't have an entry for "1st", I can't easily
check.





Re: A sign/abbreviation for "magister"

2018-10-30 Thread Julian Bradfield via Unicode
On 2018-10-30, James Kass via Unicode  wrote:
> (Still responding to Ken Whistler's post)

> Do you know the difference between H₂SO₄ and H2SO4?  One of them is a 
> chemical formula, the other one is a license plate number. T̲h̲a̲t̲ is 
> not a stylistic difference /in my book/.  (Emphasis added.)

Yes. In chemical notation, sub/superscripting is semantically
significant.
That's not the case for abbreviations: the choice of Mr or any of its
superscripted and decorated variations is not semantically
significant.
The English abbreviation Mr was also frequently superscripted in the
15th-17th centuries, and that didn't mean anything special either - it
was just part of a general convention of superscripting the final
segment of abbreviations, probably inherited from manuscript practice.

> But suppose both those strings were *intended* to represent the chemical 
> formula?  Then one of them would be optimally correct; the other one... meh.
>
> Now what if we were future historians given the task of encoding both of 
> those strings, from two different sources, and had no idea what those 
> two strings were supposed to represent?  Wouldn't it be best to preserve 
> both strings intact, as they were originally written?

Indeed - and that means an image, not any textual representation. The
typeface might be significant too.




Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-21 Thread Julian Bradfield via Unicode
On 2018-08-20, Mark E. Shoulson via Unicode  wrote:
> Moreover, they [William's pronoun symbols] are once again an attempt to 
> shoehorn Overington's pet 
> project, "language-independent sentences/words," which are still 
> generally deemed out of scope for Unicode.

I find it increasingly hard to understand why William's project is out
of scope (apart from the "demonstrate use first, then encode"
principle, which is in any case not applied to emoji), when emoji are
language-independent words - or even sentences: the GROWING HEART
emoji is (I presume) supposed to be a language-independent way of
saying "I love you more every day". Which seems rather more
fatuous as a thing to put in a writing-systems standard than the
things I think William would want.

Not that I want to hear any more about William's unmentionables; I
just wish emoji were equally unmentionable.




Re: Thoughts on Emoji Selection Process

2018-08-11 Thread Julian Bradfield via Unicode
On 2018-08-11, Charlotte Buff via Unicode  wrote:
> There is no semantic difference between a softball and a baseball. They are
> literally the same object, just in slightly different sizes. There isn’t a
> semantic difference between a squirrel and a chipmunk either (mainly
> because they don’t represent anything beyond their own identities just like
> the majority of modern emoji inventions), but at the very least they are
> *different things*.

I think you don't understand the meaning of "semantic", "literally",
or "the same". Which is a pity, because I'm all in sympathy with your
general attitude to emoji and Unicode.

I'm not just being pedantic - I can't even work out what you're
attempting to say in this paragraph.




Re: 0027, 02BC, 2019, or a new character?

2018-01-27 Thread Julian Bradfield via Unicode
On 2018-01-26, Richard Wordingham via Unicode  wrote:
> Some systems (or admins) have been totally defeated by even the ASCII
> version of ʹO’Sullivanʹ.  That bodes ill for Kazakhs.

The head (about to be ex-head) of my university is Sir Timothy O'Shea.
On the student record system, it is impossible to search for students
called O'Shea (I have one). I suppose it doesn't sanitize correctly -
I haven't tried looking for little Bobby Tables yet. It hadn't
occurred to me to check, but of course searching for O’Shea doesn't
work either, as they usually enter their own names into the initial
record, and use 0027.
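
A minimal sketch of a search that copes with both problems, assuming
names are stored with U+0027 as described above. The table layout, the
sample student and the helper names are all hypothetical; the only
library used is Python's standard `sqlite3` module, with a parameterized
query so that the apostrophe never needs escaping by hand.

```python
import sqlite3

# Fold the apostrophe look-alikes discussed in this thread to U+0027.
FOLD = str.maketrans({"\u2019": "'", "\u02BC": "'"})


def fold_apostrophes(name):
    """Replace U+2019 and U+02BC with the ASCII apostrophe U+0027."""
    return name.translate(FOLD)


def find_students(conn, surname):
    """Look up students by surname using a bound parameter, not string pasting."""
    cur = conn.execute(
        "SELECT name FROM students WHERE surname = ?",
        (fold_apostrophes(surname),),
    )
    return [row[0] for row in cur.fetchall()]


# Hypothetical in-memory record system with one student.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, surname TEXT)")
conn.execute("INSERT INTO students VALUES (?, ?)", ("Aoife O'Shea", "O'Shea"))

print(find_students(conn, "O'Shea"))        # typed with U+0027
print(find_students(conn, "O\u2019Shea"))   # typed with U+2019
```

Folding at search time is a design choice for this sketch only; it says
nothing about which character a given orthography ought to use.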





Re: Counting Devanagari Aksharas

2017-04-22 Thread Julian Bradfield via Unicode
On 2017-04-22, Eli Zaretskii via Unicode  wrote:
>> From: Richard Wordingham via Unicode 
[...]
>> I've encountered the problem that, while at least I can search for
>> text smaller than a cluster, there's no indication in the window of
>> where in the window the text is.
>
> I could imagine Emacs decomposing characters temporarily when only
> part of a cluster matches the search string.  Assuming this would make
> sense to users of some complex scripts, that is.  You are welcome to
> suggest such a feature by using report-emacs-bug.

That's what I do in my emacs with combining characters, and if I had
complex script support, I'd expect the same to happen there.
emacs is a programmer's editor, after all :)
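
A minimal sketch of the matching side of that idea for combining
characters, assuming canonical decomposition (NFD) via Python's standard
`unicodedata` module. It does not address how an editor would display
the partial match, and `find_decomposed` is a hypothetical name.

```python
import unicodedata


def find_decomposed(text, needle):
    """Search after decomposing both strings, so part of a cluster can match.

    Returns an index into the NFD form of the text, or -1 if not found.
    """
    nfd_text = unicodedata.normalize("NFD", text)
    nfd_needle = unicodedata.normalize("NFD", needle)
    return nfd_text.find(nfd_needle)


word = "caf\u00E9"  # ends in precomposed LATIN SMALL LETTER E WITH ACUTE

# Searching the stored code points directly misses the base letter.
print("e" in word)

# After decomposition, both the base letter and the accent can be found.
print(find_decomposed(word, "e"))
print(find_decomposed(word, "\u0301"))
```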




Re: Proposal to add standardized variation sequences for chess notation

2017-04-12 Thread Julian Bradfield via Unicode
On 2017-04-12, Philippe Verdy via Unicode  wrote:
> 2017-04-12 8:35 GMT+02:00 Martin J. Dürst :
>> On Go boards, the grid cells are definitely rectangular, not square. The
>> reason for this is that boards are usually looked at at an angle, and
>> having the cells be higher than wide makes them appear (close to) square.
>> However, because diagrams are usually viewed at close to a right angle, Go
>> diagrams use squares, not rectangles.
>
> That's not a valid reason.  "Go" uses **square** cells not **rectangles**
> because of the form of the pieces (round) and the fact they must nearly
> touch each other to surround other pieces.

I don't think Go players and board makers have any interest in your
views of valid reasons.
According to the information provided by various national Go
societies, the typical Japanese Go cell is 22mm by 23.6mm, for the
reason Martin stated.




Re: Standardized variation sequences for the Deseret alphabet?

2017-03-27 Thread Julian Bradfield
While I hesitate to dive in to this argument, Martin makes one comment
where I think a point of principle arises:

On 2017-03-27, Martin J. Dürst  wrote:
[Michael wrote]
>> You know, Martin, I *have* been doing this for the last two decades. I’m 
>> well aware of what a font is and can do.
>
> Great. So you know that present-day font technology would allow us to 
> handle the different shapes in at least any of the following ways:
>
> 1) Separate characters for separate shapes, both shapes in same font
> 2) Variant selectors, one or both shapes in same font
> 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font
> 4) Font selection, different fonts for different shapes
>
> Does that knowledge in any way suggest one particular solution?


As I've observed before, the intention is that we are stuck with
Unicode for as long as our civilization endures, be that 5000 years or
50 years.

I contend, therefore, that no decision about Unicode should take into
account any ephemeral considerations such as this year's electronic
font technology, and that therefore it's not even useful to mention
them.

All you should need to say is "these letters are too insignificant to
merit encoding, and those who believe they need to be able to
distinguish them in plain text will just have to use other means, such
as ZWJ with the components of the ligature".

(I'm not saying that's my view, by the way - I'm more of a splitter
than a lumper, and on the basis of this thread, I'm probably on the
"encode" side.)




Re: Combining solidus above for transcription of poetic meter

2017-03-17 Thread Julian Bradfield
On 2017-03-17, Philippe Verdy <verd...@wanadoo.fr> wrote:
> 2017-03-17 18:27 GMT+01:00 Julian Bradfield <jcb+unic...@inf.ed.ac.uk>:
>
>> If you are happy to use a typographically normal combining breve for
>> the unstressed syllables, you should be happy to use a typographically
>> normal acute accent for the stressed syllable.
>>
>
> You've understood the reverse! The stressed syllable in that notation uses
> a breve, the unstressed syllables use a slash/solidus (which may look very
> similar to an acute accent, but means here exactly the opposite).

I have understood the situation as it actually is (and indeed as it is
described in the Wikipedia article). *As I pointed out*, had you
bothered to read what I wrote, the OP accidentally reversed the
standard notation, in which / indicates a stressed syllable, and a
breve an unstressed.

Hence there is no clash with the (e.g.) Spanish use of an acute to
indicate stress.




Re: Combining solidus above for transcription of poetic meter

2017-03-17 Thread Julian Bradfield
On 2017-03-17, Rebecca T <637...@gmail.com> wrote:
> When transcribing poetic meter (scansion), it is common to use two symbols
> above the line (usually a breve [U+306  ̆] for stressed syllables and a
> solidus / slash [U+2F /] for unstressed syllables) to indicate stress patterns. Ex:

Other way round, as you illustrate

> This approach, however, is problematic; the lack of a combining slash above
> character means that two lines of text must be used, and any non-monospaced
> font (or any platform where multiple consecutive spaces are truncated into one

It won't help to have a "combining solidus a long way above" (which is
what you really want) unless you also have "combining breve a long way
above".
If you are happy to use a typographically normal combining breve for
the unstressed syllables, you should be happy to use a typographically
normal acute accent for the stressed syllable.

> by default, such as HTML) makes keeping the annotations properly aligned with
> the text difficult or impossible — depending on your email client, the above
> example may be entirely misaligned. Being able to use combining diacritics for
> scansion would make these problems obsolete and enable a semantic transcription
> of meter.

If you're working in a situation where you don't have either markup
control or the facility to use plain monospaced text, then just use
normal breves and acutes.
It's not clear to me that laying out aligned text (for which there are
many other applications than scansion, e.g. interlinear translation)
is something best achieved with combining characters!
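
A minimal sketch, in Python, of what "just use normal breves and acutes"
could look like in practice. The helper names and the rule of attaching
the mark to the first vowel of each syllable are assumptions made for
illustration only; the pattern uses the standard convention in which /
marks a stressed syllable and u an unstressed one.

```python
BREVE = "\u0306"  # COMBINING BREVE: unstressed syllable
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT: stressed syllable
VOWELS = set("aeiouyAEIOUY")


def mark_syllable(syllable, stressed):
    """Attach the scansion mark to the first vowel of the syllable."""
    mark = ACUTE if stressed else BREVE
    for i, ch in enumerate(syllable):
        if ch in VOWELS:
            return syllable[: i + 1] + mark + syllable[i + 1:]
    return syllable + mark  # no vowel found: mark the end of the syllable


def scan(syllables, pattern):
    """pattern has one symbol per syllable: '/' stressed, 'u' unstressed."""
    if len(syllables) != len(pattern):
        raise ValueError("need one pattern symbol per syllable")
    return "".join(mark_syllable(s, p == "/") for s, p in zip(syllables, pattern))


# Spaces are kept inside the syllable strings so that joining restores the line.
print(scan(["Shall ", "I ", "com", "pare ", "thee"], "u/u/u"))
```

Because the marks travel with the letters, nothing depends on monospaced
fonts or on whitespace surviving, at the cost of the marks sitting at
normal accent height.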







Re: The (Klingon) Empire Strikes Back

2016-11-08 Thread Julian Bradfield
On 2016-11-08, Mark E. Shoulson  wrote:
> I've heard that there are similar questions regarding tengwar and cirth, 
> but it is notable that UTC *did* see fit to consider this question for 
> them and determine that they were worthy of encoding (they are on the 
> roadmap), even though they have not actually followed through on that 
> yet, perhaps because of these very IP concerns.  Notably, pIqaD is not 

The Tolkien Estate considers that the tengwar constitute a work of
art, and it's not willing to see them in Unicode, because this would
hinder its ability to pursue people using tengwar for what it
considers inappropriate purposes. (I finally asked them a couple of
years ago for permission to encode, based on Michael Everson's draft
proposal from yonks ago, and that's the summary of their reply.)

Several years ago, I was told on this list that it would be up to the
proposers to deal with this, and that the Unicode Consortium would
have no interest in taking on the 800lb legal gorilla that is the
Tolkien Estate. (Now a 24M£ gorilla with what it got from New Line
Cinema.)

If some wealthy Unicode Consortium member feels like paying for an
American counsel's opinion that the Estate is just trying it on, feel
free!




Re: Why incomplete subscript/superscript alphabet ?

2016-10-11 Thread Julian Bradfield
On 2016-10-10, Hans Åberg  wrote:
> There are others, for example, in Dutch, the letter "v" and in "van"
> is pronounced in dialects in continuous variations between [f] and
> [v] depending on the timing of the fricative and the following
> vowel.

Continuous variation is a universal truth of language.
The IPA has mechanisms for describing crude differences in voicing,
but if you're working at the level of, say, a difference between 0 ms
and 20 ms in average voice onset time, you need to be using numbers and
instruments, not symbols and the ear.

The most extreme attempt I know to extend the IPA to fine phonetic detail
is Canepari's book, with lots of symbols not in Unicode (I
think...it's a long while since I looked at it). It's completely ignored,
because the level of detail he attempts to represent is well beyond
the reproducible abilities of phoneticians unaided by acoustic
analysis.

> It has become popular in some dictionaries to use [d] in the
> AmE where the BrE uses [t], but when listening, it sounds more like
> a [t] drawn towards [d].

Are you talking about American flapping, where a /t/ between vowels is
realized as [ɾ]? I'd be surprised if any very serious dictionaries
use <d> to represent that - can you give an example?

> One does not really speak separate consonants and vowels, but they slide over 
> and adapt. Describing that is pretty tricky.

This is also a universal truth of language! But it doesn't stop us
making sensible abstractions, and notating them symbolically.




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Michael Everson <ever...@evertype.com> wrote:
> On 10 Oct 2016, at 21:58, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
>> That's an interesting use of "proprietary" you have there, but I

> You have to have special knowledge and special software to use it.

That's not what "proprietary" means. To quote the OED (which, by the
way, is produced by an actual professional publisher, and is stored in
XML, unless I'm badly mistaken), "proprietary" means "Of a product,
esp. a drug or medicine: of which the manufacture or sale is
restricted to a particular person or persons; (in later use)
spec. marketed under and protected by patent or registered trade
name."
If you're typesetting your bible with no special software and no
special knowledge, then you must be doing it by hand in cold
metal. Somehow, I don't think you are.
I suspect you're using software that is owned by somebody and marketed
and protected.


> Apparently it’s used to good effect in mathematics, though a great
> deal of TeX material appears printed and has an obvious “TeX” feel

It's for printing, so of course it appears printed. The obvious TeX
feel is the result of using the default style, which arises from
Knuth's personal taste in mathematical typesetting, with Lamport's
(abominable) taste in structural layout on top. There are tens of
thousands of journals and books produced with LaTeX, in hundreds or
thousands of styles.

Among publishers you may have heard of, Addison-Wesley, CUP, Elsevier,
John Benjamins, OUP, Princeton UP, Wiley all use LaTeX for a
significant proportion of their output. They're all professionals.

> “Properly”, sayeth the computer programmer. Sorry, Julian, but I use 
> professional tools to typeset, and your disdain for that process isn’t going 
> to change that industry. This “suitable markup” business you’re talking about 
> is not something people outside of ivory towers actually use. 

You're a dilettante publisher using low-end professional graphic
design tools to publish. InDesign, for example, is far easier to use
for far greater effect than any LaTeX-based system if you're producing
magazines or posters; but it's far worse if you care about the content.

> That’s not using Unicode for a hack. That’s using Unicode to preserve 
> distinctions in plain text. 

Only because you've a priori decided that superscripts are plain
text instead of extra-textual decorations.




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Hans Åberg <haber...@telia.com> wrote:
>> On 10 Oct 2016, at 22:15, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
>> What do you mean? The IPA in narrow transcription is intended to
>> provide as detailed a description as a human mind can manage of
>> sounds. It doesn't care whether you're describing differences between
>> languages or differences within languages (a distinction that is not
>> in any case well defined).
>
> It is designed for phonemic transcriptions, cf.,
> https://en.wikipedia.org/wiki/History_of_the_International_Phonetic_Alphabet

It *was* designed, in 1870-something. Try reading the Handbook of the
IPA. It contains many samples of languages transcribed both in a broad
phonemic transcription appropriate for the language, and in a narrow
phonetic transcription which should allow a competent phonetician to
produce an understandable and reasonably accurate rendition of the
passage. Indeed, a couple of decades ago, I participated in a public
engagement event in which a few of us attempted to do exactly that.




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Michael Everson  wrote:
> I can’t use LaTeX notation. I don’t use that proprietary system. And don’t 
> you dare tell me that I am benighted, or using Word. Neither applies.

That's an interesting use of "proprietary" you have there, but I
suppose with your Alician interests, Humpty Dumpty's attitude to words
may have rubbed off on you! What *do* you mean?

> I have an edition of the Bible I’m setting. Big book. Verse numbers. I like 
> these to be superscript so they’re unobtrusive. Damn right I use the 
> superscript characters for these. I can process the text, export it for 
> concordance processing, whatever, and those out-of-text notations DON’T get 
> converted to regular digits, which I need.

If you were doing it properly, the text would be stored in a suitable
markup, as would the verse numbers, and both the typesetting and the
concordance processing would deal with them appropriately.
No need for Unicode hacks.
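
A minimal sketch, in Python, of the separation described above. The
markup (a <v n="..."> element per verse) and the function names are
hypothetical, chosen only for illustration; the point is that the verse
number lives in the markup, so the typesetting step can raise it and the
concordance step can treat it as a reference rather than as text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each verse is a <v> element carrying its number in "n".
SOURCE = """<chapter n="1">
<v n="1">In the beginning God created the heaven and the earth.</v>
<v n="2">And the earth was without form, and void.</v>
</chapter>"""


def verses(xml_text):
    """Return (verse number, verse text) pairs from the marked-up source."""
    root = ET.fromstring(xml_text)
    return [(v.get("n"), v.text.strip()) for v in root.iter("v")]


def typeset(xml_text):
    """Typesetting step: emit LaTeX with the verse numbers raised."""
    return " ".join(
        "\\textsuperscript{%s}%s" % (n, text) for n, text in verses(xml_text)
    )


def concordance_words(xml_text):
    """Concordance step: index words by verse; numbers never enter the text."""
    words = []
    for n, text in verses(xml_text):
        for word in text.split():
            words.append((word.strip(".,;:").lower(), n))
    return words


print(typeset(SOURCE))
print(concordance_words(SOURCE)[:3])
```

Neither step needs to know how the other represents verse numbers.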





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Philippe Verdy  wrote:
> Not relevant! Here we're not speaking about punctuation between words, but
> inclusion within words in phonetic transcriptions where even word
> separation is not always relevant and punctuation is almost absent.
> There's no case in Spanish with "¡" in the middle of a word. But here we're
> speaking about noting a consonant within words where vowels can also be
> expected in phonetic transcriptions. And there the confusion with a
> following vowel i is very likely. On the opposite, IPA symbols are
> carefully chosen to avoid visual confusions (and that's why they only exist
> in a single lettercase).

<i> and <¡> are less confusable than <l> and <ɪ>,
especially in a sanserif font. In both cases, the main visual cue is
a descender/ascender in one letter that isn't in the other.




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Hans Åberg  wrote:
> It is possible to write math just using ASCII and TeX, which was the original 
> idea of TeX. Is that what you want for linguistics?

I don't see the need to do everything in plain text. Long ago, I spent
a great deal of time getting my editor to do semi-wysiwyg TeX maths
(work later incorporated into x-symbol), but actually it's a waste of
time and I've given up. Working mathematicians know LaTeX and its control
sequences. Even my 12-year-old uses LaTeX control sequences to
communicate with his online maths courses.

Because phonetics has a much smaller set of symbols, I do kwəɪt ləɪk
biːɪŋ eɪbl tʊ duː ðɪs, and because they're also used in non-specialist
writing, it's useful to have the symbols hacked into Unicode instead
of hacked into specialist fonts.
But subscripts? No need.




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Michael Everson <ever...@evertype.com> wrote:
> On 10 Oct 2016, at 21:04, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
>> 
>> Linguists don't need internationalization. They use IPA or other notations. 
>
> We need reliable plain-text notation systems. Otherwise distinctions we wish 
> to encode may be lost. 

We have no need to make such distinctions in "plain text".

It's convenient to have major distinctions easily accessible without
font hacking, but there's no need to have every notation one might
dream up forcibly incorporated into "plain text".
In particular, for super/subscripts, which is where we came in, even
the benighted souls using Word still typically recognize and can use
LaTeX notation.




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Hans Åberg  wrote:
>> On 10 Oct 2016, at 21:42, Doug Ewell  wrote:
>> Hans Åberg wrote:
>>> I think that IPA might be designed for broad phonetic transcriptions
>>> [1], with a requirement to distinguish phonemes within each given
>>> language.
...
>> IPA can be used pretty much as broadly or as narrowly as one wishes.
>
> Within each language, but is not designed to capture differences between 
> different languages or dialects.

What do you mean? The IPA in narrow transcription is intended to
provide as detailed a description as a human mind can manage of
sounds. It doesn't care whether you're describing differences between
languages or differences within languages (a distinction that is not
in any case well defined).




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Philippe Verdy <verd...@wanadoo.fr> wrote:
> 2016-10-10 18:04 GMT+02:00 Hans Åberg <haber...@telia.com>:
>> > On 10 Oct 2016, at 15:24, Julian Bradfield <jcb+unic...@inf.ed.ac.uk>
>> wrote:

>> > The alveolar click with percussive flap hasn't made it into the
>> > standard IPA, but in ExtIPA it's [ǃ¡] (preferably kerned together).
>>
>> There is ‼ DOUBLE EXCLAMATION MARK U+203C which perhaps might be used.

> I disagree, IPA does not use such a confusing ligature that would be read as
> a repeated click and not a single one. Reversing the second one (and
> slightly kerning it, though I don't know how, to avoid the confusion with
> "!i", i.e. a click followed by a vowel, most probably writing them on top of
> each other or slanted/italicized) is a valuable visual distinction for a
> single distinctive phoneme.

What confusion? ¡ is not easily confusable with i - ask the Spanish!

> But IPA also proposes something else when more precise distinctions are
> needed for noting not just the linguistic phonemes but their precise

Did you read the bit where I said that?

> Clicks are also pronounceable by themselves in isolation without any vowel
> (in fact it's much easier to pronounce them without a vowel) but they may
> easily be pitched (on a small range of about 6 or 7 musical tones) instead
> of being vocalized. However I've not seen any diacritics to also annotate the
> pitch.

Because no language uses clicks this way, and phonetic alphabets are
not written for composers of mouth music. If one wished to do so, one
would use the standard tone indicators.

> In Chinese vowels are annotated with distinctive tones (but some of them
> variable, where clicks can hardly have a rising or lowering tone). The
> pitch is easily realized by more or less opening the mouth or by slightly
> closing lip or rounding them (giving an appearance of "vowel", though they
> are not voiced through the mouth as they are usually "aspirated" there, but
> only voiced within air expirated through nasal areas). All this looks like

What are you on about?

> technical possibilities of human voice, appropriate for phonetic analysis
> but rarely for actual phonemes of languages as they are hard to be
> distinguished in a group of people.

Those who learn languages natively have no problems distinguishing
voiced, voiceless, aspirated, breathy, nasal, glottalized,... clicks.

> These distinctions are however easier to recognize within the context of a
> complete speech along with other surrounding phonemes (Chinese may be
> realized on 6 or 7 musical pitch tones by any one, but in speech only 3 are
> used and the other phonemic tones are combination of the 3 basic tones, and

(a) There is no such thing as "Chinese" - there are many different
languages in China, with a continuum of dialect gradations.
(b) Even if you mean Mandarin, the usual notation for the five (four
plus neutral) Mandarin tones uses five pitch levels to describe
the contours, not three.
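
A minimal sketch, in Python, of the five-level notation referred to in
(b). The contour values are the conventional textbook ones (55, 35, 214,
51); the neutral tone is left out because it has no fixed contour of its
own. The function name is an assumption for illustration.

```python
# Conventional five-level contours for the four full Mandarin tones
# (1 = lowest pitch level, 5 = highest).
TONE_CONTOURS = {1: "55", 2: "35", 3: "214", 4: "51"}


def chao_letters(contour):
    """Map pitch levels 5..1 to the tone letters U+02E5..U+02E9."""
    return "".join(chr(0x02EA - int(level)) for level in contour)


for tone, contour in sorted(TONE_CONTOURS.items()):
    print(tone, contour, chao_letters(contour))

# All five pitch levels are needed to write these four contours.
print(sorted(set("".join(TONE_CONTOURS.values()))))
```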

> spacing modifiers (and in Pinyin, they are frequently noted with standard
> European digits but have no direct relation with the musical pitch tone or
> even with the 3 basic pitches used to compose the phonemic tones). Chinese
> (but also Vietnamese) may also use diacritics above (acute, grave,
> circumflex, tilde...). Linguists needing internationalization use distinct
> symbols written after the vocalic phoneme or just after a vowelless
> consonantal phoneme, or just after a neutral schwa for a neutral/unclear
> vowel.

Linguists don't need internationalization. They use IPA or other
notations. 




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Hans Åberg <haber...@telia.com> wrote:
>> On 10 Oct 2016, at 15:24, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
>> The alveolar click with percussive flap hasn't made it into the
>> standard IPA, but in ExtIPA it's [ǃ¡] (preferably kerned together).

> There is ‼ DOUBLE EXCLAMATION MARK U+203C which perhaps might be used.

!! was used by one famous Africanist, but that was before ExtIPA existed.

> The preceding discussion was dealing additions to Unicode one-by-one—the 
> question is what might be added so that linguists do not feel restrained.

Linguists aren't stupid, and they have no need for plain text
representations of all their symbology. Linguists write in Word or
LaTeX (or sometimes HTML), all of which can produce a wide range of
symbols beyond the wit of Unicode.

As I have remarked before, I have used "latin letter turned small
capital K", for reasons that seemed good to me, and I was not one whit
restrained by its absence from Unicode - nor was the journal.




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Julian Bradfield
On 2016-10-10, Hans Åberg  wrote:
> I think that IPA might be designed for broad phonetic transcriptions
> [1], with a requirement to distinguish phonemes within each given
> language. For example, the English /l/ is thicker than the Swedish,
> but in IPA, there is only one symbol, as there is no phonemic
> distinction within each language. The alveolar click /!/ may be
> pronounced with or without the tongue hitting the floor of the
> mouth, but as there is no phonemic distinction within any given
> language, there is only one symbol [2]. 

But the IPA has many diacritics exactly for this purpose.
The velarized English coda /l/ is usually described as [l̴]
with U+0334 COMBINING TILDE OVERLAY, or can be notated [lˠ]
with U+02E0 MODIFIER LETTER SMALL GAMMA.

The alveolar click with percussive flap hasn't made it into the
standard IPA, but in ExtIPA it's [ǃ¡] (preferably kerned together).
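
A minimal sketch, in Python, that builds the transcriptions just
mentioned from their code points and lists the character names. It
assumes that the click letter in the ExtIPA example is U+01C3 and that
the second character is U+00A1; the helper name is illustrative.

```python
import unicodedata

velarized_overlay = "l\u0334"      # l + COMBINING TILDE OVERLAY
velarized_gamma = "l\u02E0"        # l + MODIFIER LETTER SMALL GAMMA
click_with_flap = "\u01C3\u00A1"   # assumed: click letter + inverted exclamation mark


def describe(text):
    """List each character of a transcription as 'U+XXXX NAME'."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in text]


for s in (velarized_overlay, velarized_gamma, click_with_flap):
    print("[%s]" % s, describe(s))
```

Note that the overlay form is a base letter plus a combining mark, the
gamma form is two spacing characters, and the ExtIPA click is two
independent characters whose kerning is left to the font or the markup.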

> Thus, linguists wanting to describe pronunciation in more detail are left at 
> improvising notation. The situation is thus more like that of mathematics, 
> where notation is somewhat in flux.

There is improvisation when you're studying something new, of course,
but there's a lot of standardization.




Re: Fwd: Why incomplete subscript/superscript alphabet ?

2016-10-08 Thread Julian Bradfield
On 2016-10-07, Oren Watson  wrote:
> I scarcely think that a use case was submitted for every one of the
> blackboard bold etc letters in the mathematical set; merely the use of
> blackboard bold for a general purpose of denoting sets such as the
> naturals, reals, complex numbers etc, and the fact that arbitrary letters
> might be used if a mathematician desired, seems to have sufficed.

Indeed. I happen to think the whole math alphabet thing was a dumb
mistake. But even if it isn't - and incidentally in some communities
there is or was a convention of using blackboard bold letters for
matrices, which justifies all of them -:

> I believe the same logic applies to the case of linguistics, where the use
> of superscripts are a common convention.

Either superscripts are being used mathematically, in which case you
can use mathematical markup, or they're being used with very specific
semantics, as in the phonetic modifier letters. For the latter case,
there is a standard. First you get your letter recognized by the IPA,
then you encode it. The IPA doesn't recognize arbitrary superscripts.




non-breaking snakes

2016-05-04 Thread Julian Bradfield
See
http://xkcd.com/1676/
(making sure to look at the mouse-over text)




Re: Joined "ti" coded as "Ɵ" in PDF

2016-03-19 Thread Julian Bradfield
On 2016-03-19, Don Osborn  wrote:
> The details may or may not be relevant to the list topic, but as a user
> of documents in PDF format, I fail to see the benefit of such obscure
> mappings. And as a creator of PDFs ("save as") looking at others' PDFs

Aren't you just being bitten by history? PDF derives from PostScript,
which is not a language for representing plain text with typesetting
information, but a language for type(and-graphic-)setting tout court.
There's a lot of history of fonts using arbitrary codepoints; the idea
that the underlying strings giving rise to the displayed graphics
should also be a good plain text representation of the information is
relatively novel.
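
A minimal sketch, in Python, of how this can produce the effect in the
subject line. Everything here is hypothetical: the font, the choice of
slot 0x019F for the "ti" ligature glyph, and the function names. It is
not a description of any particular PDF producer or reader; it only
shows the difference between reading glyph codes as if they were text
and consulting a glyph-to-text table (the role that a ToUnicode CMap
plays in PDF).

```python
# Hypothetical font: glyph codes are chosen by the font maker, and the
# "ti" ligature glyph happens to sit in slot 0x019F.
GLYPH_CODES = [0x63, 0x72, 0x65, 0x61, 0x019F, 0x6F, 0x6E]  # drawn as "creation"

# Hypothetical glyph-to-text table supplied alongside the font.
TO_TEXT = {0x019F: "ti"}


def naive_extract(codes):
    """Treat every glyph code as if it were a Unicode code point."""
    return "".join(chr(c) for c in codes)


def mapped_extract(codes, to_text):
    """Consult the glyph-to-text table first, falling back to the code itself."""
    return "".join(to_text.get(c, chr(c)) for c in codes)


print(naive_extract(GLYPH_CODES))
print(mapped_extract(GLYPH_CODES, TO_TEXT))
```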




Re: Unicode in the Curriculum?

2015-12-31 Thread Julian Bradfield
On 2015-12-31, Andre Schappo  wrote:

> I have been hitting my head against the Academic Brick Wall for
> years WRT getting IT i18n and Unicode on the curriculum and I am
> losing. I did teach a final year elective module on IT i18n but a
> few months ago my University dropped it. I am continually puzzled by
> the lack of interest University Computer Science departments have in
> i18n. I appear to be a solitary UK University Computer Science voice
> when it comes to i18n. 

Well, I'd say that it's not the business of Computer Science degrees
to teach specific technical skills. It's our business to help people
learn about the fundamentals of the subject, so that they can acquire
any specific skill on demand, and use that skill competently. In those
areas where we do teach specific skills (e.g. machine learning
techniques) we teach those that have some intellectual content to
them.  (This is why we don't teach programming languages as such - we
teach a programming language as a means of learning a programming
paradigm.)

In my experience so far, using Unicode and doing i18n is not very
interesting (killingly boring, actually) from a purely CS technical
point of view, unless you happen to be one of the small minority who
enjoys script and font layout issues - the interesting bits of doing
i18n are in producing linguistically and culturally appropriate
messages, and that's where one should bring in experts, not expect
typical software developers to be able to do it.

If you still have the materials for your course, it would be
interesting to see how you managed to get an interesting (and
examinable!) course out of i18n.

I do in fact mention Unicode and i18n in my introductory programming
course (which is not for CS students), but all I say is "you should
know it's there, and if you become a competent programmer, then you
can read the manuals and tutorials to learn what you need".




Re: Unicode in passwords

2015-10-07 Thread Julian Bradfield
On 2015-10-06, Philippe Verdy  wrote:
> I was speaking of OUTPUT fields : you want to display passwords that are
> stored somewhere (including in a text document stored in some safe place
> such as an external flash drive). People can't remember many passwords.

Again, output fields (such as in the Firefox password manager), in my
experience, display the text that is in them, not a stripped and
compressed version. If they don't, it's a bug.
If you start using passwords including NBSP and EM-DASH, then it's
going to get a bit awkward - but you should know you're doing that,
and take measures accordingly.
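
A minimal sketch, in Python, of one such measure: spelling out the code
points of a password so that look-alike characters can be told apart.
The function name and the sample strings are assumptions for
illustration.

```python
import unicodedata


def reveal(password):
    """Spell out each character so that look-alikes can be distinguished."""
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in password
    ]


plain = "correct horse"
tricky = "correct\u00A0horse"  # NO-BREAK SPACE where a SPACE is expected

print(plain == tricky)      # the two look the same but do not compare equal
print(reveal(tricky)[7])    # the character that differs
print(reveal("\u2014"))     # an em dash is not a hyphen either
```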

> Hiding them on screen is a fake security, what we need is complex passwords
> (difficult to memorize so we need a wallet to store them but people will
> also be **printing** them and not store them in an electronic format), and many

It's questionable whether there is ever a need to print a password,
except in the case of an automatically generated hard-copy password
reset. My digital will (if I'd produced one) would need about half a
dozen passwords, mainly the master password for the password manager,
plus some sensitive finance and system admin ones. That's few enough
to write down by hand (or type by hand into a text file), with
appropriate notes.

> passwords (one for each site or application requiring one). But they also
> want to be able to type them correctly: long passwords hidden on screen

Most of our students seem (when I see them logging in to give
presentations) to have long passwords - 20-30 characters - and they
don't seem to have a problem. This also illustrates why defaulting to
hidden passwords is useful.

> Biometric identification is also another fake security (because it is

Not sure what this has to do with Unicode in passwords.

> immutable, when passwords can be and should be changed regularly) and it is

Bruce Schneier is one of the best known and most respected security
researchers around today, and here's his advice:

  So in general: you don't need to regularly change the password to
  your computer or online financial accounts (including the accounts
  at retail sites); definitely not for low-security accounts. You
  should change your corporate login password occasionally, and you
  need to take a good hard look at your friends, relatives, and
  paparazzi before deciding how often to change your Facebook
  password. But if you break up with someone you've shared a computer
  with, change them all. 

( https://www.schneier.com/blog/archives/2010/11/changing_passwo.html )





Re: Unicode in passwords

2015-10-06 Thread Julian Bradfield
On 2015-10-06, Asmus Freytag (t)  wrote:
> All browsers I use display spaces in input boxes, and put blobs for
> hidden fields. Do you have evidence for broken input fields?
> 
> 
> Network keys. That interface seems to consistently give people a
> choice to reveal the key.

? That's not broken in the way Philippe was discussing.

> Copy-paste works on all my systems, too - do you have evidence of
> broken copy-paste in this way?
> 
> 
> I've seen input fields where sites don't allow paste on the second
> copy (the confirmation copy).
> 
> Even for non-password things.

That's not relevantly broken, either - it's a design feature, to make
sure you can type the password again (from finger memory!).




Re: Unicode in passwords

2015-10-06 Thread Julian Bradfield
On 2015-10-06, Philippe Verdy  wrote:
> Finally note that passwords are not necessarily single identifiers
> (whitespaces and word separators are accepted, but whitespaces should
> require special handling with trimming (at both ends) and compression of
> multiple occurrences).

Why would you trim or compress whitespace? Using multiple spaces seems a
perfectly legitimate way of making a password harder to guess.




Re: Unicode in passwords

2015-10-06 Thread Julian Bradfield
On 2015-10-06, Philippe Verdy  wrote:
> I don't think it is a good idea for textual passwords to make differences
> based on the number of spaces. Being plain text they are likely to be
> displayed in user interfaces in a way that the user will not see. Without

This is true of all passwords. Passwords have to be typed by finger
memory, not by looking at them (unless you're the type who puts them
on sticky notes, in which case you type by looking at the text on the
note). One doesn't normally see the characters, at best a count of
characters.

> trimming, users won't see the initial or final space, and the password
> input method may not display them as well (e.g. in an HTML input form or

All browsers I use display spaces in input boxes, and put blobs for
hidden fields. Do you have evidence for broken input fields?

> when using a button to generate passphrases that users must then copy-paste
> to their password manager or to some private text document).

Copy-paste works on all my systems, too - do you have evidence of
broken copy-paste in this way?

> Some password
> storages also will implicitly trim and compress those strings (e.g. in a

If it compresses it on setting, but doesn't compress it on testing, or
vice versa, then that's a bug. If it does the same for setting and
testing, it doesn't matter (except to compromise the crack-resistance
of the password).
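
To make the set/test point concrete, here is a minimal sketch. The
squeeze() helper and the unsalted SHA-256 digest are assumptions made
for illustration only, not any particular system's behaviour; a real
password store would use a salted, deliberately slow hash.

```python
import hashlib
import re


def squeeze(password: str) -> str:
    """Trim both ends and collapse each run of whitespace to one space."""
    return re.sub(r"\s+", " ", password.strip())


def digest(password: str) -> str:
    # Unsalted SHA-256 keeps the sketch short (illustration only).
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


typed = "  correct   horse  "

# Same transformation when setting and when testing: the login works.
stored = digest(squeeze(typed))
consistent_ok = digest(squeeze(typed)) == stored

# Squeezed when setting, raw when testing: the user is locked out.
mismatch_ok = digest(typed) == stored

# The price of squeezing: distinct passwords collapse into one.
collision = squeeze("a  b") == squeeze("a b")

print(consistent_ok, mismatch_ok, collision)
```

The third check is the crack-resistance cost: once whitespace is
squeezed, extra spaces no longer add anything to the password.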

> fixed-width column of a table in a database). There's also frequently no
> visual hint when entering or displaying those spaces and compression occurs

Evidence? Maybe if you're typing a password into a Word document it's
hard to count spaces, but why would you be doing that?




Re: Square Brackets with Tick

2015-08-22 Thread Julian Bradfield
On 2015-08-22, Nigel Small ni...@nigelsmall.com wrote:
 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER

 with several code points in between. According to the code point pairs in
 the first and second columns of this file, these particular brackets should
 be paired as the *first and fourth* and the *third and second*. Intuitively
 however, these would actually be *first and second* and *third and fourth*
 if one is to expect consistency.

That's a strange intuition! Mathematical brackets are expected to pair
with left-right symmetry, not rotational symmetry. As in, for example,
floor and ceiling brackets. The pairing in the file is the natural one.
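
A small sketch that checks the left-right reading against the four
lines quoted above. It assumes only Python's standard unicodedata
module; the ROWS string is copied from the quoted data (comments
dropped), and corner() is a helper invented for this example.

```python
import unicodedata

# The four quoted lines: code point; paired code point; open/close type.
ROWS = """\
298D; 2990; o
298E; 298F; c
298F; 298E; o
2990; 298D; c
"""

pairs = {}
for line in ROWS.splitlines():
    cp, mate, kind = (field.strip() for field in line.split(";"))
    pairs[chr(int(cp, 16))] = (chr(int(mate, 16)), kind)


def corner(ch: str) -> str:
    """Return 'TOP' or 'BOTTOM', taken from the character's Unicode name."""
    return unicodedata.name(ch).split(" IN ")[1].split()[0]


for ch, (mate, kind) in pairs.items():
    # The pairing is symmetric: the mate's mate is the original bracket.
    assert pairs[mate][0] == ch
    # An opening bracket is paired with a closing one, and vice versa.
    assert {kind, pairs[mate][1]} == {"o", "c"}
    # Mirror images keep the tick in the same corner.
    assert corner(ch) == corner(mate)
```

Each bracket is paired with the one whose tick is in the same corner,
which is what left-right mirroring gives.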

 1. The current pairing information is correct and the sequence is irregular
 for some historical reason

That will be the explanation. There is no inherent meaning to the
order of codepoints, it's just convenience.
One of the experts here can probably tell us why these four brackets
happen to be coded in this order.




Re: Windows keyboard restrictions

2015-08-06 Thread Julian Bradfield
On 2015-08-06, Richard Wordingham richard.wording...@ntlworld.com wrote:
 That depends on the availability of Tavultesoft Keyman.  The UK has been
 discussing whether a certain user-perceived character should be encoded
 as a single character in a new script.  Users ought to have this
 character on their keyboards, but there is a worry about technical
 problems if it is encoded as a sequence of three characters, i.e. six
 UTF-16 code units.  If Windows easily supports a ligature of six UTF-16
 code units, then one argument for encoding it is eliminated.

Unicode is supposed to be for the (sadly probably rather short) life
of human civilization, until we have no more need for text. Using an
ephemeral property of an ephemeral operating system for ephemeral
computers in an encoding argument makes no sense.
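
For reference, the "three characters, i.e. six UTF-16 code units"
arithmetic in the quoted paragraph holds for any characters outside the
BMP. A quick check, using three Deseret letters purely as stand-ins
(they are not the script under discussion):

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-le")) // 2


# Stand-in supplementary-plane characters (Deseret), one surrogate pair each.
supplementary = "\U00010400\U00010401\U00010402"
# Three BMP characters for comparison (Armenian letters).
bmp = "\u0561\u0562\u0563"

print(utf16_units(supplementary))  # three characters, two units each
print(utf16_units(bmp))            # three characters, one unit each
```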




a mug

2015-07-11 Thread Julian Bradfield
I feel the following mug says something about a popular topic of
debate on this list...


http://www.redbubble.com/people/insider/works/15315362-i-3-unicode

(do look at the picture, don't just infer from the url)




Re: Tag characters and in-line graphics (from Tag characters)

2015-06-02 Thread Julian Bradfield
On 2015-06-02, William_J_G Overington wjgo_10...@btinternet.com wrote:
  take place if people on this mailing list feel that it is a good 
 solution to the problem raised in section 8 of the following document.
 http://www.unicode.org/reports/tr51/tr51-2.html

That section does not raise a problem. It says what the solution to
the emoji problem is: namely that people who want to embed graphics in
text should fix their protocols to allow it, instead of subverting
Unicode to do it.




Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp

2015-03-29 Thread Julian Bradfield
On 2015-03-29, Johnny Farraj johnnyfar...@yahoo.com wrote:
 Michael,
 Thanks for the swift response, and your interest.
 Your collaboration is greatly appreciated.
 Do you have any experience in submitting new Unicode character proposals?
 And/or with creating the reference copy  of a symbol in the format required?

I think Michael should print this out and frame it, as a modern equivalent
of the slave muttering memento mori ...




___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Why is BN weak?

2015-01-02 Thread Julian Bradfield
I've been perusing the Bidi Algorithm, and I am wondering why the
Boundary Neutral class is classified as a weak class rather than a
neutral class. Can somebody explain?



Re: Colour font, color font, colourfont, colorfont

2014-03-14 Thread Julian Bradfield
On 2014-03-14, Alex Plantema alex.plant...@xs4all.nl wrote:
 Colouri(z|s)e isn't in my dictionary; colour is already a verb as well.

Get a better dictionary. The word has been in the language more than
four hundred years. It currently has a fairly common technical meaning
of adding colour to old monochrome photos or films. In any case, you
don't need a dictionary, because -ize is a productive formation.

 Btw, font is spelled fount in British English.

Suggest you don't propound on languages other than your own.
That used to be true in the days of metal type, although even so both
spellings have been in use through the last few centuries. Then,
fount was a technical term that few people would have cause to use.
With the advent of computers, the font spelling has completely
supplanted the fount spelling in everyday usage.
Within the industry, some current British letterpress printers use
fount for metal type and font for digital type, while others use
font for both.




Armenian typography, letter t

2013-10-19 Thread Julian Bradfield
This isn't really a Unicode question at all, but I know this list is
probably the best place to find the answer;-)

I'm putting a snippet of Armenian into a lecture slide, and the font
easily available to me renders the letter տ ARMENIAN SMALL LETTER TIWN
with a descending middle vertical, as indeed does the font in which my
editor is currently displaying.
However, the fonts my browser uses, and the few examples of printed
Armenian I have to hand, don't have a descender.

Is this a completely free stylistic variation, or does the choice have
any connotations? (I imagine it's to make it more distinct from ա AYB.)






Re: Posting Links to Ballots (was: RE: Why blackletter letters?)

2013-09-12 Thread Julian Bradfield
On 2013-09-11, Whistler, Ken ken.whist...@sap.com wrote:

[ lots ]

Thank you for that explanation!

 Draft additional repertoire for ISO/IEC 10646:2014 (4th edition) (WG2 N4459)
 http://www.unicode.org/L2/L2013/13151-n4459.pdf

Interesting. I see that disunification of the remaining IPA greek
letters is proceeding by stealth - we have latin chi thanks to German
dialectologists, and latin beta thanks to Gabonese. My question is,
why should they not be used for IPA ?
Now all we need is latin theta and latin upsilon (proper one, rather
than the bizarrely named ʊ) and we're done!






IPA Greek (was Re: Posting Links to Ballots (was: RE: Why blackletter letters?))

2013-09-12 Thread Julian Bradfield
On 2013-09-12, Michael Everson ever...@evertype.com wrote:
 On 12 Sep 2013, at 09:07, Julian Bradfield jcb+unic...@inf.ed.ac.uk wrote:
 Interesting. I see that disunification of the remaining IPA greek letters is 
 proceeding by stealth -

 No, Julian. It's by design. Only theta remains. 

Hm, that's not what the comments in some of the working documents
suggest :-) "not intended for use with the IPA" for chi.

 we have latin chi thanks to German dialectologists, and latin beta thanks to 
 Gabonese. My question is,
 why should they not be used for IPA ?

 I think they should. I will be taking this up with the Association. 

Then we have the problem that LATIN SMALL LETTER CHI seems to be (as
originally named) a stretched x, which is what the uvular fricative
sign *ought* to look like to be properly harmonious, but the IPA seems
determined that it looks like an upright greek chi - wrong stroke bias,
no roman serifs (in the current version), swung terminations to the
TL-BR stroke. 
You describe this in your web page, but I'm not sure what you think
the reference glyph should be: did the dialectologists use a true
stretched x? I have tried using a stretched x in my transcriptions,
and I have to say it looks weird!

 No, just theta. The bizarrely-named Latin ʊ is already in use by the 
 Association. 

Very true. Somehow I hadn't noticed that ʋ was there - and also
bizarrely named, since as PSG observes, it looks much more like upsilon
than ʊ does. Why was it called V WITH HOOK rather than SCRIPT V? Was
it for Africanist reasons?





Re: IPA Greek

2013-09-12 Thread Julian Bradfield
On 2013-09-12, Michael Everson ever...@evertype.com wrote:
 Further clarification on this point was published in 
 http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4296.pdf

Thanks, that rather more than answers everything...

 Somehow I hadn't noticed that ʋ was there - and also bizarrely named, since 
 as PSG observes, it looks much more like upsilon than ʊ does. Why was it 
 called V WITH HOOK rather than SCRIPT V? Was it for Africanist reasons?

 028A is ʊ LATIN SMALL LETTER UPSILON 
 028B is ʋ LATIN SMALL LETTER V WITH HOOK

 These are used for different sounds. I'm not sure that either name is 
 particularly bizarre. 

I know what they *mean*.
The name V WITH HOOK is strange because there is no hook in ʋ, in
any of the several other senses that HOOK is used in IPA character
names, or in any reasonable typographic sense. ʋ is usually drawn as
an upright italic v, possibly with the italic instroke at the left
replaced by a roman serif, but leaving the usual italic bulb
termination on the right. Which bit does the HOOK refer to?
Elsewhere, HOOK refers to, well, a hook stuck on to a bit of a letter,
be it the implosive hook, retroflex hook, palatal hook or rhotic hook.

The other is strange because whatever the origin of the character, it
looks like a turned small capital omega.





Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-06-20 Thread Julian Bradfield
On 2013-06-19, Richard Wordingham richard.wording...@ntlworld.com wrote:
 The X11 restriction of one character per key stroke is not so easy to
 get round.  Some applications don't cooperate with work-arounds such as
 ibus, and I find ibus unreliable enough that I want alternative methods
 for when it fails.

Can't good old-fashioned input methods handle this?





Re: Rendering Raised FULL STOP between Digits

2013-03-10 Thread Julian Bradfield
On 2013-03-10, Richard Wordingham richard.wording...@ntlworld.com wrote:
 The question is what users will demand. Expectations have been low
 enough that the loss of decimal points has been accepted.
 Additionally, striving for an apparently hard to get raised decimal
 point risks being forced to use an achievable decimal comma.

It's also true, isn't it, that even in Britain the raised point has
been discouraged in scientific contexts for a long time? I can't find
a reference right now, but I think IUPAC and IUPAP prefer a low point;
and my science textbooks, even from the 60s, use a low point -
whereas some of my maths textbooks from the 70s use a raised point (at
half M-height), which now looks odd to me, because I use the mid-point
multiplication a lot in my work.
But my son's school is still teaching a raised point in hand-writing.





Re: German »ß«

2013-02-17 Thread Julian Bradfield
On 2013-02-17, Philippe Verdy verd...@wanadoo.fr wrote:
 I was not citing empirical results but things that are regulated by 
 legislation.

No you weren't - you were making explicit claims that lowercase is
harder to read than capitals. You said nothing about regulation.

 And your existing empirical results are just informal tests ignoring
 important parts of the population of drivers, notably:

Since you aren't even aware of the existence of these reports (the
Anderson and Worboys reports in the UK, and equivalents in the US and
Germany), it's quite impressive that you know what's in them.
As one can read, the recent enforced change in the U.S. to lower-case
placenames on all signs is significantly driven by the increasing
number of elderly drivers with poorer sight.
The changes in the U.S. follow a program of research (for example by
Philip Garvey, a psychologist of vision) commissioned by the agencies
on how to make signs more readable for these drivers, amongst others.

 compensated by sufficient contrast (lowercase letters do not contrast
 enough, because their strokes are too near each other)

I think perhaps you should look at some letter forms, particularly in
the typefaces used for traffic signs.

 - the effect of presbyopia on vision of aging population : here again
 the size of letters does matter (look at those phones sold to ages

Road signs are usually not in front of one's nose!

 In all these cases, you need less density of strokes, and capital
 letters are better contrasting.

Could you point to anyone who has found this to be true in reality?





Re: German »ß«

2013-02-16 Thread Julian Bradfield
On 2013-02-16, Philippe Verdy verd...@wanadoo.fr wrote:
 2013/2/16 Stephan Stiller stephan.stil...@gmail.com:
 Of course in my worldview, all-caps writing is deprecated :-)

 This is a presentation style which makes words more readable in some
 conditions, notably on plates displayed on roads (cities are extremely
 rarely written in lowercase, as this is more difficult to read from
 far away when driving). 

Half a century ago, the UK, after extensive empirical testing,
mandated mixed case for road signs because it is significantly easier
to read at speed.
Our cousins across the Atlantic have finally caught on, and 
the U.S. Federal Highway Administration now mandates mixed case for
place names, while leaving fixed wording in all caps.







Re: German »ß«

2013-02-16 Thread Julian Bradfield
On 2013-02-17, Philippe Verdy verd...@wanadoo.fr wrote:
 True lowercase letters are causing problems on road sign indicators on
 roads with high speed : they are hard to read and if the driver has to
 look at them for one more second, he does not look at the road.

AS I SAID, empirical evaluation by those who had good cause to care
about the issue indicates the opposite, that people take longer to
read all caps (as is also the case in normal text).
This evaluation was done specifically for high speed roads. It
included live testing on one motorway.





Re: Public Review Issue 232 Proposed Update UAX #9, Unicode Bidirectional Algorithm (Copy of email sent to the list; also posted by me to unicode feedback/public review issue-- but this has not yet po

2013-02-03 Thread Julian Bradfield
On 2013-02-02, Stephan Stiller stephan.stil...@gmail.com wrote:
 And sometimes there is no absorption but simply a hard constraint 
 against semantic cooccurrence [sic about oo, which is really the 

All of which may be ignored by people with mathematical or programming
training! One of the advantages of the demise of copy-editors in
scholarly publishing is that there's no longer anybody to interfere
with one's logical punctuation.

 What I just wrote in my other email
  [...] but (as most people here will know), there it has a 
 different function.
 is actually a punctuation mistake (there is descriptively no room to 
 maneuver here): with the parenthetical phrase, there is a strong need 
 for a comma before there (though there's a bit of wiggle room wrt 

But as in many cases where neither option seems quite right, there's a
third option that's better than either. Had you marked the parenthesis
with commas instead of parens, as would be usual in non-technical
writing, there would be no problem.

 But everyone is familiar with the much more common case of one wanting 
 to write (,  (and space absorption doesn't work here) or ,) in lists 
 with a parenthetical element.

I wondered how familiar I was, and couldn't come up with an example!
Do you have a real-life example? (In non-technical English rather than
Englished mathematics.)





Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-02 Thread Julian Bradfield
On 2013-02-02, Richard Wordingham richard.wording...@ntlworld.com wrote:
 On Fri,  1 Feb 2013 23:51:34 + (GMT)
 Julian Bradfield jcb+unic...@inf.ed.ac.uk wrote:
 ...
 But if you use a member of the Keyman family of input methods (I've
 been using Keyman for Linux (KMFL)), you can set up a keyboard so you
 just enter that using XSAMPA keystrokes, e.g.

I never got round to learning SAMPA; I either use a standard input
method for the relevant language, or I use my own mnemonic system.
I only do non-trivial typing in Emacs, so I don't worry about X input
methods.

 you have to remember to type a_L_k to get the NFC form à̰ rather than
 a_k_L, which delivers the NFD form à̰, but do you not have to remember
 the order of diacritics anyway? Simple codepoint-sequence based
 searching only works if diacritics are in the correct order.

Well, as it happens, the diacritic-heavy orthography has been
displaced by one that only uses tone diacritics (and those are used
only in the dictionary). So ǂhèẽ and ǃn̥à̰ĩ would be written ǂhhèen and
nhǃàqin. I was working with older data. (And in truth, most of the
time I used a phonological markup (e.g. \DelAsp{ǂ}\Pln{è}\Nas{e} and
\VclNas{ǃ}\Phg{à}\Nas{i}) as I was working with four different
transcriptions with non-bijective mappings! But that's another matter.)

 Having set up an NFC-delivering XSAMPA-based keyboard so that it had
 rules O = ɔ, O\ = ʘ, O\\ = O, I’ve found it would be a lot more
 useful if I’d been a lot less puristic and set it up so that I had O =
 O, O\ = ɔ, O\\ = ʘ.  I use multiple backslashes to get some additional
 characters and recover ASCII, an idea I got from Martin Hosken’s IPA
 keyboard.  I’m currently pondering how to maintain puristic and
 ‘practical’ versions from the same source files. Ideally I’d also merge
 in the related Emacs keyboard definition.

Yes, I don't like switching, so I use compose sequences for phonetics.
E.g. multi-key $ c (turned c) for ɔ , multi-key p a (phonetic a) for
ɑ, and so on, with various mnemonic prefixes, shape-based rather than
function-based.
If I were doing lots of it, I'd probably use function keys as dead
keys to replace the multi-key prefix.





Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-01 Thread Julian Bradfield
On 2013-02-01, Costello, Roger L. coste...@mitre.org wrote:
 So why would one ever generate text in decomposed form (NFD)?

Text that I type is quite likely to be in decomposed (or at least not
composed) form, because I find it a lot easier to have a few keystrokes
for combining accents than to set up compose key sequences for all the
possible composed characters.
For example, 
ǂhèẽ-ǂhèẽ ǃn̥à̰ĩ-ǃn̥à̰ĩ
was part of the title of a talk. Is there a composed form of à̰? I
don't know, and don't want to!
Much easier to do searches and other text processing on it, too.
(The current dictionary project for this language uses NFD in its data
files, too.)
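
For anyone who does want to know what the normalizer makes of à̰, here
is a short check. It uses only Python's standard unicodedata module and
assumes the vowel is typed as a plus COMBINING GRAVE ACCENT plus
COMBINING TILDE BELOW, in either order.

```python
import unicodedata

GRAVE = "\u0300"        # COMBINING GRAVE ACCENT, combining class 230
TILDE_BELOW = "\u0330"  # COMBINING TILDE BELOW, combining class 220

typed_one_way = "a" + GRAVE + TILDE_BELOW
typed_other_way = "a" + TILDE_BELOW + GRAVE

# Both typing orders normalize to the same sequences.
nfd = unicodedata.normalize("NFD", typed_one_way)
nfc = unicodedata.normalize("NFC", typed_one_way)
same_nfd = unicodedata.normalize("NFD", typed_other_way) == nfd
same_nfc = unicodedata.normalize("NFC", typed_other_way) == nfc

for label, text in (("NFD", nfd), ("NFC", nfc)):
    print(label, [f"U+{ord(c):04X}" for c in text])
```

NFD puts the marks in combining-class order (the mark below first);
NFC then folds the grave into a precomposed à and leaves the tilde
below as a separate mark, so even the "composed" form is two code
points.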







Re: OT: Need howto info for typing accented chars on US keyboard in Linux

2013-01-13 Thread Julian Bradfield
On 2013-01-13, Leslie Turriff jlturr...@centurytel.net wrote:
   I've been searching the web for information about how to type accented 
 characters (French) using a US 104-key keyboard.  I understand that a compose 
 key is involved, but everything I've found so far has involved adding 
 character=key mappings using xmodmap, whereas it appears that one does not 
 need to do that; but there seems to be an assumption that just saying press 
 the compose key and... magic happens.

As others have said, most modern distributions are set up to do this
by default.
The thing to beware of is that GTK+, the toolkit used by many
applications such as Firefox etc., does its own thing with compose
processing, rather than relying on the underlying X processing.
(Sadly typical of GTK+.)
So if you succeed in working out how to change the X mappings (which
is not trivial), it won't work with GTK+ applications.
Emacs also has its own facility, but it also uses the native X compose
mappings before its own.

In summary, if your needs are just commonplace accents, it should just
work, and in most cases the compose sequences are obvious, and if
they're not, they're in the Wikipedia article.






Re: I missed my self-imposed deadline for the Mayan numeral proposal

2012-12-21 Thread Julian Bradfield
On 2012-12-21, Clive Hohberger cp...@case.edu wrote:
 Don't worry, I think you now have another 5351 years until the next Mayan
 Doomsday...

It's only 394 years till the next b'ak'tun.
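
The arithmetic, as a sketch (the mean Gregorian year length is the only
figure assumed beyond the Long Count units themselves):

```python
# Long Count units: 20 k'in = 1 winal, 18 winal = 1 tun,
# 20 tun = 1 k'atun, 20 k'atun = 1 b'ak'tun.
DAYS_PER_BAKTUN = 20 * 18 * 20 * 20
DAYS_PER_YEAR = 365.2425  # mean Gregorian year

years = DAYS_PER_BAKTUN / DAYS_PER_YEAR
print(DAYS_PER_BAKTUN, round(years, 2))
```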





Re: Caret

2012-11-12 Thread Julian Bradfield
On 2012-11-12, QSJN 4 UKR qsjn4...@gmail.com wrote:
 A caret is a flashing line, block, or other picture in the client area
 of a window, it indicates the place (between two characters) at which
 text will be inserted (or the edge of the text to be selected or
 deleted). What does it mean? Between? There is no between in the
 bidirectional text, the previous and the next character are not
 necessary nearby! There is no distinct place for the insertion because
 the place depends of direction of the inserted characters.

What exactly a caret/cursor means is dependent on the system. On my
editor, the cursor is not logically *between* characters, it's
logically *on* a character, so there is no ambiguity.

 Other example, without bidi. You see:
 pré|sent
 Where is your caret? After é? after  ́ ? between e and  ́ ? Press

Again, it depends. A user-oriented editor will treat é as a single
unit anyway, for text manipulations. In my programmer-oriented editor,
when the cursor is on e or  ́, the two codepoints are displayed
separately instead of combined, so again there is no ambiguity.
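
The two views of "présent" can be sketched as follows. The clusters()
function is a deliberate simplification written for this example (a
base code point plus any following combining marks), not the full
default grapheme cluster rules, and it assumes the é is stored in
decomposed form.

```python
import unicodedata


def clusters(text: str) -> list:
    """Group each base code point with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the mark to the preceding unit
        else:
            out.append(ch)
    return out


word = "pre\u0301sent"  # e followed by COMBINING ACUTE ACCENT

code_point_positions = len(word)          # what a programmer's editor steps over
user_unit_positions = len(clusters(word))  # what a user-oriented editor steps over
print(code_point_positions, user_unit_positions)
```

A cursor that sits on a unit, rather than between two, only has to
decide which of these two lists it is indexing.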







Re: texteditors that can process and save in different encodings

2012-10-04 Thread Julian Bradfield
On 2012-10-04, Stephan Stiller stephan.stil...@gmail.com wrote:
 In your experience, what are the best (plaintext) texteditors or word
 processors for Linux / Mac OS X / Windows that have the ability to *save* in
 many different encodings?

Emacs doesn't suit your needs?





Re: A strange symbol in a Soviet calendar

2012-09-07 Thread Julian Bradfield
On 2012-09-04, Leo Broukhis l...@mailcom.com wrote:
 My question is about the symbol before the name Уот. Has anyone seen
 it before? Is it a NE arrow in a square or a spade? What does it mean?

Might it simply be an arbitrary dingbat used to separate the list of
associated saints from the list of revolutionary heroes? Old vs new
saints:-)





Re: User-Hostile Text Editing (was: Unicode String Models)

2012-07-22 Thread Julian Bradfield
On 2012-07-21, Richard Wordingham richard.wording...@ntlworld.com wrote:
 Are there any widely available ways of enabling the deleting of the
 first character in a default grapheme cluster?  Having carefully added
 two or more marks to a base character, I find it extremely irritating
 to find I have entered the wrong base character and have to type the
 whole thing again. As one can delete the last character in a cluster,
 why not the first? It's not as though the default grapheme cluster is
 usually thought of as a single character.

What do you mean by widely available?
A decent editor should let you choose whether to break apart clusters
or not. I presume that such editors exist! (Mine always breaks
clusters, but that's because I'm the only user, and I don't care
enough to implement clustering;-) Yudit might be one, but since it
seems to have no documentation, I can't tell.
If yours doesn't, then get on to its authors!






Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-17 Thread Julian Bradfield
On 2012-07-16, Philippe Verdy verd...@wanadoo.fr wrote:
 I am also convinced that even Shell interpreters on Linux/Unix should
 recognize and accept the leading BOM before the hash/bang starting
 line (which is commonly used for filetype identification and runtime
 behavior), without claiming that they don't know what to do to run the
 file or which shell interpreter to use.

Do you think they should also recognize and accept ISO-2022 escape
sequences before the hashbang? If not, why not?
The kernel doesn't know or care about character sets. It has a little
knowledge of ASCII (or possibly EBCDIC) hardwired, but otherwise it deals
with 8-bit bytes. It has no concept of text file.
A file to be interpreted by a hashbang could in principle contain
arbitrary binary stuff, be that text in multiple encodings or just
binary data. That stuff belongs to the input to the interpreter, not
to the hashbang line: that line contains a filename which is not
interpreted in any extended charset.
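
At the byte level the point looks like this. The sketch assumes, as I
understand the usual implementation, that the kernel's test amounts to
checking that the file starts with the two bytes 0x23 0x21; the script
text is made up for the example.

```python
script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")
with_bom = script.encode("utf-8-sig")  # same text, preceded by a UTF-8 BOM

# Without a BOM the file starts with '#' '!'.
print(plain[:2])

# With a BOM the first three bytes are EF BB BF, so a check of the
# first two bytes no longer finds the hashbang.
print(with_bom[:3], with_bom[:2] == b"#!")
```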







Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-13 Thread Julian Bradfield
On 2012-07-12, Michael Everson ever...@evertype.com wrote:
 On 12 Jul 2012, at 22:20, Julian Bradfield wrote:

 But wanting to do so would be crazy. My mu-nu ligature is, as far as I know, 
 used only by me (and co-authors who let me do the typesetting), and so if 
 Unicode has any sanity left, it would not encode it.

 Is it in print? 

Of course it's in print. The true ligature is only in the tech reports
and preprints that I produced myself (e.g.
http://www.lfcs.inf.ed.ac.uk/reports/98/ECS-LFCS-98-385/index.html ).
The journal versions have a hacked symbol which is just mu nu kerned
to overlap appropriately. Sadly, this was before the days when
TeX systems were sufficiently well standardized that one had a
fighting chance of including fonts with the papers!

 My colleagues in the Edinburgh PEPA group did try to get their pet symbol 
 encoded (a bowtie where the two triangles overlap somewhat rather than just 
 touching), but were refused; although that symbol now appears in hundreds of 
 papers by dozens of authors from all over the world.

 If so, then it should be encoded. 

The relevant person is on holiday at the moment, but I'll find out
from him the real story of the symbol. I think this was before the
supplementary planes opened up.






Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-13 Thread Julian Bradfield
On 2012-07-13, Michael Everson ever...@evertype.com wrote:
 On 13 Jul 2012, at 11:07, Julian Bradfield wrote:

 So... U+1D7CC MATHEMATICAL ITALIC SMALL MU NU LIGATURE, since it's published 
 and (assuming the work is worthy; I cannot judge) might be cited by others.

It *might*, by some hapless master's student regurgitating the
proof. But that doesn't mean it should be encoded. It's an ad hoc
symbol, local to a particular series of papers (in so far as there is
a customary symbol, it's \sigma, but in that series of papers I needed
\sigma for another purpose), and within those papers, local to the
proof of particular theorems.
Anybody who reads the papers with understanding will realize that, and
therefore feel free to use any other symbol that is convenient to
them, if they don't feel like putting together a mu-nu symbol.
Once, I used $\mu \atop \nu$ (a small mu on top of a small nu) instead
-- that's in print too, in a very expensive book! Would you want to
encode that too?

If you're looking for more characters to encode, I'd rather see
COMBINING DOUBLE TILDE BELOW
which is used in Ladefoged and Maddieson, The Sounds of the World's
Languages, to denote the strident vowels of Khoisan languages, and
which I therefore use in my work too.





Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-12 Thread Julian Bradfield
[ Please don't copy me on replies; the place for this is the mailing
  list, not my inbox, unless you want to go off-list. ]

On 2012-07-11, Hans Aberg haber...@telia.com wrote:

> Unicode has added all the characters from TeX plus some, making it
> possible to use characters in the input file where TeX is forced to
> use ASCII. This though changes the paradigm, and it is a question of
> which paradigm one wants to adhere to.

This doesn't seem to make much sense, or have much truth, to me.

TeX does not have a notion of character in the Unicode sense. TeX is a
(meta-)programming language for putting ink on paper. It ultimately
produces instructions of the form "print glyph 42 from font cmr10 at
this position". It does not know or care whether the glyph happens to
be a representation of some Unicode character. (It also isn't tied to
ASCII for its input - when I first used TeX, it was on an EBCDIC
system.)

There are many characters that TeX users use that are not in
Unicode. Indeed, you can't even correctly represent the name of the
system in Unicode, or any other plain text system - an entirely
deliberate choice by Knuth to emphasise that TeX is a typesetting
program, not a text representation format.

Because TeX is agnostic about such matters, one can set up any
convenient encoding for the input data (which is really the source
code of a program). For example, I have written documents in ASCII,
Latin-1, Big5, GB, UTF-8 and probably others. This is very convenient;
but it's only a convenience.

If one uses UTF-8, then one has the problem of how to deal with the
case where Unicode trespasses on TeX's territory, by specifying font
styles. 
This is not hard: for example, the obvious thing to do is to
arrange for the Unicode MATHEMATICAL ITALIC SMALL M to be an
abbreviation for \mathit{m}, and so on.
Note, incidentally, that this is not the same as the meaning of a
plain ASCII (or EBCDIC) "m" in TeX. In TeX math mode, the meaning of
"m" is dependent on the currently selected math font family: just as
in plain text, the font of "m" depends on the currently selected
text font.
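
As a rough illustration of that input convention (a sketch only, and in
Python rather than TeX; the function name math_italic_to_tex is made up
for this example), a preprocessor could rewrite the Unicode mathematical
italic Latin letters into \mathit{...} before TeX ever sees them. It
leans on the <font> compatibility decomposition that the Unicode
Character Database records for these characters:

```python
import unicodedata


def math_italic_to_tex(text: str) -> str:
    """Rewrite Unicode mathematical italic Latin letters as mathit macros.

    Hypothetical helper: every other character is passed through untouched.
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        decomp = unicodedata.decomposition(ch)  # e.g. "<font> 006D"
        if name.startswith("MATHEMATICAL ITALIC ") and decomp.startswith("<font> "):
            base = chr(int(decomp.split()[1], 16))
            # Only handle plain Latin letters; Greek etc. is left alone here.
            if base.isascii() and base.isalpha():
                out.append("\\mathit{" + base + "}")
                continue
        out.append(ch)
    return "".join(out)


# U+1D45A is MATHEMATICAL ITALIC SMALL M; the ASCII m is not rewritten.
print(math_italic_to_tex("\U0001D45A = 2m"))  # \mathit{m} = 2m
```

The plain ASCII m is deliberately left alone, so TeX remains free to set
it according to whatever math font family is current.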

One problem, of course, is that there is no MATHEMATICAL ROMAN set of
characters. This is one of the biggest botches in the whole
mathematical alphanumerical symbol botch. If you encode semantic font
distinctions without requiring the use of higher-level markup, then
you need to encode also letters that are semantically distinctively
roman upright. The square root of -1 cannot be italicized in the
statement of a theorem, unlike all the "i"s that appear in the text of
the theorem. Yet Unicode provides no way to mark this semantic
distinction between the characters, and has to rely on the
higher-level markup distinguishing maths (to which some font style
changes should not be applied) from text (in which they should).
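
The asymmetry can be checked against the character database that ships
with Python (a small sketch; char_exists is a helper invented here, and
the exact data depends on the Unicode version your Python was built
with, though these characters have been stable for a long time). There
is an italic mathematical i, it folds to plain "i" under compatibility
normalization, and there is nothing named MATHEMATICAL ROMAN to pair
with it:

```python
import unicodedata

italic_i = "\U0001D456"  # italic i from the Mathematical Alphanumeric Symbols block


def char_exists(name: str) -> bool:
    """Return True if some Unicode character has exactly this name."""
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False


print(unicodedata.name(italic_i))                  # MATHEMATICAL ITALIC SMALL I
print(unicodedata.normalize("NFKC", italic_i))     # i
print(char_exists("MATHEMATICAL ITALIC SMALL I"))  # True
print(char_exists("MATHEMATICAL ROMAN SMALL I"))   # False
```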

A more general problem is that which font styles are meaningful,
depends on the document. For example, I give lectures and talks, and I
set my slides in sans-serif. As I don't (usually) use distinctive
sans-serif symbols in my work, the maths is all in sans-serif
too: form, not content. But what then should I see if I type a Unicode
mathematical italic symbol in my slides? Serif, or sans-serif?





Re: UTF-8 BOM (Re: Charset declaration in HTML)

2012-07-12 Thread Julian Bradfield
On 2012-07-12, Steven Atreju snatr...@googlemail.com wrote:
 In the future simple things like '$ cat File1 File2 > File3' will
 no longer work that easily.  Currently this works *whatever* file,
 and even program code that has been written more than thirty years
 ago will work correctly.  No!  You have to modify content to get it
 right!!

Nice rant, but actually this has never worked like that. You can't cat
.csv files with headers, html files, images, movies, or countless
other "just files" and get a meaningful result, and never have been
able to.
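
To make that concrete, here is a small sketch (the file contents are
invented, and cat below is just a stand-in for the shell command) that
concatenates two CSV files with headers, and then two UTF-8 files that
each start with a BOM. In both cases the bytes join up fine and the
result is still not what a consumer of the format expects:

```python
import codecs


def cat(*blobs: bytes) -> bytes:
    """Byte-wise concatenation, which is all that cat does."""
    return b"".join(blobs)


# Two CSV files, each with its own header row (made-up data).
csv1 = b"name,qty\nfoo,1\n"
csv2 = b"name,qty\nbar,2\n"
rows = cat(csv1, csv2).decode("ascii").splitlines()
print(rows)  # the header turns up again as a bogus data row

# Two UTF-8 files, each starting with a byte order mark.
file1 = codecs.BOM_UTF8 + "one\n".encode("utf-8")
file2 = codecs.BOM_UTF8 + "two\n".encode("utf-8")
joined = cat(file1, file2).decode("utf-8-sig")  # strips the leading BOM only
print(repr(joined))  # the second BOM survives as U+FEFF in mid-stream
```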






Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-12 Thread Julian Bradfield
On 2012-07-12, Hans Aberg haber...@telia.com wrote:
 There are many characters that TeX users use that are not in
 Unicode.

 All standard characters from TeX, LaTeX, and AMSTeX should be there,

What's a standard character? There's no such thing.
To take a random entry from the LaTeX Symbol Guide, where is the
\nrightspoon symbol from the MnSymbol package? (A negated multimap
symbol.)

Not to mention the symbols I've used from time to time, because

 them. In math, you can always invent your own characters and styles,

people do.

 in fact you could do that with any script, but it is not possible
 for Unicode to cover that. There are though a public use area, where
 one can add ones own characters. 

You mean private use. Crazy thing to do, because then you have to
worry about whether your PUA code point clashes with some other
author's PUA code point.
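
A sketch of the clash, with two invented private assignments (the
dictionaries below are hypothetical; nobody has registered these
anywhere). Nothing in the standard can tell the two uses of U+E000
apart, because a private use code point has no name or properties
beyond its general category:

```python
import unicodedata

pua = "\uE000"  # first code point of the BMP Private Use Area

print(unicodedata.category(pua))    # Co (private use)
print(unicodedata.name(pua, None))  # None: the standard assigns no name

# Hypothetical private agreements made independently by two authors.
author_a = {0xE000: "mu-nu ligature"}
author_b = {0xE000: "overlapping bowtie"}

clashes = sorted(author_a.keys() & author_b.keys())
for cp in clashes:
    print(f"U+{cp:04X}: {author_a[cp]!r} versus {author_b[cp]!r}")
```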

 Because TeX is agnostic about such matters, one can set up any
 convenient encoding for the input data (which is really the source
 code of a program). For example, I have written documents in ASCII,
 Latin-1, Big5, GB, UTF-8 and probably others. This is very convenient;
 but it's only a convenience.

 UTF-8 only is simplest for the programmer that has to implement it.

Some of us are more concerned with users than programmers. Besides, all
the work for the legacy encodings has already been done. I wouldn't
ever want to go back to ISO alphabet soup for Latin etc., but for
CJK, the legacy codings are still sometimes convenient - for example,
if I write in Big5, I don't have to worry about telling my editor to
find a traditional Chinese font rather than a simplified or Japanese
font. It uses a Big5 font, and that's it.

 LuaTeX and the older XeTeX support UTF-8. They are available in TeX Live.
   http://www.tug.org/texlive/

They aren't TeX. Neither working mathematicians nor publishers nor
typesetters like dealing with constantly changing extensions and
variations on TeX - one of the biggest selling points of TeX is
stability. (Defeated somewhat by the instability of LaTeX and its
thousands of packages, but that's another story.)
If I need to write complex - or even bidi - scripts routinely, I'd
probably be forced into one of them; but the typical mathematician
doesn't.

 
 One problem, of course, is that there is no MATHEMATICAL ROMAN set of
 characters. This is one of the biggest botches in the whole
 mathematical alphanumerical symbol botch.

 This was discussed here before; the LaTeX unicode-math package has options to 
 control that (see its manual). For example, one gets a literal interpretation 
 by:

Exactly. TeX can do what it likes. But you said it was an incompatibility
with Unicode that TeX sets plain ASCII math letters as italic,
implying that TeX should not be allowed to do what it likes.

 If you encode semantic font
 distinctions without requiring the use of higher-level markup, then
 you need to encode also letters that are semantically distinctively
 roman upright.

 It has already been encoded as mathematical style, see the Mathematical 
 Alphanumeric Symbols here:
   http://www.unicode.org/charts/

*You* look. The plain upright style is unified with the BMP characters.

 A more general problem is that which font styles are meaningful,
 depends on the document. For example, I give lectures and talks, and I
 set my slides in sans-serif. As I don't (usually) use distinctive
 sans-serif symbols in my work, the maths is all in sans-serif
 too: form, not content. But what then should I see if I type a Unicode
 mathematical italic symbol in my slides? Serif, or sans-serif?


 It is up to you. The unicode-package, mentioned above, has options to control 
 that.

Of course it's up to me. I'm glad you agree. So why say that it's an
incompatibility with Unicode that TeX (by default) displays ASCII as
italic in maths? Are you changing your mind on that? I welcome that if
so, as that was what I found surprising.

(And, of course, it's much easier to use the established TeX
mechanisms for controlling these things, than to learn more options
for a package to allow me to use symbols that are hard to type and
even harder to distinguish clearly on screen.)

 It is traditional in pure math, and also in the physics books I have looked 
 into, to always use serif. Possibly sans-serif belongs to another technical 
 style. Unicode makes it possible to mix these styles on the character level, 
 if you so will.

It's also traditional, for mostly good reasons to do with the limited
resolution of projectors, to use sans-serif in presentations. The only
reason that most people still have serifed maths is that they don't
know how to do otherwise (\usepackage{cmbright} is enough for most
people, if only they knew).





Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-12 Thread Julian Bradfield
On 2012-07-12, Hans Aberg haber...@telia.com wrote:
 On 12 Jul 2012, at 12:33, Julian Bradfield wrote:
 In practice, no working mathematician is going to use the mathematical
 alphanumerical symbols to write maths in (La)TeX, because it's
..
 the Unicode mathematical symbol model does not match how one uses
 mathematical symbols.

 It is used by proof assistants such as Isabelle, and also in logic.
   https://en.wikipedia.org/wiki/Isabelle_(proof_assistant)

No it isn't. Isabelle uses (essentially) TeX control sequences
internally, though it writes them as \<oplus> rather than \oplus.
A small number of these are mapped to Unicode code points for display
and input purposes, and that small number does not include any of the
mathematical alphanumerical symbols block.

 If your only objective is to achieve a rendering for humans to read, TeX is 
 fine, but not if one wants to communicate semantic information on the 
 computer level.

On the contrary, computers are very happy with TeX notation. There are
several useful mathematical online learning sites (such as, for
example, Alcumus) which use TeX syntax to interact with the students.






Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-12 Thread Julian Bradfield
Hans wrote:
On 12 Jul 2012, at 15:54, Julian Bradfield wrote:
..
 Not to mention the symbols I've used from time to time, because

You tell me, because I posted a request for missing characters in different 
forums. Perhaps you invented it after the standardization was made?

Why on earth would I care about whether my pet symbol (a mu-nu
ligature, which I started using to stand for mu or nu as appropriate
when I ran out of other plausible letters for it) is in Unicode? It
would be crazy to put it there, and of precious little benefit to me,
since I don't wish to write web pages about this stuff.

 them. In math, you can always invent your own characters and styles,
 people do.
You and others knowing about those characters must make proposals if you want 
to see them as a part of Unicode.

But wanting to do so would be crazy. My mu-nu ligature is, as far as I
know, used only by me (and co-authors who let me do the typesetting),
and so if Unicode has any sanity left, it would not encode it. My
colleagues in the Edinburgh PEPA group did try to get their pet symbol
encoded (a bowtie where the two triangles overlap somewhat rather than
just touching), but were refused; although that symbol now appears in
hundreds of papers by dozens of authors from all over the world. (I
think they wanted it so they could put it on web pages, which they
have lots of.)

Putting a symbol into Unicode imposes a huge burden on thousands of
people. Everybody who thinks it important to be able to display all
Unicode characters (or even all non-Han characters) has to make sure
that their font has it, or that the distribution they package has it,
or that all the software in the world knows how to find a font that
has it. Such effort is entirely inappropriate for symbols used ad hoc
by a small community, who are communicating in any case via either
fully typeset documents or by TeX pseudocode - or, on occasion, with
real TeX and a suitable font definition.

 You mean private use. Crazy thing to do, because then you have to
 worry about whether your PUA code point clashes with some other
 author's PUA code point.

There is some system for avoiding that. Perhaps someone else here can inform.

There are many such systems - I don't need help or advice on this
matter. But none of them is appropriate for a symbol that perhaps you
want only for a few papers.

 UTF-8 only is simplest for the programmer that has to implement it.
 Some of us are more concerned with users than programmers.
Well, if the programmers don't implement, you are left out in the cold.

I'm not - if I care enough, I'll do it myself. Although most of my
work has actually been implementing utf-8 - as I said, the legacy
encodings are usually already done.

 Neither working mathematicians nor publishers nor
 typesetters like dealing with constantly changing extensions and
 variations on TeX - one of the biggest selling points of TeX is
 stability. (Defeated somewhat by the instability of LaTeX and its
 thousands of packages, but that's another story.)
 If I need to write complex - or even bidi - scripts routinely, I'd
 probably be forced into one of them; but the typical mathematician
 doesn't.

I do not see your point here.

The point is that you don't use unstable rapidly changing systems for
anything that has an expected life of more than a year or two; and if
you're planning for somebody else to use it, you try to give them
something that runs on systems at least ten years older than yours.

No. TeX cannot handle UTF-8, and I recall LaTeX's capability to emulate that 
was limited.

Somewhat limited, but good enough for every purpose I've so far needed
(maths, phonetics; and European, Indic, Chinese, Hebrew languages in
small snippets rather than entire documents). The main annoyance is
that combining character support is clunky, and that TeX really
doesn't support bidi properly - as I said - though it's remarkable
what hacking can be done.

 you need to encode also letters that are semantically distinctively
 roman upright.
 
 It has already been encoded as mathematical style, see the Mathematical 
 Alphanumeric Symbols here:
  http://www.unicode.org/charts/
 
 *You* look. The plain upright style is unified with the BMP characters.

Yes, that is why the Unicode paradigm departs from the TeX one.

This is as bad as Naena Guru... Unicode characters are
fontless. They are plain text. The Unicode standard even has a
nice little picture (Figure 2-2) showing how roman A, squashed A, bold
italic A, script A, fancy A, sans-serif A, brush-stroke A, fancy
script A, and versal capital A are all just LATIN LETTER A.

Now, in response to the desire of some mathematicians (maybe) to
write webpages without having to use clunky HTML markup (which is even
worse to use than TeX's), Unicode saw fit to encode characters such as
MATHEMATICAL BOLD ITALIC CAPITAL A.
This is not a logical problem: that character is distinguished from
LATIN LETTER A by the fact that its acceptable glyph variants

Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-12 Thread Julian Bradfield
On 2012-07-12, Hans Aberg haber...@telia.com wrote:
On 12 Jul 2012, at 16:06, Julian Bradfield wrote:
 On 2012-07-12, Hans Aberg haber...@telia.com wrote:
 On 12 Jul 2012, at 12:33, Julian Bradfield wrote:
 In practice, no working mathematician is going to use the mathematical
 alphanumerical symbols to write maths in (La)TeX, because it's
 ..
 the Unicode mathematical symbol model does not match how one uses
 mathematical symbols.
 
 It is used by proof assistants such as Isabelle, and also in logic.
  https://en.wikipedia.org/wiki/Isabelle_(proof_assistant)
 
 No it isn't.

Yes, I posted before here some example of people using it.

I beg your pardon, you're right. I didn't read it closely enough.

 Isabelle uses (essentially) TeX control sequences
 internally, though it writes them as \<oplus> rather than \oplus.
 A small number of these are mapped to Unicode code points for display
 and input purposes, and that small number does not include any of the
 mathematical alphanumerical symbols block.

You're right, it does default to using that block in Unicode mode.

Latest version requires STIXFonts to be installed. Some other proof assistants 
use it.

However, that's not true. Isabelle does not need to use Unicode; it
runs happily in an ASCII terminal, because its internal representation
is tokens, not Unicode characters. The Unicode is syntactic sugar
that's part of the Emacs interface and the Scala interface.

 If your only objective is to achieve a rendering for humans to read, TeX is 
 fine, but not if one wants to communicate semantic information on the 
 computer level.
 
 On the contrary, computers are very happy with TeX notation. There are
 several useful mathematical online learning sites (such as, for
 example, Alcumus) which use TeX syntax to interact with the students.

TeX formulas are just for rendering. For example, if you want to have 
superscript to the left, you have to write ${}^x y$.

If you read any introduction to TeX, it will explain how you use
macros to provide a structured markup. If you were using that
notation, then you would define a suitable macro, say 
\def\tetration#1#2{{}^{#2}{#1}}
and write $\tetration{y}{x}$.
This does not depend on any fancy Unicodery for its interpretation,
and also allows you to define semantic content for the computer.





Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-11 Thread Julian Bradfield
On 2012-07-11, Hans Aberg haber...@telia.com wrote:
 There are a number of other incompatibilities between original TeX and 
 Unicode:

 For example, ASCII letters are in TeX math mode typeset in italics, but 
 Unicode has a mathematical italics style, so ASCII letters should be typeset 
 upright in a strict Unicode mode. And similar for Greek letters, I gather.

Unicode is about plain text. TeX is about fine typesetting.
There's no reason why TeX should typeset ASCII as upright, any more
than it should typeset \begin{section} as that literal string! The
use of ASCII characters in math mode is simply an input convention, to
indicate the desired output of italic letters in a style appropriate
for single-letter mathematical variables.
The use of other Unicode characters in TeX input files is also simply
an input convention; how they get typeset depends on many other things
than what they look like in the code charts.





Re: Too narrowly defined: DIVISION SIGN COLON

2012-07-10 Thread Julian Bradfield
On 2012-07-10, Asmus Freytag asm...@ix.netcom.com wrote:
 On 7/10/2012 3:50 AM, Leif Halvard Silli wrote:
 Asmus Freytag, Mon, 09 Jul 2012 19:32:47 -0700:
 The European use (this is not limited to Scandinavia)
 Thanks. It seems to me that this tradition is not without a link
 to the (also) European tradition of *not* using the DIVISION SIGN (÷)
 for division.

 Is it _ever_ used for division? I'm curious, right now I can't recall 
 ever having seen an example.

Depends whether you think Britain is in Europe;-)





Re: Unicode Core

2012-06-22 Thread Julian Bradfield
Asmus wrote:
The Unicode Standard easily uses hundreds of fonts for the code charts, 
from a variety of sources. Despite what should theoretically work, not 
all systems can actually print every code chart. Some users cannot print 
certain of the existing PDFs on their systems, and POD providers have 
similar issues. The Unicode code charts provide a very nice stress 
test for some aspects of rendering, it turns out.

So, as long as code charts create production issues, print-on-demand for 
them is effectively not feasible.

My hard-copy of the code charts was printed by Lulu - they're too big
to print out on my office laserprinters!
The only issue was joining together the fonts that had been split up
when the charts were split into separate PDFs; but the Consortium
wouldn't have that problem, as it would just generate the entire PDF
as one document. (And unlike me, the Consortium probably has
Distiller.)

The standard annexes exist in HTML format. For Unicode 5.0, I took the 

That's more of an issue - I hadn't realized the annexes were actually
composed in HTML - I'd assumed they were written in a high-level
markup language and the HTML generated.






Re: Unicode Core

2012-06-21 Thread Julian Bradfield
On 2012-06-21, Michael Everson ever...@evertype.com wrote:
 On 21 Jun 2012, at 09:47, Raymond Mercier wrote:
 While I am very glad to have this, I really do wonder why there was not a 
 full publication of Unicode 6 or 6.1 from the corporation itself, with all 
 the charts, as we have had with Unicode 1 to 5. Surely there is a market for 
 this ?

 Perhaps less than us character mavens would imagine. Books don't publish 
 themselves, and publishing takes resources of various kinds. 

Not much, if they use the Lulu route, as they already have an account
set up. An hour of somebody's time should do it.
And at a Lulu price, there'll be a lot more of a market than at an
Addison-Wesley price!






Re: Latin chi and stretched x

2012-06-08 Thread Julian Bradfield
Szelp, A. Sz. wrote:
Julian, if you look closely, it is not actually a turned s, but something
created with a turned s in mind. In the very sort of the alphabet, the
regular s has equal (or near-equal) top and bottom bowls. the turned one
has an emphasized upper bowl, which of course stems from the idea of a
turned s (as some fonts have a larger lower bowl of s for balance),
but it is quite clearly not a turned s as identity, but rather something
_inspired_ by a turned s.

Quite clearly wrong! I'm afraid you're suffering from optical delusion.
I actually thought the same when I first looked at it, but it's not
so.
Cut out the turned s; then cut out, say, the initial s of
sonant. Rotate it 180 degrees. They're identical, up to the
tiny variations due to actual ink from metal type.
(Beware that the ś immediately below is from a different fount, and
*does* have more equal bowls. That's what confused me at first.)

Of course, since this was printed in the age of metal type, it *has*
to be a turned s. Cutting a special type would cost far more, and as
David pointed out in his original post, the reason for the absurd
turned p and turned s was that the publishers weren't willing to cut
the extra types to match the letters in the original hand-written script.






Re: Latin chi and stretched x

2012-06-08 Thread Julian Bradfield
But if that linked image contains the full alphabet, then there is no
regular d, which would be confusable with the rotated p. So in fact,

Yes, there is. Try reading the paragraph at the bottom of the page.





Re: Latin chi and stretched x

2012-06-08 Thread Julian Bradfield
On 2012-06-08, David Starner prosfil...@gmail.com wrote:
 On Fri, Jun 8, 2012 at 11:39 AM, Denis Jacquerye moy...@gmail.com wrote:
 Are you sure it's not the opposite? Dorsey had a typewriter that
 didn't have his turned letters, so he used crossed lines below to
 indicate what letters should be turned when printed.

 I don't have a source to refer to, but two things make me find my
 memory more likely here. One, this work was done in 1881 and there
 were no field-portable typewriters then; IIRC, typewriters as a whole
 were rare and he probably sent in his work handwritten.

Dorsey's notes for his Omaha-Ponca dictionary are available online at
http://omahalanguage.unl.edu/dictionary/index.html
They are typed; and the turned letters are notated with a
(hand-written) cross below them, as Denis said. There are also some
handwritten annotations using the same convention. I don't know when
this was done, but at least by then he seems to have been using the
same notation in manuscript as in typescript.





Re: Latin chi and stretched x

2012-06-07 Thread Julian Bradfield
On 2012-06-07, Denis Jacquerye moy...@gmail.com wrote:
 On Thu, Jun 7, 2012 at 12:39 AM, Karl Pentzlin karl-pentz...@acssoft.de
 I agree, we should avoid bad typography. But isn't a Latin chi (the
 IPA Latin chi being proposed) with Greek weights instead of Latin
 weights bad typography? Probably, that glyph still doesn't blend in
 with other Latin glyphs.

Hear! Hear!
When I first got involved with the IPA thirty years ago, I wrote to
them to complain about this. Sadly they ignored me, and have since
made matters even worse by printing with plain old Greek chi.
However, one must recognize that the reversed letters in IPA may also
have the wrong weights to harmonize well, because people implement
them by literal reversal, rather than by drawing the reverse shape
with the normal weight - almost always for ɜ. 
In some versions of the IPA chart, χ is printed with a
glyph that has almost equal weights on the two arms, possibly slightly
more Latin than Greek. (I can't track down the printed source; it's
almost certainly due to a careful choice of Greek font. The version
I'm thinking of is on the Web at
http://25.media.tumblr.com/tumblr_m4szvfypGo1rxozf7o1_1280.png )

Surely there is no basis for distinguishing characters solely on
the basis of weights that are an artefact of the writing device -
nobody would propose using or encoding LATIN SMALL LETTER REVERSED O,
I hope. 
If Latin chi is to be distinguished from stretched x, it would have to
be on the basis of the curved end strokes - and since some chis, Latin
or otherwise, have only very subtly curved end strokes, this is a poor
basis for a distinction. Not impossible, but likely to cause
confusion, I'd have thought.





Re: Latin chi and stretched x

2012-06-07 Thread Julian Bradfield
David Starner wrote:
LATIN SMALL LETTER ROTATED P was used; see
http://commons.wikimedia.org/wiki/File:BAE-Siouan_Alphabet.png . It
has caused some whimpering among those trying to transcribe the text.

Urk! And there's rotated s as well.

Alright, I take it back. There is no limit to the barminess of script
inventors.
Obviously what we need are combining marks whose visual effect
is reversing/rotating the previous glyph. No, I didn't say that, I
really didn't say that...





Re: [OT] Re: Exact positioning of Indian Rupee symbol according to Unicode Technical Committee

2012-05-29 Thread Julian Bradfield
On 2012-05-28, Doug Ewell d...@ewellic.org wrote:
...
 Again, just speaking about one platform (Windows) that seems to be in 
 somewhat common use, the problem is that the underlying architecture 
 doesn't support multiple dead keys on a single base character, nor does 
 it support a fifth, sixth, etc. shift state (unless one chooses to be 
 reckless and use Ctrl). This is unlikely to change in the next two to 
 three years. It isn't a matter of providing a layout—otherwise, anyone 
 with MSKLC and a supported Windows version could create one.
..
 Microsoft can never support ISO/IEC 9995-3:2010 unless they change their 
 keyboard handling architecture, as above.

Why is this a problem? The X keyboard handling has undergone a couple
of significant extensions of architecture over the years, and that
involved getting lots of people to agree. Microsoft can just do it.
And I don't see what the problem is, anyway: from a quick look at the
MS keyboard model, one could (as one does with X) process keystrokes
through a userspace library to get the desired effect. The keyboard
driver may only handle a couple of shift states and one dead
character, but an input library can do whatever it likes. No actual
need to extend all the keyboard drivers - it can all be done by
TranslateMessage(), can't it?
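
To show the sort of thing I mean by an input library (a toy sketch in
Python, not Windows or X code; the key names and the translate function
are invented for the example), a userspace layer can queue up as many
dead keys as it likes and apply them all to the next base character,
whatever the driver underneath supports:

```python
import unicodedata

# Hypothetical dead-key names mapped to the combining marks they stand for.
DEAD_KEYS = {
    "dead_acute": "\u0301",
    "dead_circumflex": "\u0302",
    "dead_diaeresis": "\u0308",
}


def translate(keys):
    """Turn a sequence of key names into text, allowing stacked dead keys.

    Dead keys are queued until a base character arrives; a trailing dead
    key with no base character is simply dropped in this sketch.
    """
    out = []
    pending = []
    for key in keys:
        if key in DEAD_KEYS:
            pending.append(DEAD_KEYS[key])
        else:
            # Attach all queued marks to the base and compose where possible.
            out.append(unicodedata.normalize("NFC", key + "".join(pending)))
            pending.clear()
    return "".join(out)


# Two dead keys on one base: a with circumflex and acute, U+1EA5.
print(translate(["dead_circumflex", "dead_acute", "a"]))
```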





Re: [OT] Re: Exact positioning of Indian Rupee symbol according to Unicode Technical Committee

2012-05-29 Thread Julian Bradfield
On 2012-05-29, Doug Ewell d...@ewellic.org wrote:
 Did you read what I wrote? The *underlying architecture* of Windows key
 handling supports neither additional shift states nor multiple dead
 keys, both of which are required to support this standard. A new version
 of MSKLC on top of the existing architecture will not help.

Again, please could you explain how this is the case?





Re: [OT] Re: Exact positioning of Indian Rupee symbol according to Unicode Technical Committee

2012-05-28 Thread Julian Bradfield
On 2012-05-28, Doug Ewell d...@ewellic.org wrote:
 Karl Pentzlin replied to Jukka K. Korpela:
 JKK I don’t think there will be any standard on [how to type INDIAN
 RUPEE SIGN on a U.S. English keyboard].

 It is contained in the draft of ISO/IEC 9995-9 Multilingual,
 Multiscript Keyboard Group Layouts which is currently being submitted
 to DIS voting.

 ISO/IEC 9995-9 cannot be implemented natively on Microsoft Windows; it 
 requires a third-party add-on package such as Keyman, which is not
 free.

I don't understand this remark. Microsoft Windows is not free, so what
does it matter whether there's a free addon or not?
If ISO/IEC 9995-9 becomes a standard, then Microsoft will presumably
support it, either themselves or by buying Keyman.
If they don't, then there are other operating systems. 
Defects in one OS have no relevance for standards - although they may
be a pain for a little while (e.g. Linux' rather slow support for
Unicode). 







Re: [unicode] Re: Canadian aboriginal syllabics in vertical writing mode

2012-05-17 Thread Julian Bradfield
On 2012-05-03, Michael Everson ever...@evertype.com wrote:
 On 3 May 2012, at 17:35, Asmus Freytag wrote:
 But it would not give an answer to the underlying question, on whether such 
 upright rendering would be the default choice - whether in its own script 
 context, or whether in the context of inserting material (quotes) in other 
 writing systems that do use vertical layout and have a long tradition of 
 doing so.

 We already know that. Rotated Syllabics text is confusing and illegible. This 
 follows directly from the structure of the script. 

 Likewise, I suspect, that no matter how you arrange it, stacked syllabics 
 will look odd enough that the natural way to render longer text that for 
 some reasons have to go vertically, would be rotated.

 I suspect otherwise. I know that un-rotated vertical Syllabics text 
 maintains the basic shapes of the Syllabics characters, and is therefore more 
 legible than rotated vertical Syllabics text, which automatically changes the 
 readings of many syllabics syllables. 


It took me a little while, but I finally managed to put this to an
Inuktitut speaker (Leena Evic of the Pirurvik Centre in Iqaluit, Nunavut).

Her response was that the rotated sidebars on the newsletter cited
earlier are entirely readable (in fact, I had to explain how there
could possibly be a problem), and that the vertical layout advocated
by Michael is not common, and in most cases not ideal.

It would thus appear that Michael is alone in finding rotated
syllabics hard to read.
He might have more luck with a language that doesn't use finals or
other raised letters, but off-hand I can't find one.






Re: Origins of ẘ

2012-05-16 Thread Julian Bradfield
On 2012-05-16, Philippe Verdy verd...@wanadoo.fr wrote:
> The ring below is used in IPA but only under consonants to make them
> voiceless. I don't know its usage under a vowel.

Err, it makes them voiceless. E.g., in Japanese, Satsuki is [satsɯ̥ki].
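
For anyone who wants to check the encoding, here is a minimal Python sketch
(an illustration only; the variable and function names are arbitrary). It
assumes the devoiced vowel in [satsɯ̥ki] is encoded as U+026F followed by
U+0325, and contrasts it with the ẘ of the subject line, which has a
precomposed code point carrying a ring above rather than below.

```python
import unicodedata

# Assumption: the devoiced vowel is U+026F followed by the combining ring below.
devoiced_vowel = "\u026F\u0325"
# The letter from the subject line, as a single precomposed code point.
w_ring_above = "\u1E98"


def describe(text):
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# NFC leaves the vowel as two code points: no precomposed form exists.
vowel_nfc = unicodedata.normalize("NFC", devoiced_vowel)
# NFD splits the precomposed letter into base letter plus combining mark.
w_nfd = unicodedata.normalize("NFD", w_ring_above)

for line in describe(vowel_nfc) + describe(w_nfd):
    print(line)
```

The point of the comparison is only that the ring below attaches to vowels
and consonants by the same mechanism, an ordinary combining mark.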





Re: [unicode] Re: Canadian aboriginal syllabics in vertical writing mode

2012-05-01 Thread Julian Bradfield
On 2012-05-01, Michael Everson ever...@evertype.com wrote:
> than it is in English, except in neon). The examples you showed were
> made by people who hadn't thought about what they were doing. Since

Don't you think the native speakers might know what they're doing?

> Canadian Syllabics characters change their meaning when seen
> sideways, setting text in the way those two documents did it simply
> causes immediate confusion as to the legibility of the text.

Not so. I've never looked at Canadian syllabics before, but it was
immediately obvious (thanks to the superscript characters) that it
was text rotated through 90 degrees, so if I wanted to read it (and knew
the script and the language), I would read it accordingly.
Whether there are character sequences that could be read meaningfully
both as vertical text and rotated text is an interesting question -
is your Inuktitut up to answering it?








Re: Unihan database

2012-04-13 Thread Julian Bradfield
On 2012-04-13, Martin Heijdra mheij...@princeton.edu wrote:
> But now they report that the radical-stroke page itself has changed
> to encodings rather than images; and the radicals are not in the
> standard fonts. Hence, the search pages (clicking on the number of
> strokes of the radical)  shows something like

Ouch. Yes, the radicals aren't in Arial, which is the default font
on my Linux system, so now I can't use radical search either.
Firefox font configuration is, um, not totally documented, either, so
I don't even know if it's possible to tell it to use a different font
for the radical block.
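
A possible stopgap, sketched in Python. The assumption (which I have not
verified against the page) is that it now uses characters from the Kangxi
Radicals block, U+2F00 onwards. Those characters have compatibility mappings
to ordinary unified ideographs, which are more likely to be covered by an
installed CJK font, so NFKC normalization gives something displayable. The
function name is made up for the example.

```python
import unicodedata


def radical_to_ideograph(ch):
    """Map a radical character to its compatibility equivalent via NFKC."""
    return unicodedata.normalize("NFKC", ch)


# Assumed example input: the first character of the Kangxi Radicals block.
kangxi_one = "\u2F00"
fallback = radical_to_ideograph(kangxi_one)

print(unicodedata.name(kangxi_one), "->", f"U+{ord(fallback):04X}")
```

This only helps if you control the text being displayed (say, in a local
copy or a user script); it does nothing about the browser's font choice.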






Re: Joining Arabic Letters

2012-03-30 Thread Julian Bradfield
On 2012-03-30, Andreas Prilop prilop4...@trashmail.net wrote:
> I think a better idea is to have joining glyphs always even for
> different typefaces. At least the Unicode Standard should say
> what should happen when Arabic characters of different typefaces
> follow each other.

How can it? Unicode is about plain text. As soon as you start talking
about different typefaces, you're out of scope.





Re: [indic] Re: Tamil Anusvara (U+0B82) glyph shape [ Re: Dot position in Gurmukhi character U+0A33]

2012-02-09 Thread Julian Bradfield
On 2012-02-09, srivas sinnathurai sisri...@blueyonder.co.uk wrote:
> take it that by phonemes I mean different sounds.
> Now can you answer the 5 vowels as PoA in Tamil and numerous vowel sounds
> in day to day use in Tamil.
> clearly different sounds, not allophones massaging, not phoneme massaging.

So can you give us an example of two words that have two clearly
different sounds, but are written with the same vowel in the standard
orthography?




