Aw: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Jörg Knappen
 


From a practical point of view, text files contain text that is broken into lines. And by long-standing tradition,
line breaks are represented differently on different operating systems. Whenever one transfers a text file between
operating systems, the process behind that transfer takes care of converting the line breaks to the target OS's conventions.

 

Binary files are much simpler: they can be transferred without converting anything, even between different operating systems.

 

Of course, this does not mean that an executable for one OS remains a valid executable under another OS, but there are lots of non-executable
binaries that are useful independent of the OS (e.g. images, sound files, video files, and many other application files).

 

So, for a successful file transfer one needs to know whether it is text or binary, and handle it accordingly.
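
The difference can be illustrated with a minimal Python sketch. The helper name `to_target_newlines` and the sample data are invented for illustration; real transfer tools implement their own conversion.

```python
def to_target_newlines(data: bytes, target: bytes) -> bytes:
    """Convert CRLF, lone CR, and LF line breaks to the target convention."""
    # Normalise every line break to LF first, then expand to the target.
    normalised = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalised.replace(b"\n", target)


# Text mode: LF line breaks become CRLF for the target system.
unix_text = b"first line\nsecond line\n"
windows_text = to_target_newlines(unix_text, b"\r\n")

# A binary payload that merely happens to contain the bytes 0x0A and 0x0D
# is altered by the same conversion, so it has to be sent unchanged.
binary_payload = bytes([0x89, 0x0A, 0x00, 0x0D, 0x0A])
damaged = to_target_newlines(binary_payload, b"\r\n")

print(windows_text)
print(damaged != binary_payload)
```

Applying the text conversion to the binary payload changes its length and content, which is why the transfer mode has to be chosen per file.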

 

--Jörg Knappen

 

Sent: Friday, 21 February 2020, 13:21
From: "Costello, Roger L. via Unicode"
To: "unicode@unicode.org"
Subject: Why do binary files contain text but text files don't contain binary?




Hi Folks,

 

There are binary files and there are text files.

 

Binary files often contain portions that are text. For example, the start of Windows executable files is the text MZ.

 

To the best of my knowledge, text files never contain binary, i.e., bytes that cannot be interpreted as characters. (Of course, text files may contain a text-encoding of binary, such as base64-encoded text.)
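
One way to make "bytes that cannot be interpreted as characters" concrete is a decoding check, sketched here in Python. The function name `looks_like_text` and the sample bytes are illustrative assumptions; the bytes after "MZ" are not taken from a real executable.

```python
def looks_like_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the whole byte string decodes under the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


plain = "Line breaks differ between systems\n".encode("utf-8")
header = b"MZ\x90\x00\x03"  # "MZ" followed by illustrative non-text bytes

print(looks_like_text(plain))
print(looks_like_text(header))
print(looks_like_text(header, "latin-1"))
```

The answer depends on the encoding assumed: 0x90 is not a valid start byte in UTF-8, whereas Latin-1 assigns a character to every byte value, so the same bytes pass that check.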

 

Why the asymmetry?

 

/Roger








Aw: Geological symbols

2020-01-13 Thread Jörg Knappen
Hello Thomas,

 

Unicode delegates this (combined superscripts and subscripts) to higher level markup languages or Rich Text Editors.

 

I don't know how widespread the use of LaTeX is among geologists, but notation like this is a perfect use case for LaTeX.
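
A minimal LaTeX sketch of such notation, assuming the intended reading is a Q carrying the subscript 1 with the superscript 1-2 stacked directly above it:

```latex
\documentclass{article}
\begin{document}
% A subscript and a superscript attached to the same base are stacked
% vertically in math mode, with no horizontal gap between them.
$Q_{1}^{1-2}$
\end{document}
```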

 

--Jörg Knappen

 
 

Sent: Monday, 13 January 2020, 12:20
From: "Thomas Spehs (MonMap) via Unicode"
To: unicode@unicode.org
Subject: Geological symbols




Hi, I would like to ask if there is any way to create geological “symbols” with Unicode such as: Q₁¹ˉ², but with the two “1”s over each other, without a space. Thanks!








Aw: Re: Re: NBSP supposed to stretch, right?

2020-01-06 Thread Jörg Knappen
Festival season is over ...

 

I checked it out, LaTeX does the same for the input of an explicit no break space character.

 

--Jörg Knappen

 
 

Sent: Sunday, 22 December 2019, 22:54
From: "Shriramana Sharma via Unicode"
To: "Jörg Knappen"
Cc: "Asmus Freytag", "UnicoDe List"
Subject: Re: Re: NBSP supposed to stretch, right?


So I was wondering whether TeX only does this to the ~ input character or the actual NBSP Unicode character too?






Aw: Re: NBSP supposed to stretch, right?

2019-12-22 Thread Jörg Knappen
 


Well,

 

in TeX and LaTeX, the no break space (indicated by the active character ~ in TeX input files) is stretchable and stretches to a

normal inter-word space such that all inter-word spaces in a line are equal. But multiple no break spaces still add up to wider spaces

in the output unlike usual space tokens that are collapsed to one space token.

 

-- Jörg Knappen

 

Sent: Tuesday, 17 December 2019, 17:20
From: "Asmus Freytag via Unicode"
To: unicode@unicode.org
Subject: Re: NBSP supposed to stretch, right?



On 12/17/2019 2:41 AM, Shriramana Sharma via Unicode wrote:





On Tue 17 Dec, 2019, 16:09 QSJN 4 UKR via Unicode, <unicode@unicode.org> wrote:

Agree.
By the way, it is common practice to use multiple nbsp in a row to
create a larger span. In my opinion, it is wrong to replace fixed
width spaces with non-breaking spaces.
Quote from Microsoft Typography Character design standards:
«The no-break space is not the same character as the figure space. The
figure space is not a character defined in most computer system's
current code pages. In some fonts this character's width has been
defined as equal to the figure width. This is an incorrect usage of
the character no-break space.»



 

Sorry but I don't understand how this addresses the issue I raised.


You don't?

In principle it may be true that NBSP is not fixed width, but show me software that doesn't treat it that way.

In HTML, NBSP isn't subject to space collapse, therefore it's the go-to space character when you need some extra spacing that doesn't disappear.

I bet, in many other environments it was typically the only "other" space character, so it ended up overloaded.

My hunch is that it is too late at this point to try to promulgate a "clean" implementation of NBSP, because it would effectively change untold documents retroactively. So it would be a massively breaking change.

If you have a situation where you need really poor layout (wide inter-word spaces) to justify, the fact that an honorific in front of a name works more like it's part of the same word (because the NBSP doesn't stretch) would be the least of my worries. (Although, on lines where inter-word spaces are reduced a bit, I can see that becoming counter-intuitive.)

If you only fix this in software for high-end typography, you'd still have the issue that things will behave differently if you export your (plain) text. And you would have the issue of what to do when you want fixed spaces to be non-breaking as well (is that ever needed?).

A./







Aw: acute-macron hybrid?

2019-04-30 Thread Jörg Knappen

Does it also contrast with a circumflex? Historically, circumflexes were quite flexible in their graphical representation.

 

--Jörg Knappen

 

Sent: Tuesday, 30 April 2019, 09:45
From: "Julian Bradfield via Unicode"
To: unicode@unicode.org
Subject: acute-macron hybrid?

The celebrated Bosworth-Toller dictionary of Anglo-Saxon uses a
curious diacritic to mark long vowels. It may be described as a long
shallow acute with a small down-tick at the right.
It contrasts with an acute (quite steep in this typeface) used to mark
accented short vowels.
Both can be seen in the fifth line of the scan at
http://lexicon.ff.cuni.cz/png/oe_bosworthtoller/b0002.png

What is its appropriate Unicode representation?
As a lumper, I would use a macron, but I wonder what a splitter would
say.





Two more ellipsis-type punctuation marks: ?.. and !..

2019-02-07 Thread Jörg Knappen
While working on a corpus of Kyrgyz, a Turkic language written in the Cyrillic script,
I encountered two ellipsis-type punctuation marks, namely ?.. and !..

Note that this is not (yet) a proposal to encode them as single Unicode characters, although I would definitely
use such characters if they were available, because they would make the text processing tool chain much simpler and more
robust. It is a survey question:

Have you encountered ?.. or !.. in languages other than Kyrgyz?

 

--Jörg Knappen


Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

2018-08-23 Thread Jörg Knappen

Asmus,

 

I know your style of humor, but to keep it straight:

 

All known human languages, even Piraha, have pronouns for "I" and "you".

 

--Jörg Knappen

 


Sent: Monday, 20 August 2018, 16:20
From: "Asmus Freytag via Unicode"
To: unicode@unicode.org
Subject: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process)

 

What about languages that don't have or don't use personal pronouns. Their speakers might find their use odd or awkward.

The same for many other grammatical concepts: they work reasonably well if used by someone from a related language, or for linguists trained in general concepts, but languages differ so much in what they express explicitly that if any native speaker transcribes the features that are exposed (and not implied) in their native language it may not be what a reader used to a different language is expecting to see.

A./

 

 






Aw: Re: IBM 1620 invalid character symbol

2017-09-26 Thread Jörg Knappen

I found the character in question on p. 52, it is a picture of something handwritten, not a typeset character. "Clearly" means something different to me.

 

--Jörg Knappen

 

Sent: Tuesday, 26 September 2017, 15:03
From: "John W Kennedy via Unicode" <unicode@unicode.org>
To: "Leo Broukhis" <l...@mailcom.com>, unicode@unicode.org
Subject: Re: IBM 1620 invalid character symbol


I don’t know what your snippet is from, but the normally authoritative IBM manual, A26-5706-3, IBM 1620 CPU Model 1 (July, 1965) displays what is clearly the Cyrillic letter. Whether it should be regarded as that, or as a distinct character, is another question. See http://www.bitsavers.org/pdf/ibm/1620/A26-5706-3_IBM_1620_CPU_Model_1_Jul65.pdf
 

On Sep 26, 2017, at 12:48 AM, Leo Broukhis via Unicode <unicode@unicode.org> wrote:
 



Wikipedia (https://en.wikipedia.org/wiki/IBM_1620#Invalid_character) describes the "invalid character" symbol (see attachment) as a Cyrillic Ж which it obviously is not. 

 

But what is it? Does it deserve encoding, or is it a glyph variation of an existing codepoint?

 

The question is somewhat prompted by 

 


	
		
2BFF 1 HELLSCHREIBER PAUSE SYMBOL
		
	


 

in the pipeline, although I learned about both earlier today within a few minutes of one another.

 

Thanks,

Leo

 














Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized

2017-07-03 Thread Jörg Knappen

No, the hyphenation oddity involving the addition of letters on hyphenation
(or, to be more precise, the suppression of letters in unhyphenated words) never affected the letter s.
It affected other letters (I know examples for f, l, m, n, p, r, and t) when followed by a vowel, as in
Schiffahrt/Schiff-fahrt. It was always Sauerstoffflasche with three f's.

In the old (1910) spelling of German, ss at a word boundary obligatorily became ß. When the
ß was replaced by ss (because of all caps or unavailability of the letter), all three s's were retained.


 

In the current orthography, the hyphenation oddity is removed completely.

 

--Jörg Knappen

 


Sent: Monday, 3 July 2017, 09:43
From: "Alastair Houghton" <alast...@alastairs-place.net>
To: "Jörg Knappen" <jknap...@web.de>
Cc: a.lukya...@yspu.org, unicode@unicode.org
Subject: Re: LATIN CAPITAL LETTER SHARP S officially recognized

On 2 Jul 2017, at 16:59, Jörg Knappen via Unicode <unicode@unicode.org> wrote:
>
> > Is it possible to design fonts that will render ẞ as SS?
>
> In fact, that has happened long before the capital letter sharp s was added to Unicode: The T1 encoding (aka Cork encoding) of LaTeX
> does this since 1990. The reason for this was correct hyphenation for German words rendered in all caps.

Wasn’t there also some oddity relating to hyphenation and “ss”/“SS” in general? I seem to recall that it used to be the case that you ended up with more “s”s than you started with when hyphenating a word containing “ss”…

Kind regards,

Alastair.

--
http://alastairs-place.net
 





Aw: Re: LATIN CAPITAL LETTER SHARP S officially recognized

2017-07-02 Thread Jörg Knappen

> Is it possible to design fonts that will render ẞ as SS?

 


In fact, that happened long before the capital letter sharp s was added to Unicode: the T1 encoding (aka Cork encoding) of LaTeX
has done this since 1990. The reason for this was correct hyphenation of German words rendered in all caps.
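
A related behaviour can be seen in Unicode's default case mappings, sketched here in Python. This is not the LaTeX font mechanism described above, only an illustration of how the small ß and the capital ẞ (U+1E9E) relate in plain-text case conversion.

```python
word = "Straße"

# The default uppercase mapping of the small sharp s is the two letters "SS".
print(word.upper())

# The capital sharp s has to be entered explicitly; it lowercases to ß.
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S
print(capital_sharp_s.lower() == "\u00df")
```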

 

--Jörg Knappen

 


Sent: Saturday, 1 July 2017, 08:51
From: "a.lukyanov via Unicode" <unicode@unicode.org>
To: unicode@unicode.org
Subject: Re: LATIN CAPITAL LETTER SHARP S officially recognized

Is it possible to design fonts that will render ẞ as SS?

So we could choose between ẞ and SS by just selecting the proper font,
without changing the text itself.

Or perhaps there will be a "font feature" to select this rendering
within the same font.
 





Aw: Re: U+0261 LATIN SMALL LETTER SCRIPT G

2017-03-28 Thread Jörg Knappen

This is a script capital G or, in TeX notation, {\cal G}. It reflects the use of multiple styles of the same underlying alphabet in mathematics and sciences.

It is not a capital script g (note the different ordering of capital and script).


 

--Jörg Knappen

 


I had found in 2013 a GꞬ contrast in mathematical notations of an old (1952) physics book (see http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0092.html)

Frédéric
 





Aw: The usage of Z WITH STROKE

2016-11-25 Thread Jörg Knappen

Some anecdotal evidence:

 

I was taught by my math teacher (Germany, 1970s) to put a stroke through every z (upper- or lowercase) in order to
distinguish it from the digit "2".

 

--Jörg Knappen


 

P.S. As far as Pan-Turkic orthography is concerned, there were also a lot of Pan-Turkic Latin alphabets in the revolutionary
Soviet Union (1920s) before Cyrillic alphabets were introduced in the Stalin era.

 

P.P.S. You are certainly aware of this article: https://en.wikipedia.org/wiki/Z_with_stroke


Sent: Friday, 25 November 2016, 15:38
From: "Janusz S. Bień" <jsb...@mimuw.edu.pl>
To: "unicode Unicode Discussion" <unicode@unicode.org>
Subject: The usage of Z WITH STROKE


Hi!

There are two comments to the character(s) in the U0180 chart:

1. Pan-Turkic Latin orthography
2. handwritten variant of Latin “z”

Ad 1.

Do I understand correctly that the Pan-Turkic Latin ortography
refers to the initiative described in the post to the Linguist list:

https://linguistlist.org/issues/4/4-187.html

If so, where to find more information about it? I found already another
post to the Linguist list

https://linguistlist.org/issues/5/5-739.html

but it contains only very general information.

Ad 2.

I'm curious how widespread, in time and space, is/was this
convention. Can you suggest to me where to search for this information?

Best regards

Janusz


--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
 





Aw: Why incomplete subscript/superscript alphabet ?

2016-09-30 Thread Jörg Knappen
 

Sub- and superscripts are considered "higher level markup" and not part of plain text in Unicode. You can easily get
them using LaTeX notation or HTML tags for sub- or superscripts.

So the question is: Why are there *some* sub- and superscript characters in Unicode?

And the answer is: They were found in older character sets, and Unicode provides so-called "round-trip compatibility" with those
older character sets. The relevant older character sets happen not to cover a sensible full range of sub- and superscripts, hence
the gaps in Unicode. It is very probable that those gaps will never be filled.
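
The compatibility status of these characters is visible in the Unicode Character Database. The following Python sketch uses the standard `unicodedata` module; the sample characters and the sample string are chosen only for illustration.

```python
import unicodedata

# Each of these carries a <super> or <sub> compatibility decomposition
# that points back to the ordinary digit or letter.
for ch in "\u00b2\u207f\u2081\u2090":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", unicodedata.decomposition(ch))

# NFKC normalization folds the compatibility characters back to plain text,
# discarding the raised or lowered position.
folded = unicodedata.normalize("NFKC", "x\u00b2 + t\u1d49\u207f\u1d48")
print(folded)
```

The folding shows why such characters are a poor substitute for markup: the sub- or superscript position is treated as a presentation detail that normalization may discard.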

 

--Jörg Knappen

 


Sent: Friday, 30 September 2016, 11:57
From: "Gael Lorieul" <glori...@coanda-deviation.info>
To: "Unicode Discussion" <unicode@unicode.org>
Subject: Why incomplete subscript/superscript alphabet ?

Hello all,

I wonder why only a subset of the alphabet is available as subscript
and/or superscript ?

This is well illustrated on the table in the following Wikipedia page:

https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Latin_and_Greek_tables

Is there a reason for this ?

I would love to have these characters available because I often use
Unicode to write equations as comments of a source code. For instance:

class Term_diff_rotDivStressTensor_splitted
/**
* Computes:
*
* μ       ⎛μ⎞        ⎡1              ⎤
* —.Δω + ∇⎜—⎟×Δu + ∇×⎢—.(∇u + ∇uᵀ)·∇μ⎥
* ρ       ⎝ρ⎠        ⎣ρ              ⎦
*/
{
[...] (class definition)
}


or a more problematic example:

/*
*                ⌠tᵉⁿᵈ
* q(tᴺ) ← q(t⁰) +⎮ rhs(q,t) dt + (tᵉⁿᵈ - tˢᵗᵃʳᵗ)
*                ⌡tˢᵗᵃʳᵗ
*/

Here "end" and "start" would have been better as subscripts, but I could
not do so because letter "d" is not available as a subscript…

As you can see, having only some letters available as subscript (&
superscript) is sometimes a pain…


Gaël Lorieul

PhD student in Computational Fluid Dynamics
at Université catholique de Louvain





Aw: Re: Adding half-star to Unicode?

2016-06-24 Thread Jörg Knappen

Talking about fancy five-pointed stars, besides the vertically split ones there is the "Anarchist star" (a symbol for anarcho-syndicalism)
with a diagonal split into an upper left red half and a lower right black half. Since there are political and ideological symbols encoded
in Unicode, maybe this one is worth encoding as well (probably twice, once as a plain black-and-white symbol and once as a colourful emoji).

 

See here: https://commons.wikimedia.org/wiki/Category:Anarcho-Syndicalism#/media/File:Anarchist_star.svg


 

FIVE POINTED STAR WITH BLACK LOWER RIGHT HALF = anarchist star

ANARCHIST STAR EMOJI

 

--Jörg Knappen

 


Sent: Friday, 24 June 2016, 14:12
From: "Frédéric Grosshans" <frederic.grossh...@gmail.com>
To: unicode@unicode.org
Subject: Re: Adding half-star to Unicode?

Le 24/06/2016 00:37, Leo Broukhis a écrit :
> For a previous discussion on the topic, please see
> the thread "Missing geometric shapes" around 11/12/12
The thread starts here :
http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0008.html

It contains an example of half-filled star used in RTL (Hebrew) context,
in an advertisement in Haaretz here
http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0024.html

 





Aw: Joined "ti" coded as "Ɵ" in PDF

2016-03-18 Thread Jörg Knappen

I inspected the PDF file, and its font encoding is named "Identity-H". I couldn't find out much about this encoding, but it seems to be a private encoding of Adobe used especially for Asian fonts.

 

--Jörg Knappen

 

Sent: Thursday, 17 March 2016, 17:43
From: "Don Osborn" <d...@bisharat.net>
To: unicode@unicode.org
Subject: Joined "ti" coded as "Ɵ" in PDF

Odd result when copy/pasting text from a PDF: For some reason "ti" in
the (English) text of the document at
http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
is coded as "Ɵ". Looking more closely at the original text, it does
appear that the glyph is a "ti" ligature (which afaik is not coded as
such in Unicode).

Out of curiosity, did a web search on "internaƟonal" and got over 11k
hits, apparently all PDFs.

Anyone have any idea what's going on? Am assuming this is not a
deliberate choice by diverse people creating PDFs and wanting "ti"
ligatures for stylistic reasons. Note the document linked above is
current, so this is not (just) an issue with older documents.

Don Osborn





Aw: Re: Enclosing BANKNOTE emoji?

2016-02-10 Thread Jörg Knappen

For the pound emoji, throw in ~90M Egyptians.

 

--Jörg Knappen

 

Sent: Tuesday, 9 February 2016, 23:46
From: "Leo Broukhis" <l...@mailcom.com>
To: "Mark Davis ☕️" <m...@macchiato.com>
Cc: "unicode Unicode Discussion" <unicode@unicode.org>
Subject: Re: Enclosing BANKNOTE emoji?









The emojiexpress.com site is useful to check which new emoji or combinations people actually use, but the stats are likely skewed by only measuring input from one platform.
 
Another way to look at the emojitracker.com stats:
 
339M people in the Eurozone : 389K uses of Euro emoji
126M people in Japan : 354K uses of Yen emoji
140M people in UK + Turkey (likely users of the Pound emoji as a stand-in for Lira) : 515K uses of pound emoji
 
The total is 605M people : 1258K uses of non-dollar emoji
Assuming the same average frequency of use, 2933K uses of the dollar emoji would be produced by 1411M people, out of which us + canada + mexico + australia   (500M) + other countries using $ as (part of) the sign for their currency are way less than a half. This means that substantially more than 500M people are using the dollar emoji by default, instead of emoji of their national currencies. Assuming a lesser frequency of use will result in a greater estimate of the affected population.
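
The estimate above can be retraced with a few lines of Python. The figures are the ones quoted in the message (populations in millions, uses in thousands); the variable names are made up for this sketch.

```python
# (population in millions, emoji uses in thousands), as quoted above
non_dollar = {
    "euro": (339, 389),
    "yen": (126, 354),
    "pound": (140, 515),
}

people = sum(p for p, _ in non_dollar.values())
uses = sum(u for _, u in non_dollar.values())

# Average uses (in thousands) per million people for the non-dollar emoji.
uses_per_million = uses / people

# Population (in millions) that would produce the dollar count at the same rate.
dollar_uses = 2933
implied_dollar_people = dollar_uses / uses_per_million

print(people, uses, round(implied_dollar_people))
```

Under the same-frequency assumption this reproduces the totals of 605M people and 1258K uses, and the implied 1411M users of the dollar emoji.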
 
Leo

 


 
On Tue, Feb 9, 2016 at 8:51 AM, Mark Davis ☕️ <m...@macchiato.com> wrote:



Look at http://www.emojixpress.com/stats/. The stats are different, since they collect data from keyboards not twitter posts, but they have a nice button to view only the new emoji.

 

(The numbers on the new ones will be smaller, just because it takes time for systems to support them, and people to start using them. However, they bear out my prediction that the most popular would be the eyes-rolling face.)


 






 


Mark


 








 
On Tue, Feb 9, 2016 at 5:19 PM, Leo Broukhis <l...@mailcom.com> wrote:



A caveat about using emojitracker.com : it doesn't count newer emoji yet (e.g. U+1F37E bottle with popping cork is absent), thus, when they are added, their counts will be skewed.
 
Leo



 
On Tue, Feb 9, 2016 at 2:00 AM, Leo Broukhis <l...@mailcom.com> wrote:



Thank you for the links, quite mesmerizing!

On emojitracker.com (cumulative counts, but only on twitter, AFAICS), U+1F4B5 ($) had quite a respectable count of 2932622 (well above the middle of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around 30%ile), and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. (around 20%ile, but 10x more than the lowest counts, and about the same frequency as various individual clock faces). 
 
It is quite evident that the dollar banknote emoji serves as a stand-in for at least half a dozen of various currencies.






 
On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis ☕️ <m...@macchiato.com> wrote:





I would suggest that you first gather statistics and present statistics on how often the current combinations are used compared to other emoji, eg by consulting sources such as:

 

http://www.emojixpress.com/stats/

or

http://emojitracker.com/


 






 


Mark


 








 
On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis <l...@mailcom.com> wrote:

There are

 U+01F4B4 Banknote With Yen Sign
 U+01F4B5 Banknote With Dollar Sign
 U+01F4B6 Banknote With Euro Sign
 U+01F4B7 Banknote With Pound Sign

This is clearly an incomplete set. It makes sense to have a generic
"enclosing banknote" emoji character which, when combined with a
currency sign, would produce the corresponding banknote, to forestall
requests for individual emoji for banknotes with remaining currency
signs.

Leo
 




























Aw: Re: Turned Capital letter L (pointing to the left, with serifs)

2016-01-05 Thread Jörg Knappen

I have looked up some printed sources and I agree with Michael Everson and Frédéric Grosshans that the
beast in question is a variant of the Greek letter tau (capital or lowercase).

 

Here are the relevant sources I consulted:

 

Carl Faulmann: Das Buch der Schrift. Enthaltend die Schriftzeichen und Alphabete aller Zeiten und aller Völker des Erdkreises. Verlag der kaiserlich königlichen Staatsdruckerei. Wien 1878, 2. verm. und verb. Aufl. 1880 p.171

Hans Jensen: Die Schrift in Vergangenheit und Gegenwart, 3. Auflage p.459

 

Here is a quote from Hans Jensen:

   Noch in modernen Drucken finden wir die Formen ϐθϖ3ϲ7, wo andere βϑπζςτ haben.


Note: I had to fake the zeta symbol with a digit 3 and the tau symbol with a digit 7 here. In German typesetting tradition the theta symbol ϑ is the preferred form, not the straight theta θ.

My opinion: The Greek Zeta Symbol and the Greek Tau Symbol are on the same footing as the "lunate sigma" already encoded in Unicode. They should be added in both lowercase and capital form.
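
For reference, the variant letterforms that are already encoded can be listed with Python's standard `unicodedata` module. This is only a lookup sketch; whether the names GREEK ZETA SYMBOL and GREEK TAU SYMBOL resolve depends on the Unicode version shipped with the Python in use, so no result is asserted for them here.

```python
import unicodedata

# Variant forms that already have their own code points, including
# the lowercase and capital lunate sigma mentioned above.
encoded_variants = "\u03d0\u03d1\u03d6\u03f2\u03f9"
for ch in encoded_variants:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Look up the two names suggested above by analogy.
for name in ("GREEK ZETA SYMBOL", "GREEK TAU SYMBOL"):
    try:
        found = unicodedata.lookup(name)
        print(name, "is encoded at", f"U+{ord(found):04X}")
    except KeyError:
        print(name, "is not in this Unicode version")
```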

 

--Jörg Knappen

 


Sent: Tuesday, 5 January 2016, 06:08
From: "Asmus Freytag (t)" <asmus-...@ix.netcom.com>
To: unicode@unicode.org
Subject: Re: Turned Capital letter L (pointing to the left, with serifs)



On 1/4/2016 1:33 PM, Frédéric Grosshans wrote:


I looked through all the pages of the 1809 edition of _Theoria motus corporum coelestium in sectionibus conicis solem ambientium_ https://archive.org/stream/bub_gb_ORUOQAAJ where Gauss used this notation on pages 80-81. Almost all notations are standard enough to be familiar to any modern (2015) mathematician or physicist, with two exceptions: this "7" symbol and ☊ U+260A ASCENDING NODE (which is still standard in astronomy). The Greek letters in particular have a pretty standard shape, and I don't see why this symbol would be the only Greek letter using a fancy cursive shape. Even the Latin letters used standard shapes (italic, roman, a few capital fraktur).

That said,  I did not spot a tau in the text, while most of the Greek alphabet was used. Could "7" be a standard shape for tau in 1809 Hamburg ?


The problem is that he used capital Tau, which, in most fonts, looks precisely like capital Latin T. So, he used an alternate shape, the cursive one, which would have been familiar to him based on the fact that he probably studied Greek as part of his education, pretty standard subject at the time and even a hundred years later in upper level schools in Hamburg and elsewhere in Germany (and he would have seen and reproduced handwritten forms, not just printed ones).
 

However, I still think it is a ⦢ U+29A2 TURNED ANGLE

No, an angle would have two straight lines.

A Greek letter has, overall, a much higher probability of being used for a variable than almost any other symbol (the one non-letter symbol, Ascending node, is one that you say is still standard in astronomy), whereas any quick search of the literature of the 19th century shows that no symbol is consistently used for the "avery daily angle".

For all of these reasons, I find the suggestion of U+29A2 unconvincing.

A./

  Frédéric
 


Le lun 4 janv. 2016 21:38, Raymond Mercier <raym...@almanach.co.uk> a écrit :





On further reflection I can well agree that it is tau. The attached images from R. Barbour, Greek Literary Hands, show clearly (scan 3) the large upper case tau in several lines, and in scan 4 in the first and other lines a hooked version of tau. So I withdraw my suggestion of pi.







Raymond



 


From: Asmus Freytag (t)

Sent: Monday, January 04, 2016 7:58 PM

To: unicode@unicode.org

Subject: Re: Turned Capital letter L (pointing to the left, with serifs)



 









On 1/4/2016 10:41 AM, Michael Everson wrote:

Certainly it does look more like a very common variant of “tau” than “pi”

Variant of uppercase tau?

A./













Aw: Re: Turned Capital letter L (pointing to the left, with serifs)

2016-01-05 Thread Jörg Knappen

Sigh, I have to correct the attribution of the character identification, I meant Raymond Mercier and I should also mention Asmus Freytag in the place of Frédéric Grosshans.

 

--Jörg Knappen

 

Sent: Tuesday, 5 January 2016, 10:10
From: "Jörg Knappen" <jknap...@web.de>
To: "Asmus Freytag (t)" <asmus-...@ix.netcom.com>
Cc: unicode@unicode.org
Subject: Aw: Re: Turned Capital letter L (pointing to the left, with serifs)




I have looked up some printed sources and I agree with Michael Everson and Frédéric Grosshans that the
beast in question is a variant of the Greek letter tau (capital or lowercase).

 

Here are the relevant sources I consulted:

 

Carl Faulmann: Das Buch der Schrift. Enthaltend die Schriftzeichen und Alphabete aller Zeiten und aller Völker des Erdkreises. Verlag der kaiserlich königlichen Staatsdruckerei. Wien 1878, 2. verm. und verb. Aufl. 1880 p.171

Hans Jensen: Die Schrift in Vergangenheit und Gegenwart, 3. Auflage p.459

 

Here is a quote from Hans Jensen:

   Noch in modernen Drucken finden wir die Formen ϐθϖ3ϲ7, wo andere βϑπζςτ haben.


Note: I had to fake the zeta symbol with a digit 3 and the tau symbol with a digit 7 here. In German typesetting tradition the theta symbol ϑ is the preferred form, not the straight theta θ.

My opinion: The Greek Zeta Symbol and the Greek Tau Symbol are on the same footing as the "lunate sigma" already encoded in Unicode. They should be added in both lowercase and capital form.

 

--Jörg Knappen









Aw: Symbol for an upside down capital L, pointing to the right?

2016-01-04 Thread Jörg Knappen

Err... in what respect would this symbol be different from a CAPITAL GREEK LETTER GAMMA?

 

--Jörg Knappen

 

Sent: Friday, 25 December 2015, 14:43
From: "Costello, Roger L." <coste...@mitre.org>
To: "unicode@unicode.org" <unicode@unicode.org>
Subject: Symbol for an upside down capital L, pointing to the right?

Hi Folks,

Here is the upside down capital L, pointing to the left:

⅂ - TURNED SANS-SERIF CAPITAL L (U+2142)

Is there a symbol for an upside down capital L, pointing to the right?

/Roger
 





Turned Capital letter L (pointing to the left, with serifs)

2016-01-04 Thread Jörg Knappen
Here is a report of a rather strange beast occurring in historical math printing (work of C. F. Gauß) in the 19th century:

 

http://tex.stackexchange.com/questions/284483/how-do-i-typeset-this-symbol-possibly-astronomical

 

images are here:

 

http://www.archive.org/stream/abhandlungenmet00gausrich#page/n129/mode/2up

http://i.stack.imgur.com/57fN3.png

 

It looks like a big digit "7" or like a turned letter "L". In the accepted answer it was identified with the Tironian note et, an identification
I'd dispute because the Tironian note et is usually smaller than a capital Latin letter.

Does anyone know what it is?

 

--Jörg Knappen


Aw: Re: Proposal for German capital letter "ß"

2015-12-10 Thread Jörg Knappen

Since the capital sharp s became easily available to the public, I have seen it popping up everywhere in
German publications, mostly in an all-caps environment. I have a small collection of examples (on paper).

 

The use of the capital sharp s in German is not only a historical artefact, it is recent and modern.

 

--Jörg Knappen

 

Martin Dürst wrote:



However, the example is also somewhat misleading. The book in the
picture is clearly quite old. The Duden that was cited is new. I checked
with "Der Grosse Duden" on Amazon, but all the books I found had the
officially correct spelling. On the other hand, I remember that when the
upper-case sharp s came up for discussion in Unicode, source material
showed that it was somewhat popular quite some time ago (possibly close
in age with the old Duden picture). So we would have to go back and
check the book in the picture to see what it says about ß to be able to
claim that Duden was (at some point in time) inconsistent with itself.

Regards, Martin.
 





Aw: New Character Property for Prepended Concatenation Marks

2015-11-26 Thread Jörg Knappen

I wonder how this concept relates to mathematical notation, especially the root sign.

 

--Jörg Knappen

 

Sent: Wednesday, 25 November 2015, 23:34
From: announceme...@unicode.org
To: announceme...@unicode.org
Subject: New Character Property for Prepended Concatenation Marks



The Unicode Technical Committee is seeking feedback on a proposal to define a new character property for the class of prepended concatenation marks, also referred to as prefixed format control characters or, more generically, as subtending marks. Characters in that class include U+0600 ARABIC NUMBER SIGN and U+06DD ARABIC END OF AYAH. The new property, named Prepended_Concatenation_Mark and targeted for Unicode 9.0, would provide a mechanism to handle subtending marks collectively via properties rather than by hardcoded enumeration. A detailed description of the issue and how to provide feedback are given in Public Review Issue #310.

http://blog.unicode.org/2015/11/new-character-property-for-prepended.html
 






Aw: Re: Square Brackets with Tick

2015-08-24 Thread Jörg Knappen

I must admit, although I have seen really lots of mathematical notations, I have never encountered

those particular brackets. I have no intuition how they should pair.



--Jörg Knappen



Sent: Saturday, 22 August 2015, 18:35
From: Julian Bradfield <jcb+unic...@inf.ed.ac.uk>
To: unicode@unicode.org
Subject: Re: Square Brackets with Tick

On 2015-08-22, Nigel Small <ni...@nigelsmall.com> wrote:
 298D; 2990; o # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
 298E; 298F; c # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
 298F; 298E; o # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
 2990; 298D; c # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER

 with several code points in between. According to the code point pairs in
 the first and second columns of this file, these particular brackets should
 be paired as the *first and fourth* and the *third and second*. Intuitively
 however, these would actually be *first and second* and *third and fourth*
 if one is to expect consistency.

That's a strange intuition! Mathematical brackets are expected to pair
with left-right symmetry, not rotational symmetry. As in, for example,
floor and ceiling brackets. The pairing in the file is the natural one.

 1. The current pairing information is correct and the sequence is irregular
 for some historical reason

That will be the explanation. There is no inherent meaning to the
order of codepoints, it's just convenience.
One of the experts here can probably tell us why these four brackets
happen to be coded in this order.
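
Editorial sketch: the pairing argument above can be checked mechanically. The table below copies only the four data lines quoted in this message; the helper names `paired_bracket` and `is_symmetric` are hypothetical and not part of any Unicode tooling.

```python
# Pairing data copied from the four quoted lines:
# code point -> (paired code point, "o" for opening / "c" for closing).
BRACKET_PAIRS = {
    0x298D: (0x2990, "o"),  # LEFT SQUARE BRACKET WITH TICK IN TOP CORNER
    0x298E: (0x298F, "c"),  # RIGHT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
    0x298F: (0x298E, "o"),  # LEFT SQUARE BRACKET WITH TICK IN BOTTOM CORNER
    0x2990: (0x298D, "c"),  # RIGHT SQUARE BRACKET WITH TICK IN TOP CORNER
}


def paired_bracket(ch):
    """Return the bracket paired with ch, or None if ch is not in the table."""
    entry = BRACKET_PAIRS.get(ord(ch))
    return chr(entry[0]) if entry else None


def is_symmetric(table):
    """Check that each entry's partner points back and has the opposite type."""
    return all(
        table[other][0] == cp and table[other][1] != kind
        for cp, (other, kind) in table.items()
    )


print(hex(ord(paired_bracket("\u298D"))))
print(is_symmetric(BRACKET_PAIRS))
```

The first and fourth entries pair with each other, as do the third and second, which is the left-right mirror pairing described in the reply.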

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.






Aw: Re: Bunny hill symbol, used in America for signaling ski pistes for novices

2015-05-29 Thread Jörg Knappen

From the description of the symbol it looks like a geometric shape. I think it is worth encoding as a geometric shape (TWO BLACK DIAMONDS VERTICALLY STACKED or something like this) with a note "* bunny hill". It may have (or find in the future) other uses.



--Jörg Knappen



Gesendet:Donnerstag, 28. Mai 2015 um 23:20 Uhr
Von:Shervin Afshar shervinafs...@gmail.com
An:Shawn Steele shawn.ste...@microsoft.com
Cc:verd...@wanadoo.fr verd...@wanadoo.fr, unicode Unicode Discussion unicode@unicode.org, Jim Melton jim.mel...@oracle.com
Betreff:Re: Bunny hill symbol, used in America for signaling ski pistes for novices


Since the double-diamond has map and map legend usage, it might be a good idea to have it encoded separately. I know that I'm stating the obvious here, but the important point is doing the research and showing that it has widespread usage.




 Shervin




On Thu, May 28, 2015 at 2:15 PM, Shawn Steele shawn.ste...@microsoft.com wrote:




I'm used to them being next to each other. So the entire discussion seems to be about how to encode a concept vs how to get the shape you want with existing code points. If you just want the perfect shape, then maybe an SVG is a better choice. If we're talking about describing ski-run difficulty levels in plain text, then the hodge-podge of glyphs being offered in this thread seems kinda hacky to me.



-Shawn



From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 2:12 PM
To: Jim Melton
Cc: Shawn Steele; unicode Unicode Discussion
Subject: Re: Bunny hill symbol, used in America for signaling ski pistes for novices






Some documentation also suggests that the two diamonds are not stacked one above the other, but placed horizontally. That's a good argument for using only one symbol, encoding it twice in plain text if needed.






2015-05-28 22:15 GMT+02:00 Jim Melton jim.mel...@oracle.com:



I no longer ski, but I did so for many years, mostly (but not exclusively) in the western United States. I never encountered, at any USA ski hill/mountain/resort, a special symbol for "bunny hills", which are typically represented by the green circle meaning "beginner". That's anecdotal evidence at best, but my observations cover numerous skiing sites. I have encountered such a symbol in Europe and in New Zealand, but not in the USA. (I have not had the pleasure of skiing in Canada and am thus unable to speak about ski areas in that country.)

The double black diamond would appear to be a unique symbol worthy of encoding, simply because the only valid typographical representation (in the USA) is two single black diamonds stacked one above the other and touching at the points. 

Hope this helps,
 Jim




On 5/28/2015 2:04 PM, Shawn Steele wrote:



So is double black diamond a separate symbol? Or just two of the black diamond?



And Blue-Black?



I'm drawing a blank on a specific bunny sign; in my experience those are usually just green.



Aren't there a lot of cartography symbols for various systems that aren't present in Unicode?



From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: Bunny hill symbol, used in America for signaling ski pistes for novices




Is there a symbol that can represent the Bunny hill symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe).






I'm looking for the symbol itself, not the color, or the form of the sign.







For example blue pistes in Europe are designed with a green circle in America, but we have a symbol for the circle; red pistes in Europe are signaled by a blue square in America, but we have a symbol for the square; black pistes in Europe are signaled by a black diamond in America, but we also have such black diamond in Unicode.







But I can't find an equivalent to the American "Bunny hill" signal, equivalent to green pistes in Europe (this is a problem for webpages related to skiing: do we have to embed an image?).













-- 



Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144

 Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG Fax : +1.801.942.3345

Oracle Corporation Oracle Email: jim dot melton at oracle dot com

1930 Viscounti Drive Alternate email: jim dot melton at acm dot org

Sandy, UT 84093-1063 USA Personal email: SheltieJim at xmission dot com



= Facts are facts. But any opinions expressed are the opinions =

= only of myself and may or may not reflect the opinions of anybody =

= else with whom I may or may not have discussed the issues at hand. =

 



















Aw: Combining character example

2015-04-16 Thread Jörg Knappen

Hi Mark,



the use of DOT BELOW and LINE BELOW is in fact consistent in German Duden. The

difference in the diacritics is used to denote length of the stressed vowel, DOT BELOW

denotes a short vowel and LINE BELOW denotes a long vowel.



Diphthongs are always long and there is a single line under the whole Diphthong.



Digraphs (e.g. the ou in words borrowed from French) also have either a single line

under the whole digraph or (this happens rarely) a single dot in the middle of the

digraph.
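
Editorial sketch: the convention described above can be reproduced in plain text with combining marks. This assumes Python 3 and the two code points named in the quoted message below (U+0323 and U+0332); `mark_stress` is a hypothetical helper, and the position of the stressed vowel is supplied by the caller, since this sketch has no dictionary knowledge.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW: short stressed vowel
LOW_LINE = "\u0332"   # COMBINING LOW LINE: long stressed vowel


def mark_stress(word, index, long_vowel):
    """Insert the combining mark directly after the letter at `index`."""
    mark = LOW_LINE if long_vowel else DOT_BELOW
    return word[: index + 1] + mark + word[index + 1 :]


# Assumption for illustration: the stressed short vowel is the first letter.
marked = mark_stress("unterbuttern", 0, long_vowel=False)
print(marked)
print([unicodedata.name(c) for c in marked[:2]])

# NFC composes the letter and the dot into one precomposed code point.
print(len(unicodedata.normalize("NFC", marked)), len(marked))
```

Because a combining mark follows its base letter, a digraph or diphthong marked as a whole would need the mark repeated after each letter, or markup outside plain text.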




--Jörg Knappen




Gesendet:Donnerstag, 16. April 2015 um 10:01 Uhr
Von:Mark Davis  m...@macchiato.com
An:Unicode Public unicode@unicode.org, Unicode Book b...@unicode.org
Betreff:Combining character example




I happened to run across a good example of productive use of combining marks, the Duden site (a great online dictionary for German). They use U+0323 ( ◌̣ ) COMBINING DOT BELOW to indicate the stress. Here is an example:



unterbuttern






http://www.duden.de/rechtschreibung/unterbuttern



They aren't, however, consistent; you also see underlining for stress.





	einschränken


But not, interestingly, with the HTML underline, but with U+0332 ( ◌̲ ) COMBINING LOW LINE.













Mark














Looking for a standard on historical countries

2014-10-31 Thread Jörg Knappen
Sorry for this off-topic question:



Is anyone here aware of a standard or a de facto standard for names or codes of historical countries? For the requirement I have in mind, covering all countries where there was a printing press would be optimal; anything going back before 1974 (ISO 3166-3) would be better than nothing.



The Getty Thesaurus of Geographical Names (TGN; http://www.getty.edu/research/tools/vocabularies/tgn/ ) covers some historical countries (e.g., Preussen), but is far from complete (missing, e.g., Schaumburg-Lippe). The same holds for the MARC Code List for Countries ( http://www.loc.gov/marc/countries/countries_name.html ).



Thanks,



Jörg Knappen


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Aw: Ambiguous hyphenation cases with

2014-07-24 Thread Jörg Knappen

With TeX and LaTeX there is an elegant solution.



TeX has the primitive \discretionary{prebreak}{postbreak}{nobreak}, which spells out like

\discretionary{t-}{}{}

for the insertion of an additional t at hyphenation. It also handles cases like the traditional German hyphenation

of ck as k-k with

\discretionary{k-}{k}{ck}




The Babel system (inspired by german.sty) includes nifty shorthands like "tt and "ck for these cases.



The semantics of U+00AD (SOFT HYPHEN) are too primitive to implement this kind of behaviour; the same is true for &shy; in HTML.
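
Editorial sketch: the difference between a discretionary and a soft hyphen can be modelled in a few lines. This is a toy model in Python 3, not TeX's actual line breaker; the class `Discretionary` and the function `render` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Discretionary:
    prebreak: str   # text ending the first line if the break is taken
    postbreak: str  # text starting the next line if the break is taken
    nobreak: str    # text used if the break is not taken


def render(parts, break_here):
    """Render a list of str/Discretionary items as a list of lines."""
    lines = [""]
    for part in parts:
        if isinstance(part, Discretionary):
            if break_here:
                lines[-1] += part.prebreak
                lines.append(part.postbreak)
            else:
                lines[-1] += part.nobreak
        else:
            lines[-1] += part
    return lines


# Swedish "mattjuv" read as matt + tjuv: the break restores the dropped t.
carpet_thief = ["mat", Discretionary("t-", "", ""), "tjuv"]
# A soft hyphen can only express the special case ("-", "", "").
soft_hyphen = ["mat", Discretionary("-", "", ""), "tjuv"]

print(render(carpet_thief, break_here=False))
print(render(carpet_thief, break_here=True))
print(render(soft_hyphen, break_here=True))
```

Both words render identically when unbroken; only the three-part form can change the spelling at the break, which is the information a soft hyphen has no way to carry.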



--Jörg Knappen




Gesendet:Dienstag, 22. Juli 2014 um 16:03 Uhr
Von:fantasai fantasai.li...@inkedblade.net
An:Håkan Save Hansson hakan.hans...@edison.se, www-st...@w3.org www-st...@w3.org, Unicode unicode@unicode.org
Betreff:Ambiguous hyphenation cases with

On 05/12/2014 12:43 AM, Håkan Save Hansson wrote:
 Hi fantasai,

 Regarding your answer to my second suggestion (if you are referring
 to James Clark's first answer):

 The problem is that the hyphenation system in itself can't decide how
 to change the spelling, without any dictionary functionality. It
 can't know if I meant "mat-tjuv" (food thief in Swedish) or "matt-tjuv"
 (carpet thief) when I wrote "mat&shy;tjuv". So there has to be a way
 to tell the hyphenation system that.

Hm. I don't think I have a solution for that problem. :/ Currently you'd
just have to not hyphenate that word.

CCing Unicode, in case anyone there has a solution

Up-reference: http://lists.w3.org/Archives/Public/www-style/2014Feb/0739.html

~fantasai


Aw: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol

2014-03-21 Thread Jörg Knappen

Even if this symbol really catches on (which I doubt, because it is too close to the @ sign in the first place), chances are low that it will be encoded in Unicode. Precedents like the Creative Commons sign or the Copyleft sign have been discussed on this mailing list (search the archives for the relevant threads) but were never encoded in Unicode.



If the symbol does not catch on, why should it be encoded in Unicode?



--Jörg Knappen



Gesendet:Freitag, 21. März 2014 um 12:14 Uhr
Von:Jan Velterop velte...@gmail.com
An:unicode@unicode.org
Betreff:New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol

May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that © and ® denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a tail, as in a font like Arial, for instance, and not as in a font like Century Gothic.

A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html

The intended use would be for documents and images that have been published with so-called BOAI-compliant open access (http://www.budapestopenaccessinitiative.org/read), meaning that all reuse is permitted, with the only permissible condition that the author(s) should be acknowledged (CC_BY licence: http://creativecommons.org/licenses/by/4.0/). This condition would not be mandatory, and also public domain, CC-0 licences would be denoted by the proposed symbol (http://creativecommons.org/publicdomain/zero/1.0/)

I am seeking comments and support for this proposal.

Jan Velterop


Aw: Re: [Unicode] two Hanzi

2014-03-20 Thread Jörg Knappen


Just to keep the chemistry straight: Nebulium and Coronium were conjectured

elements at the end of the 19th century and beginning of the 20th century when

the atomic number that classifies the elements was not yet known.



They have their place in spectroscopic literature and astrophysics; but the spectral lines

associated with them are now identified as so-called forbidden lines of well-known elements.



Nevertheless, the characters for them are certainly used (as are the English names; Nebulium

even has a Wikipedia and a Britannica entry) and would be a legitimate addition to Unicode.

Who will write a proposal?



--Jörg Knappen




Gesendet:Donnerstag, 20. März 2014 um 14:50 Uhr
Von:suzuki toshiya mpsuz...@hiroshima-u.ac.jp
An:shi zhao shiz...@gmail.com
Cc:unicode@unicode.org
Betreff:Re: [Unicode] two Hanzi

If they are officially standardized characters for the
elements by PRC government, China NB will submit them
to ISO/IEC 10646 via Urgently Needed Characters process.
They are official?

Regards,
mpsuzuki

On 03/20/2014 10:36 PM, shi zhao wrote:
 please add two Hanzi (up + down ) and (up  + down )

 see http://www.term.org.cn/CN/abstract/abstract9314.shtml#

 include in :
 * Zhonghua Zihai, 1994: 1770.
 * Lu gusun, The English-Chinese Dictionary (), 1991: 701,2219.


 (up + down ) = nebulium (see http://yedict.com/zslistbs.asp?word=%C6%F84 )
 (up  + down ) = coronium = newtonium (see
 http://yedict.com/zslistbs.asp?word=%C6%F87 )



 My blog: http://shizhao.org
 twitter: https://twitter.com/shizhao

 [[zh:User:Shizhao]]





Aw: Astrological symbol for Pluto?

2014-02-03 Thread Jörg Knappen

Unfortunately,



this astrological symbol is given in the Wikipedia article, but not sourced. So I think, further evidence for its usage is needed.



--Jörg Knappen



Gesendet:Sonntag, 02. Februar 2014 um 05:20 Uhr
Von:Shriramana Sharma samj...@gmail.com
An:UnicoDe List unicode@unicode.org
Betreff:Astrological symbol for Pluto?



Currently Unicode encodes a distinct astrological symbol for Uranus 2645 ♅ vs an astronomical symbol 26E2 ⛢.

However the only symbol encoded for Pluto is the astronomical one: 2647 ♇. Just now I learnt from https://en.wikipedia.org/wiki/Pluto#Name that there is a distinct astrological symbol:


Has there been any proposal to encode this? (I'm guessing Michael might be interested...)


--
Shriramana Sharma  



Aw: Re: Astrological symbol for Pluto?

2014-02-03 Thread Jörg Knappen

In fact, several dwarf planets have _astronomical_ symbols that were published together with their

official names in the relevant astronomical journals. When it became clear that there are too many

minor planets around, the assignment of symbols was halted. (37) Fides was the last minor planet

to receive a symbol.



Most of the symbols are already available in Unicode.



For a quick reference, see http://en.wikipedia.org/wiki/Astronomical_symbols



--Jörg Knappen



P.S. I have also seen astro_l_ogical symbols for some of the Kuiper belt objects, but there seems to

be no agreement between different authors.



Gesendet:Montag, 03. Februar 2014 um 14:14 Uhr
Von:Shriramana Sharma samj...@gmail.com
An:Frédéric Grosshans frederic.grossh...@gmail.com
Cc:UnicoDe List unicode@unicode.org
Betreff:Re: Aw: Astrological symbol for Pluto?




On Mon, Feb 3, 2014 at 4:15 PM, Frédéric Grosshans frederic.grossh...@gmail.com wrote:





Actually, it is sourced (with the other symbols) to http://www.uranian-institute.org/bfglyphs.htm , which lists no less than 4 symbols for Pluto...



In any case, it seems its astronomical symbol was encoded quite early (DerivedAge = 1.1) which was before the 2006 IAU decision to demote it to dwarf planet status. Of course, even if it were encoded today I'm sure it would be the only dwarf planet to have a symbol encoded since no other dwarf planet has captured the common man's imagination (and basic knowledge) like Pluto, and I have not heard of any of the other dwarf planets (Ceres, Haumea, Makemake and Eris) having any symbols...


--
Shriramana Sharma  



Aw: Re: Re: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2014-01-30 Thread Jörg Knappen

When you are looking for a *new* name for that encoding, why don't you just adopt the pythonese precedent

"mysql-latin1"? It is as good or as bad as any other name, but has some footing just now.



--Jörg Knappen



Gesendet:Mittwoch, 29. Januar 2014 um 21:12 Uhr
Von:Anne van Kesteren ann...@annevk.nl
An:Buck Golemon b...@yelp.com
Cc:Markus Scherer markus@gmail.com, Jörg Knappen jknap...@web.de, Frédéric Grosshans frederic.grossh...@gmail.com, unicode unicode@unicode.org, unic...@norbertlindenberg.com
Betreff:Re: Re: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

On Wed, Jan 29, 2014 at 11:57 AM, Buck Golemon b...@yelp.com wrote:
 Anne: Given that the intent is to implement exactly the whatwg spec, and the
 group is currently called whatwg (even though it may eventually become a
 historical artifact), is whatwg-1252 most appropriate?

It's up to you I suppose, but whatwg-1252 just seems like long term
it will lose its meaning. For the web windows-1252 will always have
this meaning due to deployed content, so web-windows-1252 if you
need to disambiguate from a different implementation of windows-1252
makes sense to me.


--
http://annevankesteren.nl/





Aw: Re: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2014-01-29 Thread Jörg Knappen

A little postscript to this old thread:



On PyPI, there is now a codec available that handles the peculiar definition of latin1 inside MySQL.

The package is called mysql-latin1-codec and features an encoding consisting of cp1252 plus

0x81, 0x8D, 0x8F, 0x90, 0x9D (the latter five characters are undefined in the Python codec for cp1252).




https://pypi.python.org/pypi/mysql-latin1-codec/1.0
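
Editorial sketch: the five undefined bytes can be confirmed with Python's built-in cp1252 codec alone. This assumes Python 3 and does not use the mysql-latin1-codec package; `decode_byte_lenient` is a hypothetical helper that imitates the "cp1252 plus latin-1 for the gaps" behaviour described above.

```python
UNDEFINED_IN_CP1252 = [0x81, 0x8D, 0x8F, 0x90, 0x9D]


def decode_byte_lenient(byte_value):
    """Decode one byte as cp1252, falling back to latin-1 where undefined."""
    raw = bytes([byte_value])
    try:
        return raw.decode("cp1252")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Collect every byte in 0x80..0x9F that the strict cp1252 codec rejects.
failing = []
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        failing.append(b)

print([hex(b) for b in failing])
print(repr(decode_byte_lenient(0x80)), repr(decode_byte_lenient(0x81)))
```

The fallback maps each undefined byte to the C1 control character with the same number, so every byte value decodes to something.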



--Jörg Knappen




Gesendet:Mittwoch, 30. Oktober 2013 um 19:14 Uhr
Von:Buck Golemon b...@yelp.com
An:Frédéric Grosshans frederic.grossh...@gmail.com
Cc:Jörg Knappen jknap...@web.de, unicode unicode@unicode.org
Betreff:Re: Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice




On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans frederic.grossh...@gmail.com wrote:

On 30/10/2013 17:32, Jörg Knappen wrote:

The data did not only contain latin-1 type mangling for the non-existent Windows characters, but also sequences with the raw
C1 control characters for all of latin-1. So I had to do them, too.
The data weren't consistent at all, not even in their errors.
--Jörg Knappen

Your question helped me dust off and repair a non-working Python snippet I wrote for a similar problem. I was stuck with the mixing of windows-1252 and latin1 controls (linked with Chinese characters). I include it below for reference.

The Python snippet below does not need sed; it defines a function (unscramble(S)) which works on strings. The extension to files should be easy.

  Frédéric Grosshans


def Step1Filter(S):
    for c in S:
        # Works character by character because of the cp1252/latin1 ambiguity
        try:
            yield c.encode("cp1252")
        except UnicodeEncodeError:
            # Useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)
            yield c.encode("latin1")

def unscramble(S):
    return b"".join(c for c in Step1Filter(S)).decode("utf8")

PS: If anyone is interested in a licence, I consider this simple enough to be in the public domain and uncopyrightable.
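
Editorial sketch: a usage example for the approach above, restated for Python 3 so that it runs on its own. The lower-case function names are this note's own; the two test strings are constructed here for illustration and do not come from the thread's data.

```python
def step1_filter(text):
    """Yield the byte(s) each character stood for before the mis-decoding."""
    for ch in text:
        try:
            yield ch.encode("cp1252")
        except UnicodeEncodeError:
            # Raw C1 controls, i.e. bytes that cp1252 leaves undefined.
            yield ch.encode("latin-1")


def unscramble(text):
    """Undo one round of UTF-8 misread as cp1252 or latin-1."""
    return b"".join(step1_filter(text)).decode("utf-8")


# UTF-8 for U+00C4 is C3 84; cp1252 maps 84 to a low double quotation mark.
via_cp1252 = "\u00c4".encode("utf-8").decode("cp1252")
# UTF-8 for U+00CD is C3 8D; 8D is undefined in cp1252, so latin-1 was used.
via_latin1 = "\u00cd".encode("utf-8").decode("latin-1")

print(unscramble(via_cp1252), unscramble(via_latin1))
```

A character outside both cp1252 and latin-1 makes `step1_filter` raise `UnicodeEncodeError`, so text that mixes mangled and proper UTF-8 would need to be processed in smaller pieces.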




This encoding you've implemented above is known as windows-1252 by the whatwg and all browsers [1][2].
The implementation of cp1252 in Python is instead a direct consequence of the unicode.org definition [3].



[1]http://encoding.spec.whatwg.org/index-windows-1252.txt

[2]http://bukzor.github.io/encodings/cp1252.html

[3]http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT








Aw: Commercial minus as italic variant of division sign in German and Scandinavian context

2014-01-16 Thread Jörg Knappen

The most important word in the comment on 00F7 ÷ DIVISION SIGN is "occasionally".



In fact, the occasions are so rare that you can live a whole life in Germany

without encountering one of them.



On the other hand, 00F7 ÷ DIVISION SIGN is used _frequently_ in German schoolbooks to denote ...

division (books aimed at professionals doing math prefer : (COLON) or / (SLASH) for this

purpose, but schoolbooks don't).



2052 ⁒ COMMERCIAL MINUS SIGN _always_ means subtraction and it has this shape (or the alternate shape ./.)

in all contexts, roman or italic. It is not the italic version of some other symbol.



Hope this helps,



Jörg Knappen





Gesendet:Donnerstag, 16. Januar 2014 um 04:43 Uhr
Von:Leif Halvard Silli xn--mlform-iua@mlform.no
An:unicode@unicode.org
Betreff:Commercial minus as italic variant of division sign in German and Scandinavian context

Thanks to our discussion in July 2012,[1] the Unicode code charts now
say, about 00F7 ÷ DIVISION SIGN, this:

 • occasionally used as an alternate, more visually
distinct version of 2212 − {MINUS SIGN} or 2011 ‑
{NON-BREAKING HYPHEN} in some contexts
[ snip ]
 → 2052 ⁒ commercial minus sign

However, I think it can also be added somewhere that commercial minus
is just the italic variant of "division minus". I'll hereby argue for
this based on an old German book on "commercial arithmetics" I have
come across, plus the July 2012 discussion and what Unicode
already says about the commercial sign:

FIRST: IDENTICAL CONTEXTS.

German language is an important locale for the Commercial Minus. In
German, the Commercial minus is both referred to as "kaufmännische
Minus(zeichen)" and as "buchhalterische Minus" (Commercial Minus
Character and Bookkeeper Minus). And, speaking of "division minus"
in the context I know best, Norway, we find it in advertising
(commercial context) and in book keeping documentation and taxation
forms. Simply put, what the Unicode 6.2 General Punctuation section
says about Commercial Minus can also be said about DIVISION SIGN used
as minus: "U+2052 ⁒ commercial minus sign is used in commercial or tax
related forms or publications in several European countries, including
Germany and Scandinavia." So, basically and for the most part, the
commercial minus and the "division sign minus" occur in the very same
contexts, with very much the same meaning. This is a strong hint that
they are the same character.

SECOND: GERMAN USE OF DIVISION SIGN FOR MINUS IN COMMERCIAL CONTEXT.

Is there any proof that German used both an italics variant and a
non-italics variant of the "division minus"? Seemingly yes. The book
"Kaufmännische Arithmetik" ("Commercial arithmetics") from 1825 by
Johann Philipp Schellenberg. By reading section 118 "Anhang zur
Addition und Subtraction der Brüche" [Appendix about the addition and
subtraction of fractions] at page 213 and onwards,[2] we can conclude
that he describes a commercial use of the ÷ "division minus", where
the ÷ signifies a _negative remainder_ of a division (while the plus
sign is used to signify a positive remainder). Or to quote, from page
214: "so wird das Fehlende durch das [Zei]chen ÷ (minus) bemerkt, und
bei Berechn[ung] der Preis der Waare abgezogen" [then the lacking
remainder is marked with the ÷ (minus) and withdrawn when the price of
the commodity is calculated]. {Note that some bits of the text are
lacking; I marked my guesses in square brackets.} I did not find (yet)
that he used the italic commercial minus; however, the context is
correct. (My guess is that the italics variant has been put to more
use in the computer age, partly to separate it from the DIVISION SIGN,
or maybe simply because people started to see it often in handwriting
but seldom in print, and so would not have recognized it in the form of
the non-italic division sign.)

THIRD: IDENTICAL INTERPRETATION

The word "abgezogen" in the above quote is interesting since the Code
Charts for 2052 ⁒ COMMERCIAL MINUS cite the related German word
"abzüglich". And from the Swedish context, the charts quote the
expression "med avdrag". English translation might be "to be withdrawn"
or "with subtraction/rebate [for]". Simply put, we here see the
commercial meaning.

WHAT ABOUT COMMERCIAL MINUS AS CORRECT SIGN IN SCANDINAVIAN SCHOOLS?

UNICODE 6.3 notes that in some European (e.g. Finnish, Swedish and
perhaps Norwegian) traditions, teachers use the Commercial Minus Sign
to signify that something is correct (whereas a red check mark is used
to signify error). If my theory is right, that commercial minus and
"division sign minus" are the same signs, how on earth is that possible?
How can a minus sign count as positive for the student?

The answer is, I think, to be found in the Code Charts' Swedish
description ("med avdrag"/"with subtraction/rebate"). Because, I think
that the correct understanding is not that it means "correct" or "OK".
Rather, it denotes something that is counted in the customer's/student's
favor. So, you could say it really means "slack", or "rebate". So
it really means

Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Jörg Knappen

Thanks again!



My updated sed pattern generator now looks like:






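# Python 2 (uses unichr and the "print >>file" statement)
r = range(0xa0, 0x170)
file = open("fixu8.sed", "w")
for i in r:
    # UTF-8 bytes misread as latin-1 and re-encoded as UTF-8
    pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
    print >>file, pat1
    try:
        # Same, but misread as windows-1252; fails for its undefined bytes
        pat2 = "s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
    except UnicodeDecodeError:
        pat2 = pat1
    if (pat1 != pat2):
        print >>file, pat2
file.close()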



doing both latin-1 and windows-1252 mangled double utf-8. This is probably enough for now, the rate of errors is low

enough for practical purposes (i.e., lower than the natural error rate introduced by typing errors)
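
Editorial sketch: the same rule generator restated for Python 3, returning the rules as a list instead of writing a file so that the result can be inspected. The function name `sed_rules` is hypothetical; the range 0xA0 to 0x16F follows the script above.

```python
def sed_rules(first=0xA0, last=0x170):
    """Build sed substitutions that map mangled sequences back to one character."""
    rules = []
    for code_point in range(first, last):
        ch = chr(code_point)
        utf8 = ch.encode("utf-8")
        for wrong_codec in ("latin-1", "cp1252"):
            try:
                mangled = utf8.decode(wrong_codec)
            except UnicodeDecodeError:
                # cp1252 leaves 81, 8D, 8F, 90, 9D undefined.
                continue
            rule = "s/{}/{}/g".format(mangled, ch)
            if rule not in rules:
                rules.append(rule)
    return rules


rules = sed_rules()
print(len(rules))
print(rules[0])
```

Writing `"\n".join(rules)` to a UTF-8 file gives a script usable with `sed -f`. All mangled characters are at or above U+0080, so none of them collides with the `/` delimiter or with regex metacharacters.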



--Jörg Knappen




Gesendet:Mittwoch, 30. Oktober 2013 um 15:34 Uhr
Von:Frédéric Grosshans frederic.grossh...@gmail.com
An:unicode@unicode.org
Betreff:Re: Aw: Re: Re: Do you know a tool to decode UTF-8 twice

On 29/10/2013 17:15, Jörg Knappen wrote:
 After running this script, a few more things were there:
 Non-normalised accents and some really strange
 encodings I could not really explain but rather guess their meanings, like
 s///g
 s///g
 s/A//g
 s/a//g
 s/E//g
 s/e//g
 s///g
 s///g
 s///g
 s///g
 s///g

It was probably not UTF-8 read as latin-1 and re-encoded in UTF-8, but
UTF-8 read as Windows-1252 (
http://en.wikipedia.org/wiki/Windows-1252 ) and re-encoded as UTF-8. Each
of the combinations above contains a character absent in latin-1
(), and some of them are only present in Windows-1252 () and
not in ISO 8859-15 (Latin-9), the other possible mistake.

I've checked that this is consistent with   and  but not with your .
This double encoding would give Ã„:
Ã„ = Win1252(C3 84) = 110.00011 10.000100 = UTF8(00011 000100) = Unicode 00C4
= Ä (and not )

Frédéric
Frdric








Aw: Re: Re: Re: Re: Do you know a tool to decode UTF-8 twice

2013-10-30 Thread Jörg Knappen

The data did not only contain latin-1 type mangling for the non-existent Windows characters, but also sequences with the raw

C1 control characters for all of latin-1. So I had to do them, too.



The data werent consistent at all, not even in their errors.



--Jörg Knappen



Gesendet:Mittwoch, 30. Oktober 2013 um 16:58 Uhr
Von:Frédéric Grosshans frederic.grossh...@gmail.com
An:Jörg Knappen jknap...@web.de
Cc:unicode@unicode.org
Betreff:Re: Aw: Re: Re: Re: Do you know a tool to decode UTF-8 twice

On 30/10/2013 16:13, Jörg Knappen wrote:
 Thanks again!
 My updated sed pattern generator now looks like:
 r = range(0xa0, 0x170)
 file = open("fixu8.sed", "w")
 for i in r:
     pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
     print >>file, pat1
     try:
         pat2 = "s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
     except:
         pat2 = pat1
     if (pat1 != pat2):
         print >>file, pat2
 doing both latin-1 and windows-1252 mangled double utf-8. This is
 probably enough for now, the rate of errors is low
 enough for practical purposes (i.e., lower than the natural error rate
 introduced by typing errors)

Why do you do both latin1 and windows-1252? Windows-1252 is supposed to
be a superset of latin1, so it should be enough. Or is there a problem
with the few undefined bytes of windows-1252 (81, 8D, 8F, 90, 9D)?


Frédéric







Do you know a tool to decode UTF-8 twice

2013-10-28 Thread Jörg Knappen
I have a database with broken encoding, containing a lot of UTF-8 twice

(that infamous encoding that arises when UTF-8 is interpreted as latin-1 and

converted to UTF-8 again) encoding besides ASCII and UTF-8 proper.



Is there a ready made tool that decodes UTF-8 twice while keeping UTF-8 proper in place?
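
Editorial sketch: how "UTF-8 twice" arises, and how one round of it is reversed, shown in Python 3. The function names are hypothetical, and latin-1 is assumed as the wrong codec, as in the description above.

```python
def double_encode(text, wrong_codec="latin-1"):
    """Simulate the accident: UTF-8 bytes misread as wrong_codec, then re-encoded."""
    return text.encode("utf-8").decode(wrong_codec).encode("utf-8")


def undo_double_encode(data, wrong_codec="latin-1"):
    """Reverse one round of double encoding; raises if data was not mangled this way."""
    return data.decode("utf-8").encode(wrong_codec).decode("utf-8")


original = "J\u00f6rg"
broken = double_encode(original)

print(broken)
print(undo_double_encode(broken) == original)
```

Each non-ASCII character grows from two bytes to four, while ASCII passes through unchanged, which is why the damage is easy to overlook in mostly-ASCII data.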



--Jörg Knappen



Aw: Re: Do you know a tool to decode UTF-8 twice

2013-10-28 Thread Jörg Knappen

Hi Steffen,



the data aren't that easy. There are non-latin1 characters encoded in the UTF-8 part. I expect

among others typographic apostrophes, Polish characters, some mediaevalist characters like

ũ (u with tilde). Maybe there is also some Greek inside, but I am not sure about that.



--Jörg Knappen



Gesendet:Montag, 28. Oktober 2013 um 12:34 Uhr
Von:Steffen Daode Nurpmeso sdao...@gmail.com
An:Jörg Knappen jknap...@web.de
Cc:unicode@unicode.org
Betreff:Re: Do you know a tool to decode UTF-8 twice

Jörg Knappen jknap...@web.de wrote:
 Is there a ready made tool that decodes UTF-8 twice while keeping
 UTF-8 proper in place?

Isn't a shell script with a truly validating iconv(1) enough?
This works for me if in utf8.1 there is EI in UTF-8 and I run

?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2

As in

for i in utf8.1 utf8.2; do
    if iconv -f utf8 -t latin1 < ${i} |
            iconv -f utf8 -t utf8 >/dev/null 2>&1; then
        echo "${i}: bummer, going home by one"
        iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
    else
        echo "${i}: valid UTF-8"
    fi
done

will end up as

?0[steffen@sherwood tmp]$ sh utf8dec.sh
utf8.1: valid UTF-8
utf8.2: bummer, going home by one
?0[steffen@sherwood tmp]$
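
Editorial sketch: roughly the same test expressed in Python 3, applied to one string at a time. `fix_if_double_encoded` is a hypothetical helper; unlike the shell loop, it reports pure ASCII as unchanged rather than as doubly encoded.

```python
def fix_if_double_encoded(text):
    """Return (text, was_double).

    If the text fits into latin-1 and the resulting bytes are valid UTF-8,
    it is taken to be UTF-8 that was encoded twice and is decoded once more.
    """
    try:
        candidate = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text, False
    return candidate, candidate != text


for sample in ["J\u00c3\u00b6rg", "J\u00f6rg", "\u0141\u00f3d\u017a", "plain"]:
    print(fix_if_double_encoded(sample))
```

The check is a heuristic: a string that legitimately contains a sequence such as "Ã¶" would be rewritten too, and input mangled through cp1252 rather than latin-1 is left untouched.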

Ciao,

 --Jörg Knappen

--steffen






Aw: Re: symbols/codepoints for necessity and possibility in modal logic

2013-07-19 Thread Jörg Knappen

I think,



U+25C7 WHITE DIAMOND

is the best choice, followed by

U+27E1 WHITE CONCAVE-SIDED DIAMOND = never (modal operator)



The latter has a fancier shape and might not be the one the reader expects. As a plus, it also comes with versions having right and left ticks, needed in some extensions of modal logic. I couldn't locate WHITE DIAMOND WITH LEFTWARDS TICK in Unicode.



(U+2662 WHITE DIAMOND SUIT would also look OK, but I think this is symbol abuse. Can be used as a fallback when the font of choice has this one, but none of the two above.)



For the properties of mathematical symbols, see also

http://www.unicode.org/reports/tr25/


---but I have to admit that the report does not answer the specific question posed here.



Maybe this mapping table is more useful (but harder to read):

http://www.w3.org/Math/characters/unicode.xml



--Jörg Knappen



P.S. I'd consider U+22C4 DIAMOND OPERATOR as wrong because it is used as a binary operator which has a very different

spacing than the unary modal operator needed here.
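
Editorial sketch: the four candidates discussed in this message can be listed with their character names using Python 3's `unicodedata` module. The function name `describe` is this note's own, and the output depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Candidates in the order discussed above.
CANDIDATES = ["\u25C7", "\u27E1", "\u2662", "\u22C4"]


def describe(chars):
    """Return (code point label, character name) pairs for the given characters."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in chars]


for label, name in describe(CANDIDATES):
    print(label, name)
```

`unicodedata.category(ch)` can be printed alongside to see which candidates are classified as math symbols, one possible hint for the operator-spacing concern in the P.S.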




Gesendet:Freitag, 19. Juli 2013 um 09:43 Uhr
Von:Stephan Stiller stephan.stil...@gmail.com
An:Unicode Public unicode@unicode.org
Betreff:Re: symbols/codepoints for necessity and possibility in modal logic




What is wrong with using DIAMOND OPERATOR?

"Wrong" is strong wording and goes beyond what I suggested or implied, but it's not clear to a user of Unicode that it's the right fit either. There are a couple of indicators factoring in:

* The charts mention modal logic in conjunction with ◻ (U+25FB) and ⟠ (U+27E0) but not with ⋄ (U+22C4).
* The glyph in the code charts is tiny (and that of Cambria Math is tiny as well). Typographically you see various things (a lozenge, fallback to letter-M) in esp. older books, but it feels like it's meant to be an orthogonal diamond of perhaps slightly less area than the box but extending a little above and below the box, which is somewhat taller than x-height. The book by {Blackburn, de Rijke, Venema} has glyphs that look right. This is more than a guess: it makes sense if they have similar visual weight, as they are, literally, defined to be duals of one another; but whether you can make them geometrically congruent symbols of equal area I haven't tested (this might have the diamond ascend too far).
* The vague notion of "operator" (a word with different meanings in math, from logical relation to [non-logical/non-relational] mapping of type A×A→A or perhaps A×A→B to (linear) map (between, say, vector spaces) in linear algebra) in this context (in the code charts) seems to refer to something like my middle meaning, which is likely to use a smaller symbol around x-height in placement and dimensions.
* The glyph of ⬦ (U+2B26) seems to have a more appropriate name, but in the charts I like ◇ U+25C7. The differently sized square-like symbols are hard to semantically tell apart in/from the charts anyway.
* These symbols are the first two visually distinct ones you define in modal logic, so they're well-known and standardized in meaning for anyone who has had contact with the field. It's surprising they're not explicitly named in the charts. (There's stuff like the outdated horseshoe for logical implication popping up in the relevant books, but that is a leftover of outdated logic notation in general.) So for box and diamond it's quite reasonable to expect a standard math font to provide them just right out of the box; for whatever commonly used box-like symbols in math there are, one would assume that there are corresponding codepoints; otherwise you'd have to choose a different font.



Stephan









Aw: Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-06-21 Thread Jörg Knappen
My opinion on the cedilla mess is the following:



* Preemptively add LATIN [CAPITAL|LOWERCASE] LETTER * WITH CEDILLA ATTACHED for every Latvian/Livonian character currently in Unicode. (Don't use terms like MARSHALLESE [CAPITAL|LOWERCASE] LETTER [M|N] -- such entities don't exist from a character encoding point of view.)



* Declare the list of exceptions to cedilla rendering officially closed. Whenever another such thing (say, LATIN CAPITAL LETTER P WITH COMMA BELOW / LATIN LOWERCASE LETTER P WITH TURNED COMMA ABOVE) occurs in real life, it will be encoded as ... WITH COMMA BELOW.



* Font design is an entirely different field. Original German font designs differ from French or Anglo-American ones in several aspects, and original Marshallese font designs will be different, too. I see no problem here. I doubt that "one size fits all" is the right way to tackle font design.



--Jörg Knappen



Aw: Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-06-21 Thread Jörg Knappen
Michael Everson wrote:

 My opinion on the cedilla mess is the following:

 * Add preemptively LATIN [CAPITAL|LOWERCASE] LETTER * WITH CEDILLA ATTACHED 
 for every Latvian/Livonian character currently in Unicode.

 Why? Latvian and Livonian don't use letters with proper cedilla attached.

Maybe my English wasn't perfect here; of course I think that for writing 
Latvian the existing characters shall be used. I meant "for" in the sense of 
a "foreach" or "for" loop in programming languages. And yes, I mean not only the 
four characters required for Marshallese, but also the other ones (g, k, and r).

 (Don't use terms like MARSHALLESE [CAPITAL|LOWERCASE] LETTER [M|N] -- such 
 entities don't exist from a character encoding point of view.)

Yes they do. Cf. U+0406 CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I. The 
character name exists to distinguish it from other characters and to guide the 
user in the character's use.

But that character exists as a base letter with a distinct shape. There is no 
distinct base letter Marshallese m or n.

 * Declare the list of exceptions to Cedilla rendering officially closed. 
 Whenever another such thing (say, LATIN CAPITAL LETTER P WITH COMMA BELOW / 
 LATIN LOWERCASE LETTER P WITH TURNED COMMA ABOVE) occurs in real life, it 
 will be encoded ... WITH COMMA BELOW.

 I think that is understood, but where would you declare this?

In the explanatory notes in the introduction to the standard. I don't have the 
book here to suggest a more exact location at the moment.

--Jörg Knappen

Michael Everson * http://www.evertype.com/


 




Aw: Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-06-21 Thread Jörg Knappen
Michael Everson wrote:

 * Add preemptively LATIN [CAPITAL|LOWERCASE] LETTER * WITH CEDILLA 
 ATTACHED for every Latvian/Livonian character currently in Unicode.

 Why? Latvian and Livonian don't use letters with proper cedilla attached.

 Maybe my English wasn't perfect here; of course I think that for writing 
 Latvian the existing characters shall be used. I meant "for" in the sense of 
 a "foreach" or "for" loop in programming languages.

I have no idea what that means. You want to add a bunch of new non-decomposed 
characters with a proper cedilla… why?

 And yes, I think not only the four characters required for Marshallese, but 
 also the other ones (g, k, and r).

Why?

The first reason is to solve this problem completely and not only to resolve a 
Latvian-Marshallese conflict and leave some other exceptions for later.

The second reason is that the letters g, k, l, m, r with proper cedillas are 
currently not encodable in Unicode (because of the Latvian exceptions and 
canonical composition/decomposition), but they should *obviously* be encodable.
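
The encodability problem can be illustrated with Python's standard unicodedata module. This is a minimal sketch that uses only the existing characters U+013C LATIN SMALL LETTER L WITH CEDILLA and U+0327 COMBINING CEDILLA: because the precomposed letter decomposes canonically to l + U+0327, both spellings are equivalent, so no distinct spelling is left over for an l with a true cedilla.

```python
import unicodedata

precomposed = "\u013C"   # LATIN SMALL LETTER L WITH CEDILLA (Latvian comma-below shape)
sequence = "l\u0327"     # l followed by COMBINING CEDILLA

# The canonical decomposition of the precomposed letter is exactly the sequence.
print(unicodedata.decomposition(precomposed))                   # 006C 0327

# Normalization folds the sequence back into the precomposed letter ...
print(unicodedata.normalize("NFC", sequence) == precomposed)    # True

# ... whereas m has no precomposed form and stays a two-code-point sequence.
print(unicodedata.normalize("NFC", "m\u0327") == "m\u0327")     # True
```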

 (Don't use terms like MARSHALLESE [CAPITAL|LOWERCASE] LETTER [M|N] -- such 
 entities don't exist from a character encoding point of view.)

 Yes they do. Cf. U+0406 CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I. 
 The character name exists to distinguish it from other characters and to 
 guide the user in the character's use.

 But that character exists as a base letter with a distinct shape. There is 
 no distinct base letter Marshallese m or n.

There is no decomposition. There is no base character + diacritic. The whole 
thing is a letter used in Marshallese. (It's just a name.)

Although there is the famous Goethe quote "Namen sind Schall und Rauch" 
("names are sound and smoke"), I think good naming style matters, and I prefer 
the descriptive style LATIN CAPITAL LETTER L WITH PROPER CEDILLA (Marshallese) 
to the ad-hoc style LATIN CAPITAL LETTER MARSHALLESE LETTER L WITH CEDILLA. 
But this is a question of style and can be debated endlessly without consensus.

--Jörg Knappen

Michael Everson * http://www.evertype.com/


 




Aw: Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-06-21 Thread Jörg Knappen
Dominikus Dittes Scherkl wrote:

Why not instead encode a new combining MARSHALLESE CEDILLA
that ought to be used with g, k, l, m, r and their uppercase counterparts?

This is not a good idea, because the combining MARSHALLESE CEDILLA can be 
combined with the letter C, too.
This creates all kinds of havoc with the Ç (including fake internationalised 
domain names). The remaining letters
with cedilla need to be precomposed and non-decomposable.

--Jörg Knappen




Aw: Re: Missing geometric shapes

2012-11-08 Thread Jörg Knappen

Also, the asymmetric geometric shapes don't have the mirror property (it is restricted to parentheses and mathematical operators). That's the reason why I have proposed two characters instead of only one. Adding the mirror property to the bicolor star only would violate the principle of least surprise.

--Jörg Knappen

Sent: Thursday, 8 November 2012, 11:15
From: Michael Everson ever...@evertype.com
To: Unicode Discussion unicode@unicode.org
Subject: Re: Missing geometric shapes


On 8 Nov 2012, at 09:59, Simon Montagu smont...@smontagu.org wrote:

 Please take into account that the half-stars should be symmetric-swapped in RTL text. I attach an example from an advertisement for a movie published in Haaretz, 2 November 2012.

I don't think Geometric Shapes have the mirror property. 

2605;BLACK STAR;So;0;ON;N;
2606;WHITE STAR;So;0;ON;N;

In a Hebrew context you'd just choose the star you wanted (black-white vs white-black) and use it. 
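
The mirror property quoted above can be checked with Python's standard unicodedata module. This is a minimal sketch; the helper name is_mirrored is made up for illustration, and the parenthesis and the less-than-or-equal sign are included only as examples of characters that do carry the property.

```python
import unicodedata


def is_mirrored(ch: str) -> bool:
    """True if the character has the Bidi_Mirrored property."""
    return unicodedata.mirrored(ch) == 1


for ch in "\u2605\u2606(\u2264":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: mirrored={is_mirrored(ch)}")
```

The two stars report False, the parenthesis and the relation sign report True, which matches the final N field in the data lines quoted above.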

Michael Everson * http://www.evertype.com/









Missing geometric shapes

2012-11-06 Thread Jörg Knappen
Hi,

after a long time of absence I drop in again. The reason is that I was just trying to show the rating on a webpage using the popular scale of 1 to 5 stars, including half-coloured stars, using only Unicode characters.

But: There is no character BLACK AND WHITE STAR in Unicode yet. So should the following two characters be added to the Geometric Shapes block:

BLACK AND WHITE STAR
WHITE AND BLACK STAR?

For the purpose I have in mind, it is not really crucial whether the stars (five-pointed, of course) are divided vertically or diagonally; I suggest vertical division as the standard representation.

For an example of use look here: http://xkcd.com/1098/

--Jörg Knappen



Re: Sample of german -burg abbreviature

2004-10-02 Thread Jörg Knappen
Michael Everson wrote:

 I assumed that the curly thing used over the letter u in German 
 handwriting was a breve (not a combining u superimposed over a u), 
 and so in these examples though the u is deleted, its breve is not.

I agree with Michael that the thing is a breve -- however, with an unusual
placement.

To me, there are three resolutions to the burg-abbreviature problem:

1) Add one new character, ZERO WIDTH INVISIBLE LETTER, to the UCS. Encode
   the burg-abbreviature as
   b <zwil> <comb. breve above> g

2) Add one new character, COMBINING RIGHT SHIFTED BREVE ABOVE, to the UCS.
   Encode the burg-abbreviature as
   b <comb. right shifted breve above> g

3) Add two new characters, LATIN SMALL ABBREVIATURE BURG and
   LATIN CAPITAL ABBREVIATURE BURG, to the UCS. Then the 
   burg-abbreviature is one Unicode character.

[Note: The burg-abbreviature can occur in an all-caps context with the 
breve placed in the middle between capital B and G.]

I strongly prefer solution 1 because it is fully general with a minimum of
effort added. It can also handle TeX's tie accent.

TeX's tie accent is an inverted right shifted breve above -- that's how it 
is implemented in TeX and METAFONT by Donald Knuth. It has the width of a
normal accent, but the glyph hangs out of its bounding box such that it is 
placed between two letters. The thing is used in some transliterations of 
Russian, where the letter "ya" is transcribed as \t{\i a}, i.e. an 
inverted breve placed between a dotless i (\i) and a. A sample can be 
found in Donald E. Knuth, The TeXbook.

Solution 2) is also a good one and it can be extended easily to the case 
of TeX's tie accent by adding a second character, COMBINING RIGHT SHIFTED 
INVERTED BREVE ABOVE, to the UCS.

Solution 3) is ad hoc and will probably open the door for dozens of other
candidates (like the tied ia).

--Jorg Knappen

P.S. The thing in the burg-abbreviature is clearly *not* a raised u: a 
raised small u has a right stem which I have never seen in the burg 
abbreviature. The breve is a mnemonic hint to the u, since it was once 
obligatory to mark all u's with a breve in German handwriting (Suetterlin)
-- and it is still widespread practice. 




Sample of german -burg abbreviature

2004-09-26 Thread Jörg Knappen
I have scanned a sample of the German -burg abbreviature. It is from
Diercke Weltatlas, 165. Auflage, Georg Westermann Verlag, 
Braunschweig 1972, map page 14.

In the north you can find two times the -berg abbreviature in Herrenbg. 
[Herrenberg] and Brombg. [Bromberg]. SW from Tuebingen you find 
Rottenb[U]g. [Rottenburg] and south of it there's Weilerb[U]g. 
[Weilerburg].

Note the fancy semi-cyrillic shape of the breve between the letters 
b and g -- it is quite typical for this cartographic font. I don't know
what they do with a true breve (like in Romanian) since this atlas
transcribes all names into German.

The symbol fans may also note the circle with upright flag beside
Hohenentringen and Roseck (denoting a castle) and the circle with 
slanted flag (denoting the ruins of a castle) beside Weilerburg.
IMHO, the set of cartographic symbols is another one to be checked against 
Unicode.

--Jorg Knappen



Re: Sample of german -burg abbreviature

2004-09-26 Thread Jörg Knappen
On Sun, 26 Sep 2004, Adam Twardoch wrote:

 From: Jörg Knappen [EMAIL PROTECTED]
 
  I have scanned a sample of the German -burg abbreviature. It is from
  Diercke Weltatlas, 165. Auflage, Georg Westermann Verlag,
  Braunschweig 1972, map page 14.
 
 Very interesting! It would be even more interesting if you told us the URL
 so we can actually look at it! :)

Oh ... the locator is

http://www.uni-mainz.de/~knappen/diercke.jpg

--Jorg Knappen





RE: Saudi-Arabian Copyright sign

2004-09-21 Thread Jörg Knappen
Michael Everson wrote:

 At 13:07 -0700 2004-09-20, Kenneth Whistler wrote:
 
 ARABIC HAH COPYRIGHT SIGN
* used in Saudi Arabia
 
 or even:
 
 CIRCLED ARABIC LETTER HAH
* a copyright sign used in Saudi Arabia

Both naming suggestions are fine with me. An aside: The Arabic word for 
"right" is "haqq" -- starting with the letter in the circle.

 The second would be better. And is the circled C used in Saudi Arabia 
 for the copyright used as well?

I will try to gather more information at the Frankfurt Book Fair 
(beginning of October), where the Arab world is guest of honour.

--Jorg Knappen




Re: Saudi-Arabian Copyright sign

2004-09-20 Thread Jörg Knappen
Doug Ewell wrote:

 I'm not aware of any, but I see this U+20DD solution mentioned from time
 to time, as though it were a well-known alternative to encoding things
 like Warenzeichen or Geschützte Sorte.

I see a precedent in Unicode for treating copyright-like signs differently from 
simple encircled letters:

Unicode takes precautions not to encode the same character twice. 
Therefore, superscript digits 2 and 3 are absent from the superscript 
block U+2070 ff. 

However, the block Enclosed Alphanumerics (U+2460 ff.) includes encircled 
capital Latin letters C, P, and R in addition to the copyright-like signs 
encoded elsewhere.
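
Both observations can be checked against the character database shipped with Python. This is a minimal sketch using the standard unicodedata module; the helper name decomp is made up for illustration.

```python
import unicodedata


def decomp(cp: int) -> str:
    """Decomposition mapping from the character database ('' if there is none)."""
    return unicodedata.decomposition(chr(cp))


# Superscript two and three live in Latin-1, not in the block starting at U+2070.
for cp in (0x00B2, 0x00B3, 0x2070, 0x2074):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

# Each circled letter is paired with the legal sign of similar shape.
# The circled letters carry a <circle> compatibility mapping to the plain
# letter; the legal signs have no decomposition at all.
for cp in (0x24B8, 0x00A9, 0x24C5, 0x2117, 0x24C7, 0x00AE):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))} -> {decomp(cp)!r}")
```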

--Jorg Knappen





Saudi-Arabian Copyright sign

2004-09-19 Thread Jörg Knappen
Scanning through some Arabic books, the following sign attracted my 
attention:

It looks like ARABIC LETTER HAH (isolated form) in a circle. It 
obviously denotes copyright. It is used consistently in books
printed in Saudi Arabia, but I have never seen it in a book from any other
country (including Yemen, UAE, Bahrain, Kuwait, Jordan, Egypt, Libya
and Morocco).

Therefore I suggest the name SAUDI-ARABIAN COPYRIGHT SIGN for this one.
Since the block for letterlike symbols is already almost full, but there 
are gaps in the primary Arabic block (U+0600-U+06FF), it is IMO well placed 
there.

For a sample, see http://www.uni-mainz.de/~knappen/saudi.gif

--Jorg Knappen

P.S. Other interesting symbols on my home page:

WARENZEICHEN (encircled Wz) http://www.uni-mainz.de/~knappen/fremd_p7.jpg
 and  http://www.uni-mainz.de/~knappen/frem_p17.jpg
GESCHUETZTE SORTE (encircled S, like REGISTERED)
  http://www.uni-mainz.de/~knappen/gp_p159.jpg
  http://www.uni-mainz.de/~knappen/gp_p159a.jpg
  http://www.uni-mainz.de/~knappen/asi_p58.jpg
Some phonetic symbols with strikethrough:
  http://www.uni-mainz.de/~knappen/fremd_p8.jpg




RE: Saudi-Arabian Copyright sign

2004-09-19 Thread Jörg Knappen
On Sun, 19 Sep 2004, Jon Hanna wrote:

  For a sample, see http://www.uni-mainz.de/~knappen/saudi.gif
 
 Looks like {U+062D, U+20DD}

Yes, it does look like that. But it forms a separate entity, just like its
precedents COPYRIGHT SIGN or SOUND RECORDING COPYRIGHT SIGN or REGISTERED.

GESCHUETZTE SORTE is a letterlike symbol of the same kind.
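
For reference, the two-code-point spelling suggested in the quoted reply can be inspected as follows. This is a minimal sketch using Python's standard unicodedata module; it only shows what the sequence consists of and takes no side on whether a separate character is warranted.

```python
import unicodedata

sequence = "\u062D\u20DD"   # the pair {U+062D, U+20DD} from the quoted reply

for ch in sequence:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]")

# Normalization keeps the pair as two code points; there is no
# precomposed character it could be folded into.
print(len(unicodedata.normalize("NFC", sequence)))
```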

--Jorg Knappen




Re: Questions about diacritics

2004-09-14 Thread Jörg Knappen

In LaTeX2e with the Cork encoding (for TeXnicians: \usepackage[T1]{fontenc})
there is a so-called "compound word mark". It has the functions of
the ZERO WIDTH NON-JOINER in the UCS: It breaks ligatures, and it can be used
to produce a final s in the middle of a word.

By design, it has zero width but x height. So it can be used to carry 
accents to be placed in the middle between two characters.

My classic for this situation is the German -burg abbreviature often seen
in cartography: It is -bg. with breve between b and g. The abbreviature 
-bg. without accent means -berg.
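
On the UCS side, the ligature-breaking function mentioned above looks like this in Python. This is a minimal sketch: the helper name break_ligature and the example word "Auflage" (where an fl ligature across the boundary Auf|lage is unwanted) are assumptions chosen for illustration, and whether a ligature is actually suppressed depends on the font and the renderer.

```python
import unicodedata

ZWNJ = "\u200C"


def break_ligature(word: str, position: int) -> str:
    """Insert ZERO WIDTH NON-JOINER before the character at index `position`."""
    return word[:position] + ZWNJ + word[position:]


marked = break_ligature("Auflage", 3)   # Auf + ZWNJ + lage
print(unicodedata.name(ZWNJ), unicodedata.category(ZWNJ))
print([f"U+{ord(c):04X}" for c in marked])
```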

--Jorg Knappen



Re: Public Review Issues Update

2004-08-27 Thread Jörg Knappen

40 Encoding of Latin Capital and Small Letter At

LATIN CAPITAL LETTER AT and LATIN SMALL LETTER AT are used as orthographic  
characters in the Koalib language of Sudan. Although similar in appearance  
to COMMERCIAL AT, LATIN SMALL LETTER AT should have different character  
properties. The main concern is the similarity in appearance of LATIN 
SMALL LETTER AT to COMMERCIAL AT. There are potential implications for 
Internet  protocols that use @.

I have read the short proposal and my answer is YES, of course, the UTC 
should accept these two characters. Keeping LATIN SMALL LETTER AT and
COMMERCIAL AT separate will keep internet protocols sane. Unifying
the two could cause damage depending on the locale (imagine
@ being capitalized and mapped to something strange ...).
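
The difference in character properties can be seen in Python. This is a minimal sketch using the standard unicodedata module; it shows that COMMERCIAL AT is punctuation without any case mapping, whereas a letter such as a is cased. A unified character could not behave both ways.

```python
import unicodedata

for ch in "@a":
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"category={unicodedata.category(ch)}, "
        f"upper={ch.upper()!r}, isalpha={ch.isalpha()}"
    )
```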

--Jorg Knappen




Re: Script l (U+2113)

2004-08-23 Thread Jörg Knappen
On Mon, 23 Aug 2004, Kevin Brown wrote:

 I've just noticed that the script l character (U+2113) is one of only 
 two apparently mandatory characters (the other being estimated U+212E) 
 included in addition to the MacOS Roman character set in a collection of 
 recently released Linotype fonts.
 
 Is there any other common usage for U+2113 apart from as the liter/litre 
 symbol that would explain its apparently mandatory inclusion in these 
 fonts?

It is used as a mathematical symbol. It started out as a way to make the letter l 
visibly distinct from the digit 1, but has taken on a life of its own since then.
 
 Also, does this symbol usually occur in only one style/weight, namely 
 italic regular?  Or does it also appear in upright regular, upright bold, 
 and italic bold depending on the typographic context?

I have never seen anything but italic regular in serious use, but TeX also
has a bold italic version of it available, and because it is easily 
available someone will have found a clever use for it.
 
--Jorg Knappen




Re: Mystery of Circled S solved

2004-08-23 Thread Jörg Knappen
On Mon, 23 Aug 2004, Anto'nio Martins-Tuva'lkin wrote:

 
  It is identified as a letterlike symbol still missing from Unicode:
  GESCHUETZTE SORTE
  looks like: S in a circle
 
 U+0053 U+20DD looks very good when set in Code2000.

But it isn't GESCHUETZTE SORTE in its specific meaning. Neither is
U+24C8. The difference is the same as that between 
U+0052 U+20DD or U+24C7 and U+00AE REGISTERED SIGN. GESCHUETZTE SORTE
belongs to a class of special characters with a legal meaning (like 
COPYRIGHT SIGN and SOUND RECORDING COPYRIGHT SIGN, to name two others of 
this class).
 
  http://www.uni-mainz.de/~knappen/gp_p159.jpg
 
 Hm, the bug in the bottom line could be also included -- it would be
 of great use in computer programming literature. (Nah, two great
 alternatives are already encoded: U+2F8D and U+BD81... ;-)

Indeed -- but this is another topic. There are about a dozen common 
gardening symbols that have been used in German publications for decades 
now, and they are worth a proposal.

--Jorg Knappen 




Mystery of Circled S solved

2004-08-22 Thread Jörg Knappen
Dear Unicoders, hello Barbara,

I finally solved the mystery of the circled S which has found its way into 
the AMS math fonts. It is identified as a letterlike symbol still missing 
from Unicode:

GESCHUETZTE SORTE

looks like: S in a circle

meaning: A protected crop variety (there is a special protection
of crop varieties in Germany and now also in the EU. In German it is called
"Sortenschutz" and the registration agency is the Bundessortenamt)

usage: current, in mail order garden catalogues. It is often used together
with the registered sign.

In the following links you can see it:
http://www.uni-mainz.de/~knappen/asi_p58.jpg 
(from the catalogue of Ahrens + Sieberz, Spring 2004, page 58)
http://www.uni-mainz.de/~knappen/gp_p159.jpg
http://www.uni-mainz.de/~knappen/gp_p159a.jpg
(from the catalogue of Gaertner Poetschke, Wundervolle Gartenwelt, Autumn 
2003)

The latter example shows a variety which is both GESCHUETZTE SORTE and 
REGISTERED.

For the design, I suggest to use a non-superscript version, following the
design of the registered sign. 

How it came to be included in the AMS fonts is still a mystery, since no 
mathematical use of it is known to me.

Yours,

Jorg Knappen





Some letters with strikethrough

2004-08-22 Thread Jörg Knappen
To my surprise I saw that 
LATIN SMALL LETTER TH WITH STRIKETHROUGH
has been accepted into Unicode.

There are four more letters of the same type used in a popular German
phonetic transcription, namely

LATIN SMALL LETTER CH WITH STRIKETHROUGH
LATIN SMALL LETTER DH WITH STRIKETHROUGH
LATIN SMALL LETTER NG WITH STRIKETHROUGH
LATIN SMALL LETTER SCH WITH STRIKETHROUGH

This list is complete.

For a reference see

http://www.uni-mainz.de/~knappen/fremd_p8.jpg

where this phonetic alphabet is explained. The scan is from

Der Kleine Duden Fremdwoerterbuch
3. Auflage 1991
Dudenverlag Mannheim Wien Zuerich

Pages 8 and 9

Yours,

Jorg Knappen



Warenzeichen

2004-08-02 Thread Jörg Knappen
The following letterlike symbol is still missing from Unicode:

Warenzeichen

looks like: encircled Wz -- the circle is actually an ellipse.
Usage: In German dictionaries and encyclopedias it is used to denote words 
which are registered trademarks of some owner. This usage is still flourishing.

I remember that the symbol was also used in advertisements once, but it
has been replaced by the more fashionable REGISTERED SIGN or TRADE MARK SIGN
in this field.

The symbol occurs in the private use area of MS Reference Sans Serif
and MS Reference Serif in the PUA position U+F7BF.

--Jorg Knappen




Looking for transcription or transliteration standards latin-arabic

2004-06-30 Thread Jörg Knappen
Are there standards for transcribing or transliterating western languages 
written in Latin script into Arabic script? I am specifically interested in 
German-Arabic, but English-Arabic and French-Arabic are of interest, 
too.

--Jorg Knappen




Filzlaus

2004-06-04 Thread Jörg Knappen

Browsing through my printed version of Unicode 4.0 I discovered the following
annotations to U+00A4 (the currency sign):

Filzlaus, Ricardi-Sonne (German names)

May I request that those names be dropped? Both are jargon at best and not 
widespread. "Filzlaus" means crab louse, and it is well known (to 
German speakers) where those animals live and how they spread.

Ricardi-Sonne is probably misspelled (should be Ricardo-Sonne, alluding to 
the famous economist David Ricardo). The name is not in wide use anyway.
Even Sputnik is better known...

--Jorg Knappen



Re: Game pieces proposal

2004-06-02 Thread Jörg Knappen
Antonio Martins-Tuvalkin wrote:

 And the special sign printed on Joker cards (a five pointed star in
 a circle)

 Would it be suitable to use U+235F and/or U+272A?

Hmm... U+235F has the right graphical representation, but as a character
specific to the APL programming language it is probably unsuitable.

U+272A has a different look, it does not fit.

--Jorg Knappen





Re: Game pieces proposal

2004-06-02 Thread Jörg Knappen
Antonio Martins-Tuvalkin wrote:

Hm, au contraire. Michael's quote above hints precisely that the goal
of encoding cards as separate individual characters is to overcome
that handicap.

Unfortunately, this does not reflect the literature about card games.
The literature (and almost every German local daily has a weekly column 
about Skat) just uses the rather simple notation <heartsuit><letter K>
to denote the King of Hearts, if it does not resort to "Herz-K" at all.
The same is true for web sites about card gaming.

Of course one could encode instead a generic "play card king"
character, which English fonts would render "K" etc, and still have
each card as a pair of characters.

If the card gamers' community agrees on generic symbols for the ranks of 
cards (like the chess players have already done, abandoning the letters
in chess literature) they are worth encoding. But Unicode should not try 
to invent something which does not already exist in the world and has a 
solid standing there.

--Jorg Knappen

Who still thinks, that playing cards aren't characters of plain text.




Re: Game pieces proposal

2004-06-02 Thread Jörg Knappen

Rethinking about chess:

While I think that the encoding of the chess pieces in Unicode is just 
right, I wonder about the other symbols used in chess notation, known as the
Informant code system. Many of them may already be there, scattered in the 
mathematical and technical blocks. But some might be still absent, like
WHITE STANDS SLIGHTLY BETTER (glyph looks like plus over equals sign).

I wish there were a stroke index to the mathematical symbols in Unicode.
I did such a thing for the symbols in TeX, LaTeX and AMS-LaTeX; it is 
published in my book "Schnell ans Ziel mit LaTeX2e", 
Oldenbourg-Verlag, 2nd extended and revised printing 2004.

A reference for the Informant code system:

ftp://ftp.ctan.de/tex-archive/fonts/chess/chess.zip (You need LaTeX to 
create the documentation, no ready ps or pdf file is included)

--Jorg Knappen





Re: Game pieces proposal

2004-06-02 Thread Jörg Knappen
Antonio Martins-Tuvalkin wrote:

 
  chess notation, known as informant code system. ... some might be
  still absent, like WHITE STANDS SLIGHTLY BETTER (glyph looks like
  plus over equals sign).
 
 What about U+272A PLUS SIGN ABOVE EQUALS SIGN?
 [272A should read 2A72]

Hey, there it is! And BLACK STANDS SLIGHTLY BETTER is on 2A71.
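
The mix-up of 272A and 2A72 is easy to verify with Python's standard unicodedata module. This is a minimal sketch; looking a character up by name, as in the last line, sidesteps transposed digits altogether.

```python
import unicodedata

for cp in (0x272A, 0x2A72, 0x2A71):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

# Resolve the code point from the character name instead of typing it.
found = unicodedata.lookup("PLUS SIGN ABOVE EQUALS SIGN")
print(f"U+{ord(found):04X}")   # U+2A72
```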

--Jorg Knappen.

P.S. I found a readily printable version of the chess informator signs
on ftp://ftp.dante.de/tex-archive/info/symbols/comprensive
(there you can find ps files prepared for a4 paper and letter paper)
In my printed version, it is table 200. The whole document contains lots 
of symbols people found worth doing in LaTeX. Probably raw material for 
half a dozen proposals.



Re: Game pieces proposal

2004-05-31 Thread Jörg Knappen

I think playing cards are not characters (for use in plain text). Usually,
one says (in German) Herz-A, Herz-K, Herz-10 or one uses the suit 
characters already in Unicode: <heartsuit>10 etc.

Only the French suits are currently in; one can consider encoding the 
suits of Mahjongg, German, Swiss, Italian and Spanish cards, too. And 
the special sign printed on Joker cards (a five-pointed star in a 
circle). But not more.

For information on card games and cards, you may want to consult
http://www.pagat.com

or  Detlef Hoffmann: Kultur- und Kunstgeschichte der Spielkarte, 
Jonas-Verlag, Marburg, 1995.

--Jorg Knappen



Re: Game pieces proposal

2004-05-31 Thread Jörg Knappen
There's another point about playing cards: The letters for the figures
are language-dependent. While English has AKQJ, German has AKDB and other
languages have still other letters (all for French-style cards here; 
German suits are different again, with DKOU in German). Once one starts to 
encode whole playing cards, one has to do it for all local letters...
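
The suit-plus-letter notation keeps the language dependence in ordinary text, as the following minimal Python sketch shows. The table and the function name card are made up for illustration; only the letter sets AKQJ and AKDB come from the message above.

```python
HEART = "\u2665"  # BLACK HEART SUIT, one of the suit characters already encoded

# Rank letters for French-suited cards: ace, king, queen, jack.
RANK_LETTERS = {
    "en": {"ace": "A", "king": "K", "queen": "Q", "jack": "J"},
    "de": {"ace": "A", "king": "K", "queen": "D", "jack": "B"},
}


def card(suit: str, rank: str, lang: str) -> str:
    """Write a card as suit character plus the local rank letter."""
    return suit + RANK_LETTERS[lang][rank]


print(card(HEART, "queen", "en"))
print(card(HEART, "queen", "de"))
```

A single precomposed "queen of hearts" character would have to pick one of these letters, or be duplicated per language.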

--Jorg Knappen