Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Hans Åberg via Unicode


> On 21 Feb 2020, at 13:21, Costello, Roger L. via Unicode 
>  wrote:
> 
> There are binary files and there are text files.

In C, when opening a file in binary mode with the function fopen, newlines are 
not translated [1]. Without this option, the file is informally text, which 
means that, internally in the program, one can assume that the newline [2] is 
the character U+000A LINE FEED (LF).

1. https://en.cppreference.com/w/cpp/io/c/fopen
2. https://en.wikipedia.org/wiki/Newline
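
For illustration, a minimal sketch in C++ (the file name example.txt is 
hypothetical); the only difference is the "b" flag, and on POSIX systems both 
modes behave the same, since no newline translation is needed there:

    #include <cstdio>

    int main() {
        // Text mode: on Windows, "\r\n" on disk is delivered as '\n'.
        if (FILE *text = std::fopen("example.txt", "r")) {
            for (int c; (c = std::fgetc(text)) != EOF; )
                if (c == '\n') { /* one logical newline, however stored */ }
            std::fclose(text);
        }
        // Binary mode: bytes are delivered unchanged, CR included.
        if (FILE *binary = std::fopen("example.txt", "rb")) {
            for (int c; (c = std::fgetc(binary)) != EOF; )
                if (c == '\r') { /* CR is now visible, untranslated */ }
            std::fclose(binary);
        }
    }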





Re: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop)

2020-02-14 Thread Hans Åberg via Unicode

> On 13 Feb 2020, at 16:41, wjgo_10...@btinternet.com via Unicode 
>  wrote:
> 
> Yet a Private Use Area encoding at a particular code point is not unique. 
> Thus, except with care amongst people who are aware of the particular 
> encoding, there is no interoperability, such as with regular Unicode encoded 
> characters.
> 
> However faced with a need for interoperability for my research project, I 
> have found a solution making use of the Glyph Substitution capability of an 
> OpenType font.
> 
> The solution is to invent my own encoding space. This sits on top of Unicode, 
> could be (perhaps?) called markup, but it works!

It may be perilous, because some software may strictly enforce the official 
code point limits.



Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Hans Åberg via Unicode


> On 13 Feb 2020, at 00:26, Shawn Steele  wrote:
> 
>> From the point of view of Unicode, it is simpler: If the character is in use 
>> or has had use, it should be included somehow.
> 
> That bar, to me, seems too low.  Many things are only used briefly or in a 
> private context that doesn't really require encoding.

That is what the Private Use Area is for: more specialized use.





Re: Egyptian Hieroglyph Man with a Laptop

2020-02-12 Thread Hans Åberg via Unicode


> On 12 Feb 2020, at 23:30, Michel Suignard via Unicode  
> wrote:
> 
> These abstract collections have started to appear in the first part of the 
> nineteen century (Champollion starting in 1822). Interestingly these 
> collections have started to be useful on their own even if in some case the 
> main use of  parts is self-referencing, either because the glyph is a known 
> mistake, or a ghost (character for which attestation is now firmly disputed). 
> For example, it would be very difficult to create a new set not including the 
> full Gardiner set, even if some of the characters are not necessarily 
> justified. To a large degree, Hieroglyphica (and its related collection 
> JSesh) has obtained that status as well. The IFAO (Institut Français 
> d’Archéologie Orientale) set is another one, although there is no modern 
> font representing all of it (although many of the IFAO glyphs should not be 
> encoded separately).
> 
> There is obviously no doubt that the character in question is a 
> modern invention and not based on historical evidence. But interestingly 
> enough it has started to be used as a pictogram with some content value, 
> describing in fact an Egyptologist. It may not belong to that block, but it 
> actually describes a use case and has been used as a symbol in some technical 
> publication.

From the point of view of Unicode, it is simpler: If the character is in use 
or has had use, it should be included somehow.





Re: Geological symbols

2020-01-14 Thread Hans Åberg via Unicode
For rendering, you might have a look at ConTeXt, because I recall it has an 
option whereby Unicode super- and sub-scripts can be displayed over each other 
without extra processing.


> On 14 Jan 2020, at 06:44, via Unicode  wrote:
> 
> Thanks for your reply. I think actually LaTeX is not a good option for our 
> purpose, because we want to create and disseminate datasets which are easy to 
> use and do not require any software or special font installation. Thus, we’ll 
> live with the little bit uglier version.
> Anyway, thanks!
> Thomas
>  




Re: Pure Regular Expression Engines and Literal Clusters

2019-10-14 Thread Hans Åberg via Unicode



> On 14 Oct 2019, at 02:10, Richard Wordingham via Unicode 
>  wrote:
> 
> On Mon, 14 Oct 2019 00:22:36 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode
>>>  wrote:
> 
>>> Besides invalidating complexity metrics, the issue was what \p{Lu}
>>> should match.  For example, with PCRE syntax, GNU grep Version 2.25
>>> \p{Lu} matches U+0100 but not .  When I'm respecting
>>> canonical equivalence, I want both to match [:Lu:], and that's what
>>> I do. [:Lu:] can then match a sequence of up to 4 NFD characters.  
> 
>> Hopefully some experts here can tune in, explaining exactly what
>> regular expressions they have in mind.
> 
> The best indication lies at
> https://www.unicode.org/reports/tr18/tr18-13.html#Canonical_Equivalents

The browser says the certificate expired one day ago, risking theft of 
personal and financial information, and refuses to load the page. So one has 
to load the totally insecure HTTP page, at the risk of creating mayhem on the 
computer. :-)

> (2008), which is the last version before support for canonical
> equivalence was dropped as a requirement.

As said there, one might add all the equivalents if one can find them. 
Alternatively, one could normalize the regex and the string, keeping track of 
the translation boundaries on the string so that it can be translated back to a 
match on the original string if called for.
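
As a minimal sketch of the second alternative, assuming ICU4C and ignoring 
canonical reordering across character boundaries (so it only approximates full 
NFD), one can decompose code point by code point and record where each 
normalized code unit came from:

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>
    #include <vector>

    struct Normalized {
        icu::UnicodeString text;      // decomposed text to run the regex on
        std::vector<int32_t> origin;  // origin[i]: source index of code unit i
    };

    // Decompose code point by code point, recording for every normalized
    // code unit the original index, so a match span can be mapped back.
    Normalized decompose_with_map(const icu::UnicodeString &s,
                                  UErrorCode &status) {
        const icu::Normalizer2 *nfd = icu::Normalizer2::getNFDInstance(status);
        Normalized out;
        if (U_FAILURE(status)) return out;
        for (int32_t i = 0; i < s.length(); i = s.moveIndex32(i, 1)) {
            UChar32 c = s.char32At(i);
            icu::UnicodeString d;
            if (!nfd->getDecomposition(c, d))
                d.setTo(c);                  // no decomposition: keep as is
            for (int32_t j = 0; j < d.length(); ++j)
                out.origin.push_back(i);
            out.text.append(d);
        }
        return out;
    }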





Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode


> On 13 Oct 2019, at 23:54, Richard Wordingham via Unicode 
>  wrote:
> 
> The point about these examples is that the estimate of one state per
> character becomes a severe underestimate.  For example, after
> processing 20 a's, the NFA for /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/ can
> be in any of about 50 states.  The number of possible states is not
> linear in the length of the expression.  While a 'loop iteration' can
> keep the size of the compiled regex down, it doesn't prevent the
> proliferation of states - just add zeroes to my example. 

Formally, only the expansion of such ranges is an NFA, and I haven’t seen 
anyone consider the complexity with them included. So to me, it seems just a 
hack.

>> I made some C++ templates that translate Unicode code point character
>> classes into UTF-8/32 regular expressions. So anything that can be
>> reduced to actual regular expressions would work. 
> 
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not .  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.

Hopefully some experts here can tune in, explaining exactly what regular 
expressions they have in mind.





Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode


> On 13 Oct 2019, at 21:17, Richard Wordingham via Unicode 
>  wrote:
> 
> On Sun, 13 Oct 2019 15:29:04 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode
>>> I'm now beginning to wonder what you are claiming.  
> 
>> I start with an NFA with no empty transitions and apply the subset DFA
>> construction dynamically for a given string, along with some reverse
>> NFA-data that is enough to traverse backwards when a final state
>> arrives. The result is an NFA where every traversal is a match of the
>> string at that position.
> 
> And then the speed comparison depends on how quickly one can extract
> the match information required from that data structure.

Yes. For example, one should match against the saved DFA states in constant 
time; if they are matched as dynamic sets, which is linear in the set size, 
one can get quadratic time complexity in the string size.

Even though one can iterate through each match of the NFA in linear time, it 
could have, say, two choices at each character position, each leading to the 
next, which would give a number of matches exponential in the string length.

Normally one is not interested in all matches; that is what the disambiguation 
rules are for.

> Incidentally, at least some of the sizes and timings I gave seem to be
> wrong even for strings.  They won't work with numeric quantifiers, as
> in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/.

For those, one normally implements a loop iteration. I did not do that. I 
mentioned this method to Tim Shen on the libstdc++ list, so perhaps he might 
have implemented something.

> One gets lesser issues in quantifying complexity if one wants "Å" to
> match \p{Lu} when working in NFD - potentially a different state for
> each prefix of the capital letters.  (It's also the case except for
> UTF-32 if characters are treated as sequences of code units.)  Perhaps
> 'upper case letter that Unicode happens to have encoded as a single
> character' isn't a concept that regular expressions need to support
> concisely. What's needed is to have a set somewhere between
> [\p{Lu}&\p{isNFD}] and [\p{Lu}], though perhaps it should be extended to
> include "ff" - there are English surnames like "ffrench".

I made some C++ templates that translate Unicode code point character classes 
into UTF-8/32 regular expressions. So anything that can be reduced to actual 
regular expressions would work. 





Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode


> On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode 
>  wrote:
> 
>>> On Sat, 12 Oct 2019 21:36:45 +0200
>>> Hans Åberg via Unicode  wrote:
>>> 
>>>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
>>>>>  wrote:
>>>>> 
>>>>> But remember that 'having longer first' is meaningless for a
>>>>> non-deterministic finite automaton that does a single pass through
>>>>> the string to be searched.
>>>> 
>>>> It is possible to identify all submatches deterministically in
>>>> linear time without backtracking — I made an algorithm for
>>>> that.  
> 
> I'm now beginning to wonder what you are claiming.

I start with an NFA with no empty transitions and apply the subset DFA 
construction dynamically for a given string, along with some reverse NFA-data 
that is enough to traverse backwards when a final state arrives. The result is 
an NFA where every traversal is a match of the string at that position. 
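
As a minimal sketch of this construction (the NFA representation and the 
example automaton for (a|b)*ab are hypothetical):

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    struct NFA {
        std::map<std::pair<int, char>, std::set<int>> delta; // no empty moves
        int start;
        std::set<int> finals;
    };

    // preds[i + 1][t] holds the states at position i from which state t is
    // reachable on input[i]: the reverse NFA-data for backward traversal.
    std::vector<std::map<int, std::set<int>>>
    run(const NFA &nfa, const std::string &s) {
        std::vector<std::map<int, std::set<int>>> preds(s.size() + 1);
        preds[0][nfa.start] = {};              // start state, no predecessors
        std::set<int> current = {nfa.start};
        for (std::size_t i = 0; i < s.size(); ++i) {
            std::set<int> next;
            for (int q : current) {
                auto it = nfa.delta.find({q, s[i]});
                if (it == nfa.delta.end()) continue;
                for (int t : it->second) {
                    next.insert(t);
                    preds[i + 1][t].insert(q); // record the reverse edge
                }
            }
            current = next;
        }
        return preds;
    }

    int main() {
        // Hypothetical NFA for (a|b)*ab: states 0 (start), 1, 2 (final).
        NFA nfa{{{{0, 'a'}, {0, 1}}, {{0, 'b'}, {0}}, {{1, 'b'}, {2}}}, 0, {2}};
        std::string s = "abab";
        auto preds = run(nfa, s);
        for (int f : nfa.finals)
            if (preds[s.size()].count(f)) {    // a final state was reached
                std::vector<int> path = {f};   // traverse backwards
                for (std::size_t i = s.size(); i > 0; --i)
                    path.push_back(*preds[i][path.back()].begin());
                std::cout << "match with " << path.size() << " states\n";
            }
    }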




Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Hans Åberg via Unicode


> On 13 Oct 2019, at 00:37, Richard Wordingham via Unicode 
>  wrote:
> 
> On Sat, 12 Oct 2019 21:36:45 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode
>>>  wrote:
>>> 
>>> But remember that 'having longer first' is meaningless for a
>>> non-deterministic finite automaton that does a single pass through
>>> the string to be searched.  
>> 
>> It is possible to identify all submatches deterministically in linear
>> time without backtracking — I made an algorithm for that.
> 
> That's impressive, as the number of possible submatches for a*(a*)a* is
> quadratic in the string length.

That is probably after the possibilities in the matching graph have been 
expanded, which can even be exponential. As an analogy, think of a polynomial 
product: I compute the product, not the expansion.




Re: Pure Regular Expression Engines and Literal Clusters

2019-10-12 Thread Hans Åberg via Unicode


> On 12 Oct 2019, at 14:17, Richard Wordingham via Unicode 
>  wrote:
> 
> But remember that 'having longer first' is meaningless for a
> non-deterministic finite automaton that does a single pass through the
> string to be searched.

It is possible to identify all submatches deterministically in linear time 
without backtracking — I made an algorithm for that.

A selection among different submatches then requires additional rules.





Re: Symbols of colors used in Portugal for transport

2019-04-30 Thread Hans Åberg via Unicode


> On 30 Apr 2019, at 04:32, Mark E. Shoulson via Unicode  
> wrote:
> 
> On 4/29/19 3:34 PM, Doug Ewell via Unicode wrote:
>> Hans Åberg wrote:
>>  
>>> The guy who made the artwork for Heroes is completely color-blind,
>>> seeing only in a grayscale, so they agreed he coded the colors in
>>> black and white, and then that was replaced with colors.
>>  Did he use this particular scheme? That is something I would expect to
>> see on the scheme's web site, and would probably be good evidence for a
>> proposal.
> 
> And what about existing schemes, such as have already been in use even by the 
> esteemed company present on this very list, and in several fonts, for the 
> same purpose?  See https://en.wikipedia.org/wiki/Hatching_(heraldry)

It is notable that historically, one started with written abbreviations but 
later shifted to patterns, so possibly the latter is more effective.

>> I do see several awards related to the concept, but few examples where
>> this scheme is actually in use, especially in plain text.
>>  I'm not opposed to this type of symbol, but I like to think the classic
>> rule about "established, not ephemeral" would still apply.
> 
> Indeed.
> 
> If there were encoded mere color patches (like, say, colored circles, 
> possibly in the U+1F534 range or something; just musing here), would those 
> already count as encoding these sorts of things, as black-and-white font 
> designers would be likely to interpret them in some readable fashion, perhaps 
> with hatching. Is it better to have the color be canonical and the hatched 
> design a matter of design, or have a set of hatched circles with fixed 
> hatching?

Also note the screentone and halftone articles [1-2]. In addition, there are 
reverse Ishihara tests that those with color deficiency can read correctly, 
but not those with normal color vision, relying on an enhanced capability to 
detect smaller nuances in intensity.

1. https://en.wikipedia.org/wiki/Screentone
2. https://en.wikipedia.org/wiki/Halftone





Re: Symbols of colors used in Portugal for transport

2019-04-29 Thread Hans Åberg via Unicode


> On 29 Apr 2019, at 21:34, Doug Ewell  wrote:
> 
> Hans Åberg wrote:
> 
>> The guy who made the artwork for Heroes is completely color-blind,
>> seeing only in a grayscale, so they agreed he coded the colors in
>> black and white, and then that was replaced with colors.
> 
> Did he use this particular scheme? That is something I would expect to
> see on the scheme's web site, and would probably be good evidence for a
> proposal.

They did not describe what system they used, but my impression was that he 
used different patterns, so it would still look artistic, only in black and 
white. However, it was a long time ago, so my memory may fail me. It is 
described in some of the DVD extra material.





Re: Symbols of colors used in Portugal for transport

2019-04-29 Thread Hans Åberg via Unicode


> On 29 Apr 2019, at 20:02, Doug Ewell via Unicode  wrote:
> 
> Philippe Verdy wrote:
> 
>> A very useful thing to add to Unicode (for colorblind people)!
>> 
>> http://bestinportugal.com/color-add-project-brings-color-identification-to-the-color-blind
>> 
>> Is it proposed to add as new symbols ?
> 
> Well, it isn't proposed until someone proposes it.
> 
> At first I thought Emojination would be best to write this proposal, to
> improve its chances of approval. But these aren't really emoji; they're
> actual text-like symbols, of the type that has always been considered
> appropriate for Unicode. (They're not "for transport" per se; they are a
> secondary indication of colors, meant for the color-blind.)
> 
> One important question that a proposal would need to answer is whether
> these symbols are actually used in the real world. They seem like a good
> and innovative new idea, and there is always a desire to help people
> with physical challenges; but neither of those is what Unicode is about.
> For non-emoji characters, there is usually still a requirement to show a
> certain level of actual usage.

The guy who made the artwork for Heroes is completely color-blind, seeing only 
in grayscale, so they agreed that he would code the colors in black and white, 
which was then replaced with colors.

https://en.wikipedia.org/wiki/Heroes_(U.S._TV_series)





Re: MODIFIER LETTER SMALL GREEK PHI in Calibri is wrong.

2019-04-17 Thread Hans Åberg via Unicode
You are possibly both right, because it is OK in the web font but wrong in the 
desktop font.


> On 17 Apr 2019, at 23:53, Oren Watson via Unicode  wrote:
> 
> You can easily reproduce this by going here: 
> https://www.fonts.com/font/microsoft-corporation/calibri/regular
> and putting in the following string: ψϕφᵠ
> 
> On Wed, Apr 17, 2019 at 5:23 PM James Tauber  wrote:
> It looks correct in Google Docs so it appears to have been fixed in whatever 
> version of the font is used there.
> 
> James
> 
> On Wed, Apr 17, 2019 at 5:10 PM Oren Watson via Unicode  
> wrote:
> Would anyone know where to report this?
> In the widely used Calibri typeface included with MS Office, the glyph shown 
> for U+1D60 MODIFIER LETTER SMALL GREEK PHI, actually depicts a letter psi, 
> not a phi.




Re: A last missing link for interoperable representation

2019-01-15 Thread Hans Åberg via Unicode


> On 15 Jan 2019, at 02:18, Richard Wordingham via Unicode 
>  wrote:
> 
> On Mon, 14 Jan 2019 16:02:05 -0800
> Asmus Freytag via Unicode  wrote:
> 
>> On 1/14/2019 3:37 PM, Richard Wordingham via Unicode wrote:
>> On Tue, 15 Jan 2019 00:02:49 +0100
>> Hans Åberg via Unicode  wrote:
>> 
>> On 14 Jan 2019, at 23:43, James Kass via Unicode
>>  wrote:
>> 
>> Hans Åberg wrote,
>> 
>> How about using U+0301 COMBINING ACUTE ACCENT: 푝푎푠푠푒́  
>> 
>> Thought about using a combining accent.  Figured it would just
>> display with a dotted circle but neglected to try it out first.  It
>> actually renders perfectly here.  /That's/ good to know.  (smile)  
>> 
>> It is a bit off here. One can try math, too: the derivative of 훾(푡)
>> is 훾̇(푡).
>> 
>> No it isn't.  You should be using a spacing character for
>> differentiation. 
>> 
>> Sorry, but there may be different conventions. The dot / double-dot
>> above is definitely common usage in physics.

Also in differential geometry, as for curves.

>> A./
> 
> Apologies.  It was positioned in the parenthesis, and it looked like a
> misplaced U+0301.

In MacOS, one can drop the combined character into the character table, and see 
that it is U+0307 COMBINING DOT ABOVE.

It comes out right when typeset in ConTeXt.





Re: A last missing link for interoperable representation

2019-01-14 Thread Hans Åberg via Unicode



> On 14 Jan 2019, at 23:43, James Kass via Unicode  wrote:
> 
> Hans Åberg wrote,
> 
> > How about using U+0301 COMBINING ACUTE ACCENT: 푝푎푠푠푒́
> 
> Thought about using a combining accent.  Figured it would just display with a 
> dotted circle but neglected to try it out first.  It actually renders 
> perfectly here.  /That's/ good to know.  (smile)

It is a bit off here. One can try math, too: the derivative of 훾(푡) is 훾̇(푡).





Re: A last missing link for interoperable representation

2019-01-14 Thread Hans Åberg via Unicode


> On 13 Jan 2019, at 22:43, Khaled Hosny via Unicode  
> wrote:
> 
> LaTeX with the
> “unicode-math” package will translate ASCII + font switches to the
> respective Unicode math alphanumeric characters. Word will do the same.
> Even browsers rendering MathML will do the same (though most likely the
> MathML source will have the math alphanumeric characters already).

For full translation, one probably has to use ConTeXt and LuaTeX. Then, along 
with PDF, one can also generate HTML with MathML.





Re: A last missing link for interoperable representation

2019-01-14 Thread Hans Åberg via Unicode


> On 14 Jan 2019, at 06:08, James Kass via Unicode  wrote:
> 
> 퐴푟푡 푛표푢푣푒푎푢 seems a bit 푝푎푠푠é nowadays, as well.
> 
> (Had to use mark-up for that “span” of a single letter in order to indicate 
> the proper letter form.  But the plain-text display looks crazy with that 
> HTML jive in it.)

How about using U+0301 COMBINING ACUTE ACCENT: 푝푎푠푠푒́




Re: Update to the second question summary (was: A sign/abbreviation for "magister")

2018-12-02 Thread Hans Åberg via Unicode


> On 2 Dec 2018, at 20:29, Janusz S. Bień via Unicode  
> wrote:
> 
> On Sun, Dec 02 2018 at 10:33 +0100, Hans Åberg via Unicode wrote:
>> 
>> It was common in the 1800s to singly and doubly underline superscript
>> abbreviations in handwriting according to [1-2], and [2] also mentions
>> the abbreviation discussed in this thread.
> 
> Thank you very much for this reference to the very abbreviation! I
> looked up Wikipedia but I haven't read it carefully enough :-(

Quite a coincidence, as I was looking at the article topic, and it happened to 
have this remark embedded!

>> 1. https://en.wikipedia.org/wiki/Ordinal_indicator
>> 2. https://en.wikipedia.org/wiki/Ordinal_indicator#cite_note-1





Re: A sign/abbreviation for "magister"

2018-12-02 Thread Hans Åberg via Unicode


> On 30 Oct 2018, at 22:50, Ken Whistler via Unicode  
> wrote:
> 
> On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
>> but we can't seem to agree on how to encode its abbreviation. 
> 
> For what it's worth, "mgr" seems to be the usual abbreviation in Polish for 
> it.

It was common in the 1800s to singly and doubly underline superscript 
abbreviations in handwriting according to [1-2], and [2] also mentions the 
abbreviation discussed in this thread.

1. https://en.wikipedia.org/wiki/Ordinal_indicator
2. https://en.wikipedia.org/wiki/Ordinal_indicator#cite_note-1





Re: Aleph-umlaut

2018-11-11 Thread Hans Åberg via Unicode



> On 12 Nov 2018, at 00:00, Asmus Freytag (c)  wrote:
> 
> On 11/11/2018 1:37 PM, Hans Åberg wrote:
>>> On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode 
>>>  wrote:
>>> 
>>> On 11/11/2018 12:32 PM, Hans Åberg via Unicode wrote:
>>> 
>> One should not rely too much these autotranslation tools, but it may be 
>> quicker using some OCR program and then correct by hand, than entering it 
>> all by hand. The setup did not admit transliterating Hebrew script directly 
>> into German. It seems that the translator program recognizes it as Yiddish, 
>> though it might be as a result of an assumption it makes.
> 
> Well, the OCR does a much better job than the "translation".

Not so surprising, but it did not do a literal OCR. An OCR can improve its 
transliteration by guessing the language to fill in partially recognized text, 
so there is a fallacy already there.

>> The German translation it gives:
>> Unsere Sünde kommt von der Seite der Verletzten, nachdem sie darauf gewartet 
>> hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser 
>> rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der 
>> Motivation zu schließen:
> 
> This is simply utter nonsense and does not even begin to correlate with the 
> transliteration. 
> 
>> And in English:
>> Our sin is coming out of the side of the injured side, after waiting to be 
>> expected, and having the concepts of these rabbinical devotiones, they have 
>> begun to agree with the motivation:
> 
> In fact, the English translation makes somewhat more sense. For example, 
> "Gegenpartei" in many legal contexts (which this sample isn't, by the way) 
> can in fact be translated as "injured party", which in turn might correlate 
> with an "injured side" as rendered. However "Seite der Verletzten" makes no 
> sense in this context, unless there's a Hebrew word that accidentally matches 
> and got picked up.
> (I'm suspicious that some of the auto translation does in fact work like many 
> real translations which often are not direct, but involve an intermediate 
> language - simply because it's not possible to find sufficient translators 
> between random pairs of languages.).

Google Translate uses AI, comparing texts in both languages: the Rosetta Stone 
method. Therefore, the results are poor for languages with fewer available 
texts to compare against. Sometimes it can be better than dictionaries when it 
concerns more modern terms. But in other cases, it may just produce gibberish.

>> From the original Hebrew script, in case someone wants to try out more 
>> possibilities:
>> וויר זינד אונס דעססען בעוואוסט דאסס פֿאָן זייטע דער גע־ געפארטהיי וועדער 
>> רייע , נאך איינזיכט צו ערווארטען איזט אונד דאסט זיא דיא קאַנסעקווענצען 
>> דיעזער ראבבינישען גוטאכטען פֿאָן זיך אבשיטטעלען ווערדען מיט דער מאָטיווירונג 
>> , דאסס :
>> 
> I don't know what that will tell you. You have a rendering that produces 
> coherent text which closely matches a phonetic transliteration. What else do 
> you hope to learn?

It is up to whoever likes to try (FYI).





Re: Aleph-umlaut

2018-11-11 Thread Hans Åberg via Unicode


> On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode  
> wrote:
> 
> On 11/11/2018 12:32 PM, Hans Åberg via Unicode wrote:
>> 
>>> On 11 Nov 2018, at 07:03, Beth Myre via Unicode 
>>>  wrote:
>>> 
>>> Hi Mark,
>>> 
>>> This is a really cool find, and it's interesting that you might have a 
>>> relative mentioned in it.  After looking at it more, I'm more convinced 
>>> that it's German written in Hebrew letters, not Yiddish.  I think that 
>>> explains the umlauts.  Since the text is about Jewish subjects, it also 
>>> includes Hebrew words like you mentioned, just like we would include beit 
>>> din or p'sak in an English text.
>>> 
>>> Here's a paragraph from page 22:
>>> 
>> Actually page 21.
>> 
>> 
>>> 
>>> 
>>> I (re-)transliterated it, and it reads:
>>> 
>> Taking a picture in the Google Translate app, and then pasting the Hebrew 
>> character string it identifies into translate.google.com for Yiddish gives 
>> the text:
>> 
>> 
>>> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), 
>>> noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser 
>>> rabbinischen Gutachten von sich abschüttelen werden mit der Motivierung, 
>>> dass:
>>> 
>> vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , 
>> nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer 
>> rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung ,  dass 
>> :
> 
> This agrees rather well with Beth's retranslation.
> Mapping "z" to "s",  "f" to "v" and "v" to "w" would match the way these 
> pronunciations are spelled in German (with a few outliers like "izt" for 
> "ist", where the "s" isn't voiced in German). There's also a clear convention 
> of using "kh" for "ch" (as in English "loch" but also for other pronunciation 
> of the German "ch"). The one apparent mismatch is "ge- gefarthey" for 
> "Gegenpartei". Presumably what is transliterated as "f" can stand for 
> phonetic "p". "Parthey" might be how Germans could have written "Partei" in 
> earlier centuries (when "th" was commonly used for "t" and "ey" alternated 
> with "ei", as in my last name).  So, perhaps it's closer than it looks, 
> superficially.
> From context, "Reue" is by far the best match for "Reye" and seems to match a 
> tendency elsewhere in the sample where the transliteration, if pronounced as 
> German, would result in a shifted quality for the vowels (making them sound 
> more Yiddish, for lack of a better description).
> 
> "abschüttelen" - here the second "e" would not be part of Standard German 
> orthography. It's either an artifact of the transcription system or possibly 
> reflects that the writer is familiar with a different spelling convention (to 
> my eyes the spelling "abshittelen" looks somehow more Yiddish, but I'm really 
> not familiar enough with that language).
> 
> But still, the text is unquestionably intended to be in German.

One should not rely too much on these autotranslation tools, but it may be 
quicker to use some OCR program and then correct by hand than to enter it all 
by hand. The setup did not admit transliterating Hebrew script directly into 
German. It seems that the translator program recognizes it as Yiddish, though 
that might be a result of an assumption it makes.

The German translation it gives:
Unsere Sünde kommt von der Seite der Verletzten, nachdem sie darauf gewartet 
hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen 
Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu 
schließen:

And in English:
Our sin is coming out of the side of the injured side, after waiting to be 
expected, and having the concepts of these rabbinical devotiones, they have 
begun to agree with the motivation:

From the original Hebrew script, in case someone wants to try out more 
possibilities:
וויר זינד אונס דעססען בעוואוסט דאסס פֿאָן זייטע דער גע־ געפארטהיי וועדער רייע , 
נאך איינזיכט צו ערווארטען איזט אונד דאסט זיא דיא קאַנסעקווענצען דיעזער 
ראבבינישען גוטאכטען פֿאָן זיך אבשיטטעלען ווערדען מיט דער מאָטיווירונג , דאסס :






Re: Aleph-umlaut

2018-11-11 Thread Hans Åberg via Unicode



> On 11 Nov 2018, at 07:03, Beth Myre via Unicode  wrote:
> 
> Hi Mark,
> 
> This is a really cool find, and it's interesting that you might have a 
> relative mentioned in it.  After looking at it more, I'm more convinced that 
> it's German written in Hebrew letters, not Yiddish.  I think that explains 
> the umlauts.  Since the text is about Jewish subjects, it also includes 
> Hebrew words like you mentioned, just like we would include beit din or p'sak 
> in an English text.
> 
> Here's a paragraph from page 22:

Actually page 21.

> 
> 
> 
> I (re-)transliterated it, and it reads:

Taking a picture in the Google Translate app, and then pasting the Hebrew 
character string it identifies into translate.google.com for Yiddish gives the 
text:

> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), 
> noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser 
> rabbinischen Gutachten von sich abschüttelen werden mit der Motivierung, dass:

vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh 
eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen 
gutakhten fon zikh abshittelen verden mit der motivirung ,  dass :





Re: A sign/abbreviation for "magister"

2018-10-30 Thread Hans Åberg via Unicode


> On 30 Oct 2018, at 22:50, Ken Whistler via Unicode  
> wrote:
> 
> On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
>> but we can't seem to agree on how to encode its abbreviation. 
> 
> For what it's worth, "mgr" seems to be the usual abbreviation in Polish for 
> it.

That seems to be the contemporary usage, but the postcard is from 1917, cf. the 
OP. Also, the transcription in the followup post suggests that the Polish 
script at the time, or at least that of the author, differed from the commonly 
taught D'Nealian cursive [1], cf. the "z". A variation of the latter has ended 
up as the Unicode MATHEMATICAL SCRIPT letters, which are closer to the Swedish 
cursive [2] for some letters.

1. https://en.wikipedia.org/wiki/D'Nealian
2. https://sv.wikipedia.org/wiki/Skrivstil





Re: Unicode String Models

2018-09-12 Thread Hans Åberg via Unicode


> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode  
> wrote:
> 
>> Date: Wed, 12 Sep 2018 00:13:52 +0200
>> Cc: unicode@unicode.org
>> From: Hans Åberg via Unicode 
>> 
>> It might be useful to represent non-UTF-8 bytes as Unicode code points. One 
>> way might be to use a codepoint to indicate high bit set followed by the 
>> byte value with its high bit set to 0, that is, truncated into the ASCII 
>> range. For example, U+0080 looks like it is not in use, though I could not 
>> verify this.
> 
> You must use a codepoint that is not defined by Unicode, and never
> will be.  That is what Emacs does: it extends the Unicode codepoint space
> beyond 0x10FFFF.

The idea is to extend Unicode itself, so that those bytes can be represented by 
legal codepoints. U+0080 has had some use in other encodings, but apparently 
not in Unicode itself. But one could use some other value or values, and mark 
them for this special purpose.

There are a number of other byte sequences in use, too, like overlong UTF-8. 
Also, the original UTF-8 can be extended to handle all 32-bit words, including 
those with the high bit set.




Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode 
>  wrote:
> 
> On Tue, 11 Sep 2018 21:10:03 +0200
> Hans Åberg via Unicode  wrote:
> 
>> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> LaTeX files with sections in different Cyrillic and Latin encodings,
>> changing the editor encoding while typing.
> 
> Rather like some of the old Unicode list archives, which are just
> concatenations of a month's emails, with all sorts of 8-bit encodings
> and stretches of base64.

It might be useful to represent non-UTF-8 bytes as Unicode code points. One way 
might be to use a codepoint to indicate high bit set followed by the byte value 
with its high bit set to 0, that is, truncated into the ASCII range. For 
example, U+0080 looks like it is not in use, though I could not verify this.
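
A minimal sketch of the byte mapping, with U+0080 as the assumed escape code 
point; integrating it with an actual UTF-8 decoder is not shown:

    #include <cstdint>
    #include <vector>

    // Hypothetical escape code point marking "next value is a raw byte
    // with its high bit set", per the scheme described above.
    const char32_t kRawByteEscape = 0x0080;

    // A raw byte 0x80..0xFF that failed UTF-8 decoding becomes two code
    // points: the escape, then the byte truncated into the ASCII range.
    std::vector<char32_t> encode_raw_byte(std::uint8_t byte) {
        return {kRawByteEscape, static_cast<char32_t>(byte & 0x7F)};
    }

    // Decoding restores the high bit, losslessly.
    bool decode_raw_byte(char32_t escape, char32_t low, std::uint8_t &byte) {
        if (escape != kRawByteEscape || low > 0x7F)
            return false;
        byte = static_cast<std::uint8_t>(low | 0x80);
        return true;
    }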




Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 20:40, Eli Zaretskii  wrote:
> 
>> From: Hans Åberg 
>> Date: Tue, 11 Sep 2018 20:14:30 +0200
>> Cc: hsivo...@hsivonen.fi,
>> unicode@unicode.org
>> 
>> If one encounters a file with mixed encodings, it is good to be able to view 
>> its contents and then convert it, as I see one can do in Emacs.
> 
> Yes.  And mixed encodings is not the only use case: it may well happen
> that the initial attempt to decode the file uses incorrect assumption
> about the encoding, for some reason.
> 
> In addition, it is important that changing some portion of the file,
> then saving the modified text will never change any part that the user
> didn't touch, as will happen if invalid sequences are rejected at
> input time and replaced with something else.

Indeed, before UTF-8, in the 1990s, I recall some Russians using LaTeX files 
with sections in different Cyrillic and Latin encodings, changing the editor 
encoding while typing.





Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 19:21, Eli Zaretskii  wrote:
> 
>> From: Hans Åberg 
>> Date: Tue, 11 Sep 2018 19:13:28 +0200
>> Cc: Henri Sivonen ,
>> unicode@unicode.org
>> 
>>> In Emacs, each raw byte belonging
>>> to a byte sequence which is invalid under UTF-8 is represented as a
>>> special multibyte sequence.  IOW, Emacs's internal representation
>>> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
>>> This allows mixing stray bytes and valid text in the same buffer,
>>> without risking lossy conversions (such as those one gets under model
>>> 2 above).
>> 
>> Can you give a reference detailing this format?
> 
> There's no formal description as English text, if that's what you
> meant.  The comments, macros and functions in the files
> src/character.[ch] in the Emacs source tree tell most of that story,
> albeit indirectly, and some additional info can be found in the
> section "Text Representation" of the Emacs Lisp Reference manual.

OK. If one encounters a file with mixed encodings, it is good to be able to 
view its contents and then convert it, as I see one can do in Emacs.





Re: Unicode String Models

2018-09-11 Thread Hans Åberg via Unicode


> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode  
> wrote:
> 
> In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

Can you give a reference detailing this format?





Re: Unicode String Models

2018-09-10 Thread Hans Åberg via Unicode


> On 9 Sep 2018, at 21:20, Eli Zaretskii via Unicode  
> wrote:
> 
> In Emacs, the gap is always where the text is inserted or deleted, be
> it in the middle of text or at its end.
> 
>> All editors I have seen treat the text as ordered collections of small
>> buffers (these small buffers may still have small gaps), which are
>> occasionally merged or split when needed (merging does not cause any
>> reallocation but may free one of the buffers), some of them being paged
>> out to temporary files when memory is stressed. There are some heuristics
>> in the editor's code as to when maintenance of the collection is really
>> needed and useful for the performance.
> 
> My point was to say that Emacs is not one of these editors you
> describe.

FYI, gap and rope buffers are described at [1-2]; also see the Emacs manual [3].

1. https://en.wikipedia.org/wiki/Gap_buffer
2. https://en.wikipedia.org/wiki/Rope_(data_structure)
3. https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html
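
For illustration, a minimal sketch of the gap buffer idea from [1] and [3]; 
growing the buffer when the gap empties is omitted:

    #include <cstddef>
    #include <string>
    #include <vector>

    class GapBuffer {
        std::vector<char> buf;
        std::size_t gap_begin = 0, gap_end = 0;
    public:
        explicit GapBuffer(std::size_t capacity)
            : buf(capacity), gap_end(capacity) {}

        // Move the gap so that it starts at text position pos.
        void move_gap(std::size_t pos) {
            while (gap_begin > pos) buf[--gap_end] = buf[--gap_begin];
            while (gap_begin < pos) buf[gap_begin++] = buf[gap_end++];
        }

        // Insertion at the gap is cheap: write one char, shrink the gap.
        // (A real implementation would grow buf when the gap is empty.)
        void insert(std::size_t pos, char c) {
            move_gap(pos);
            buf[gap_begin++] = c;
        }

        std::string text() const {
            return std::string(buf.begin(), buf.begin() + gap_begin)
                 + std::string(buf.begin() + gap_end, buf.end());
        }
    };

    int main() {
        GapBuffer gb(16);
        gb.insert(0, 'a');
        gb.insert(1, 'c');
        gb.insert(1, 'b');              // gap moves to position 1
        return gb.text() == "abc" ? 0 : 1;
    }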





Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-08 Thread Hans Åberg via Unicode


> On 8 Jun 2018, at 11:07, Henri Sivonen via Unicode  
> wrote:
> 
> My question is:
> 
> When designing a syntax where tokens with the user-chosen characters
> can't occur next to each other without some syntax-reserved characters
> between them, what advantages are there from limiting the user-chosen
> characters according to UAX #31 as opposed to treating any character
> that is not a syntax-reserved character as a character that can occur
> in user-named tokens?

It seems best to stick to the canonical forms and add the sequences one deems 
useful and safe, as treating inequivalent characters as equal is likely to be 
confusing. But this requires more work; it seems that the use of the 
compatibility forms is aimed at something simple to implement.





Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Hans Åberg via Unicode
Now that the distinction is possible, it is recommended to make it.

My original question was directed to the OP: whether it is deliberate.

And they are confusables only to those not accustomed to them.


> On 7 Jun 2018, at 12:05, Philippe Verdy  wrote:
> 
> In my opinion the usual constant is most often shown as  "휋" (curly serifs, 
> slightly slanted) in mathematical articles and books (and in TeX), but rarely 
> as "π" (sans-serif).
> 
> There's a tradition of using handwriting for this symbol on backboards (not 
> always with serifs, but still often slanted). A notation with the "π" symbol 
> uses a legacy roundtrip mapping for old OEM charsets on low-resolution text 
> terminals where it was distinguished from the more common Greek letter which 
> was enhanced for better readability once old low-resolution terminals were 
> replaced. "π" looks too much like an Hangul letter or a legacy box-drawing 
> character and in fact difficult to recognize as the pi constant, but it may 
> still be found in some plain-text paragraphs of inline mathematical formulas 
> on screens (for programmers), at low resolution or with small font sizes, 
> where most text is in sans-serif Latin and not slanted/italicized and not 
> using an handwritten style.
> 
> If you think about writing a functional programming language using inline 
> formulas, then  the "π" symbol may be ok for the constant, and custom 
> identifiers for a function would use standard Greek letters (or other 
> standard scripts for human languages), or would use "pi" in Latin. You would 
> then write "pi(π)" in that inline formula. For a classic 2D mathematical 
> layout, you would use  "pi(휋)" with distinctive but homonegeous styles for 
> custom variables/function names and for the classic mathematical constant.
> 
> As much as possible you will avoid mixing confusable letters/symbols in that 
> language.
> 
> Confusion is still possible if you use old texts mixing old Greek letters for 
> numerals: you would in that case avoid using the Greek letter pi for naming 
> your custom function, and would reserve the pi letter for the wellknown 
> constant. But applying distinctive styles will enhance your formulas for 
> readability.




Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Hans Åberg via Unicode


> On 7 Jun 2018, at 03:56, Asmus Freytag via Unicode  
> wrote:
> 
> On 6/6/2018 2:25 PM, Hans Åberg via Unicode wrote:
>>> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode 
>>>  wrote:
>>> 
>>> The Rust community is considering adding non-ascii identifiers, which 
>>> follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also 
>>> asks for identifiers to be treated as equivalent under NFKC.
>>> 
>> So, in this language, if one defines a projection function 휋 and the usual 
>> constant π, what is 휋(π) supposed to mean? - Just curious.
>> 
> In a language where one writes ASCII "pi" instead, what is pi(pi) supposed to 
> mean?

Indeed.





Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Hans Åberg via Unicode


> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode  
> wrote:
> 
> The Rust community is considering adding non-ascii identifiers, which follow 
> UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for 
> identifiers to be treated as equivalent under NFKC.

So, in this language, if one defines a projection function 휋 and the usual 
constant π, what is 휋(π) supposed to mean? - Just curious.
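
For illustration, a sketch of the collision, assuming ICU4C: NFKC maps U+1D70B 
MATHEMATICAL ITALIC SMALL PI to U+03C0 GREEK SMALL LETTER PI, so identifiers 
differing only in math style become equivalent:

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>
    #include <iostream>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2 *nfkc =
            icu::Normalizer2::getNFKCInstance(status);
        if (U_FAILURE(status)) return 1;
        icu::UnicodeString math_pi(static_cast<UChar32>(0x1D70B));  // 휋
        icu::UnicodeString greek_pi(static_cast<UChar32>(0x03C0));  // π
        // Both normalize to the same string, so 휋 and π collide.
        bool same = (nfkc->normalize(math_pi, status) == greek_pi);
        std::cout << (same ? "collide under NFKC\n" : "distinct\n");
    }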





Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 14:47, Arthur Reutenauer  wrote:
> 
>> The main point is what users of ẞ and ß would think, and Unicode to adjust 
>> accordingly.
> 
>  Since users of ß would think that in the vast majority of cases, it
> ought to be uppercased to SS, I think you’re missing the main point.

No, you missed the point.





Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 12:55, Arthur Reutenauer  wrote:
> 
>> If uppercasing is not common, one would think that setting it to ẞ would
>> pose no problems, now that it is available.
> 
>  It would, for reasons of stability.

The main point is what users of ẞ and ß would think, and for Unicode to adjust 
accordingly.




Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 11:17, Werner LEMBERG  wrote:
> 
>> When looking for the lowercase ß LATIN SMALL LETTER SHARP S U+00DF
>> in a MacOS Character Viewer, it does not give the uppercase version,
>> for some reason.
> 
> Yes, and it will stay so, AFAIK.  The uppercase variant of `ß' is
> `SS'.  `ẞ' is to be used mainly for names that contain `ß', and which
> must be printed uppercase, for example in passports.  Here the
> distinction is important, cf.
> 
>  Strauß vs. Strauss  →  STRAUẞ vs. STRAUSS
> 
> Since uppercasing is not common in typesetting German text (in
> particular headers), the need to make a distinction between words like
> `Masse' (mass) and `Maße' (dimensions) if written uppercase is rarely
> necessary because it can usually deduced by context.

If uppercasing is not common, one would think that setting it to ẞ would pose 
no problems, now that it is available.




Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 10:54, Martin J. Dürst  wrote:
> 
> On 2018/05/29 17:15, Hans Åberg via Unicode wrote:
>>> On 29 May 2018, at 07:30, Asmus Freytag via Unicode  
>>> wrote:
> 
>>> An uppercase exists and it has formally been ruled as acceptable way to 
>>> write this letter (mostly an issue for ALL CAPS as ß does not occur in 
>>> word-initial position).
>>> A./
>> Duden used one in 1957, but stated in 1984 that there is no uppercase 
>> version [1]. So it would be interesting with a reference to an official 
>> version.
>> 1. https://en.wikipedia.org/wiki/ß
> 
> The English wikipedia may not be fully up to date.
> See https://de.wikipedia.org/wiki/Großes_ß (second paragraph):
> 
> "Seit dem 29. Juni 2017 ist das ẞ Bestandteil der amtlichen deutschen 
> Rechtschreibung.[2][3]"
> 
> Translated to English: "Since June 29, 2017, the ẞ is part of the official 
> German orthography."
> 
> (As far as I remember the discussion (on this list?) last year, the ẞ 
> (uppercase ß) is allowed, but not required.)

And it is already in Unicode as ẞ LATIN CAPITAL LETTER SHARP S U+1E9E. When 
looking for the lowercase ß LATIN SMALL LETTER SHARP S U+00DF in the MacOS 
Character Viewer, it does not give the uppercase version, for some reason.

The equivalence with "ss" shows up in ICU Regular Expressions, which do 
case-insensitive matching where the cases have different lengths, so I gather 
it should do that for the new character too.
  http://userguide.icu-project.org/strings/regexp
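
For illustration, a sketch assuming ICU4C: with UREGEX_CASE_INSENSITIVE, ICU 
applies full case folding, so the one-character pattern ß matches the 
two-character input "SS":

    #include <unicode/regex.h>
    #include <iostream>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        icu::UnicodeString pattern(static_cast<UChar32>(0x00DF));  // ß
        icu::UnicodeString input("SS");
        icu::RegexMatcher matcher(pattern, input,
                                  UREGEX_CASE_INSENSITIVE, status);
        if (U_FAILURE(status)) return 1;
        // Full case folding: "ß" and "SS" both fold to "ss".
        std::cout << (matcher.matches(status) ? "matches\n" : "no match\n");
    }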





Re: Uppercase ß

2018-05-29 Thread Hans Åberg via Unicode


> On 29 May 2018, at 07:30, Asmus Freytag via Unicode  
> wrote:
> 
> On 5/28/2018 6:30 AM, Hans Åberg via Unicode wrote:
>>> Unifying these would make a real mess of lower casing!
>>> 
>> German has a special sign ß for "ss", without upper capital version.
>> 
>> 
> You may want to retract the second part of that sentence.
> 
> An uppercase exists and it has formally been ruled as acceptable way to write 
> this letter (mostly an issue for ALL CAPS as ß does not occur in word-initial 
> position). 
> A./

Duden used one in 1957, but stated in 1984 that there is no uppercase version 
[1]. So a reference to an official version would be interesting.

1. https://en.wikipedia.org/wiki/ß





Re: Unicode characters unification

2018-05-28 Thread Hans Åberg via Unicode


> On 28 May 2018, at 21:38, Richard Wordingham 
>  wrote:
> 
> On Mon, 28 May 2018 21:14:58 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 28 May 2018, at 21:01, Richard Wordingham via Unicode
>>>  wrote:
>>> 
>>> On Mon, 28 May 2018 20:19:09 +0200
>>> Hans Åberg via Unicode  wrote:
>>> 
>>>> Indistinguishable math styles Latin and Greek uppercase letters
>>>> have been added, even though that was not so in for example TeX,
>>>> and thus no encoding legacy to consider.  
>>> 
>>> They sort differently - one can have vaguely alphabetical indexes of
>>> mathematical symbols.  They also have quite different compatibility
>>> decompositions.
>>> 
>>> Does sorting offer an argument for encoding these symbols
>>> differently. I'm not sure it's a strong arguments - how likely is
>>> one to have a list where the difference matters?  
>> 
>> The main point is that they are not likely to be distinguishable when
>> used side-by-side in the same formula. They could be of significance
>> if using Greek names instead of letters, of length greater than one,
>> then. But it is not wrong to add them, because it is easier than
>> having to think through potential uses.
> 
> By these symbols, I meant the quarter-tone symbols.  Capital em and
> capital mu, as symbols, need to be encoded separately for proper
> sorting.

Some of the math style letters are out of order for legacy reasons, so sorting 
may not work well.

SMuFL has different fonts for text and music engraving, but I can't think of 
any use for sorting them.





Re: Unicode characters unification

2018-05-28 Thread Hans Åberg via Unicode


> On 28 May 2018, at 21:01, Richard Wordingham via Unicode 
>  wrote:
> 
> On Mon, 28 May 2018 20:19:09 +0200
> Hans Åberg via Unicode  wrote:
> 
>> Indistinguishable math styles Latin and Greek uppercase letters have
>> been added, even though that was not so in for example TeX, and thus
>> no encoding legacy to consider.
> 
> They sort differently - one can have vaguely alphabetical indexes of
> mathematical symbols.  They also have quite different compatibility
> decompositions.
> 
> Does sorting offer an argument for encoding these symbols differently.
> I'm not sure it's a strong arguments - how likely is one to have a list
> where the difference matters?

The main point is that they are not likely to be distinguishable when used 
side by side in the same formula. They could be of significance when using 
Greek names of length greater than one instead of single letters. But it is 
not wrong to add them, because that is easier than having to think through 
potential uses.





Re: Unicode characters unification

2018-05-28 Thread Hans Åberg via Unicode


> On 28 May 2018, at 19:18, Richard Wordingham via Unicode 
>  wrote:
> 
> On Mon, 28 May 2018 17:54:47 +0200
> Hans Åberg via Unicode  wrote:
> 
>>> On 28 May 2018, at 17:00, Richard Wordingham via Unicode
>>>  wrote:
>>> 
>>> On Mon, 28 May 2018 15:30:55 +0200
>>> Hans Åberg via Unicode  wrote:
> 
>>>> German has a special sign ß for "ss", without upper capital
>>>> version.  
>>> 
>>> That doesn't prevent upper-casing - you just have to know your
>>> audience.
>> 
>> That would be the same if the Greek and Latin uppercase letters would
>> have been unified: One would need to know the context.
> 
> I've seen a commutation diagram with both U+004D LATIN CAPITAL LETTER
> M and U+039C GREEK CAPITAL LETTER MU on it.  I only knew the difference
> because I listened to what the lecturer said.

Indistinguishable math-style Latin and Greek uppercase letters have been 
added, even though that was not so in, for example, TeX, and thus there was no 
encoding legacy to consider.





Re: Unicode characters unification

2018-05-28 Thread Hans Åberg via Unicode

> On 28 May 2018, at 17:00, Richard Wordingham via Unicode 
> <unicode@unicode.org> wrote:
> 
> On Mon, 28 May 2018 15:30:55 +0200
> Hans Åberg via Unicode <unicode@unicode.org> wrote:
> 
>>> On 28 May 2018, at 15:10, Richard Wordingham via Unicode
>>> <unicode@unicode.org> wrote:
>>> 
>>> On Mon, 28 May 2018 10:08:30 +0200
>>> Hans Åberg via Unicode <unicode@unicode.org> wrote:
>>> 
>>>> It is not about precision, but concepts. Like B, Β, and В, which
>>>> could have been unified, but are not.  
>>> 
>>> Unifying these would make a real mess of lower casing!  
>> 
>> German has a special sign ß for "ss", without upper capital version.
> 
> That doesn't prevent upper-casing - you just have to know your
> audience.  

That would be the same if the Greek and Latin uppercase letters had been 
unified: one would need to know the context.

> The three letters like 'B' have very different lower case
> forms, and very few would agree that they were the same letter.  

They were the same in the Uncial script, but evolved to be viewed as different. 
That is common with math symbols: something available evolving into separate 
symbols.

> For the
> same reason, there are two utter confusables in THE Latin SCRIPT for
> 00D0 LATIN CAPITAL LETTER ETH.

That stuff was likely added for computer legacy reasons, if there were 
separate encodings for those.

> More notably though, one just has to run
> the risk of getting a culturally incorrect upper case when rendering
> U+014A LATIN CAPITAL LETTER ENG; whether the three alternatives are the
> same letter is debatable.

Unified CJK Ideographs differ by stroke order.





Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Hans Åberg via Unicode

> On 28 May 2018, at 11:05, Julian Bradfield via Unicode <unicode@unicode.org> 
> wrote:
> 
> On 2018-05-28, Hans Åberg via Unicode <unicode@unicode.org> wrote:
>>> On 28 May 2018, at 03:39, Garth Wallace <gwa...@gmail.com> wrote:
>>>> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg <haber...@telia.com> wrote:
>>>> The flats and sharps of Arabic music are semantically the same as in 
>>>> Western music, departing from Pythagorean tuning, then, but the microtonal 
>>>> accidentals are different: they simply reused some that were available.
> ...
>>> The fact that they do not denote the same width in cents in Arabic music as 
>>> they do in Western modern classical does not matter. That sort of precision 
>>> is not inherent to the written symbols.
>> 
>> It is not about precision, but concepts. Like B, Β, and В, which could have 
>> been unified, but are not.
> 
> Latin, Greek, Cyrillic etc. could not have been unified, because of the
> requirement to have round-trip compatibility with previous encodings.

Indeed, they are separate in Unicode because of that, which I pointed out.

> It is also, of course, convenient for many reasons to have the notion
> of "script" hard-coded into unicode code-points, instead of in
> higher-level mark-up where it arguably belongs - just as, when
> copyright finally expires, it will be convenient to have Tolkien's
> runes disunified from historical runes (which is the line taken by the
> proposal waiting for that day). Whether it is so convenient to have a
> "music script" notion hard-coded is presumably what this argument is
> about. It's not obvious to me that musical notation is something that
> carries the "script" baggage in the same way that writing systems do.

Indeed, that is what I also pointed out. So I suggested contacting the SMuFL 
people, who might explain the underlying reasoning, and then making a decision 
about what might be suitable for Unicode. They probably keep them separate for 
the same reason as for scripts: originally different font encodings. But those 
are not official, and in addition they are for music engraving, not for 
writing in text files.





Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-28 Thread Hans Åberg via Unicode

> On 28 May 2018, at 03:39, Garth Wallace <gwa...@gmail.com> wrote:
> 
>> On Sun, May 27, 2018 at 3:36 PM, Hans Åberg <haber...@telia.com> wrote:
>> The flats and sharps of Arabic music are semantically the same as in Western 
>> music, departing from Pythagorean tuning, then, but the microtonal 
>> accidentals are different: they simply reused some that were available.
>> 
> But they aren't different! They are the same symbols. They are, as you 
> yourself say, reused.

Historically, yes, but not necessarily now.

> The fact that they do not denote the same width in cents in Arabic music as 
> they do in Western modern classical does not matter. That sort of precision 
> is not inherent to the written symbols.

It is not about precision, but concepts. Like B, Β, and В, which could have 
been unified, but are not.

> By contrast, Persian music notation invented new microtonal accidentals, 
> called the koron and sori, and my impression is that their average value, as 
> measured by Hormoz Farhat in his thesis, is also usable in Arabic music. For 
> comparison, I have posted the Arabic maqam in Helmholtz-Ellis notation [1] 
> using this value; note that one actually needs two extra microtonal 
> accidentals—Arabic microtonal notation is in fact not complete.
> 
> The E24 exact quarter-tones are suitable for making a piano sound badly out 
> of tune. Compare that with the accordion in [2], Farid El Atrache - 
> "Noura-Noura".
> 
> 1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html
> 2. https://www.youtube.com/watch?v=fvp6fo7tfpk
> > 
> 
> I don't really see how this is relevant. Nobody is claiming that the koron 
> and sori accidentals are the same symbols as the Arabic half-sharp and flat 
> with crossbar. They look entirely different. 

Arabic music simply happens to use Western style accidentals for concepts 
similar to Persian music rather than Western music.





Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-27 Thread Hans Åberg via Unicode
The flats and sharps of Arabic music are semantically the same as in Western 
music, departing from Pythagorean tuning, but the microtonal accidentals are 
different: they simply reused some symbols that were available. By contrast, 
Persian music notation invented new microtonal accidentals, called the koron 
and sori, and my impression is that their average value, as measured by Hormoz 
Farhat in his thesis, is also usable in Arabic music. For comparison, I have 
posted the Arabic maqam in Helmholtz-Ellis notation [1] using this value; note 
that one actually needs two extra microtonal accidentals—Arabic microtonal 
notation is in fact not complete.

The E24 exact quarter-tones are suitable for making a piano sound badly out of 
tune. Compare that with the accordion in [2], Farid El Atrache - "Noura-Noura".

1. https://lists.gnu.org/archive/html/lilypond-user/2016-02/msg00607.html
2. https://www.youtube.com/watch?v=fvp6fo7tfpk


> On 27 May 2018, at 22:33, Philippe Verdy <verd...@wanadoo.fr> wrote:
> 
> Thanks! 
> 
> Le dim. 27 mai 2018 22:18, Garth Wallace <gwa...@gmail.com> a écrit :
> Philippe is entirely correct here. The fact that a symbol has somewhat 
> different meanings in different contexts does not mean that it is actually 
> multiple visually identical symbols. Otherwise Unicode would be re-encoding 
> the Latin alphabet many, many times over.
> 
> During most of Bach's career, the prevailing tuning system was meantone. He 
> wrote the Well-Tempered Clavier to explore the possibilities afforded by a 
> new tuning system called well temperament. In the modern era, his work has 
> typically been played in 12-tone equal temperament. That does not mean that 
> the ♯ that Bach used in his score for the Well-Tempered Clavier was not the 
> same symbol as the ♯ in his other scores, or that they somehow invisibly 
> became yet another symbol when the score is opened on the music desk of a 
> modern Steinway.
> 
> On Sat, May 26, 2018 at 2:58 PM, Philippe Verdy <verd...@wanadoo.fr> wrote:
> Even flat notes or rhythmic and pause symbols in Western musical notations 
> have different contextual meaning depending on musical keys at start of 
> scores, and other notations or symbols added above the score. So their 
> interpretations are also variable according to context, just like tuning in a 
> Arabic musical score, which is also keyed and annotated differently. These 
> keys can also change within the same partition score.
> So both the E12 vs. E24 systems (which are not incompatible) may also be used 
> in Western and Arabic music notations. The score keys will give the 
> interpretation.
> Tone marks taken in isolation mean absolutely nothing in both systems outside 
> the keyed scores in which they are inserted, except that they are just 
> glyphs, which may be used to mean something else (e.g. a note in a comics 
> artwork could be used to denote someone whistling, without actually encoding 
> any specific tone or rhythm).
> 
> 
> 2018-05-17 17:48 GMT+02:00 Hans Åberg via Unicode <unicode@unicode.org>:
> 
> 
> > On 17 May 2018, at 16:47, Garth Wallace via Unicode <unicode@unicode.org> 
> > wrote:
> > 
> > On Thu, May 17, 2018 at 12:41 AM Hans Åberg <haber...@telia.com> wrote:
> > 
> > > On 17 May 2018, at 08:47, Garth Wallace via Unicode <unicode@unicode.org> 
> > > wrote:
> > > 
> > >> On Wed, May 16, 2018 at 12:42 AM, Hans Åberg via Unicode 
> > >> <unicode@unicode.org> wrote:
> > >> 
> > >> It would be best to encode the SMuFL symbols, which is rather 
> > >> comprehensive and includes those:
> > >>  https://www.smufl.org
> > >>  http://www.smufl.org/version/latest/
> > >> ...
> > >> 
> > >> These are otherwise originally the same, but have since drifted. So as to 
> > >> whether to unify them or keep them separate, it might be best to see what 
> > >> SMuFL does, as they are experts on the issue.
> > >> 
> > > SMuFL's standards on unification are not the same as Unicode's. For one 
> > > thing, they re-encode Latin letters and Arabic digits multiple times for 
> > > various different uses (such as numbers used in tuplets and those used in 
> > > time signatures).
> > 
> > The reason is probably that it is intended for use with music engraving, 
> > and they should then be rendered differently.
> > 
> > Exactly. But Unicode would consider these a matter for font switching in 
> > rich text.
> 
> One original principle was to ensure different encodings, so if the practice in 
> music engraving is to keep them different, they might be encoded differently.

Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-17 Thread Hans Åberg via Unicode


> On 17 May 2018, at 16:47, Garth Wallace via Unicode <unicode@unicode.org> 
> wrote:
> 
> On Thu, May 17, 2018 at 12:41 AM Hans Åberg <haber...@telia.com> wrote:
> 
> > On 17 May 2018, at 08:47, Garth Wallace via Unicode <unicode@unicode.org> 
> > wrote:
> > 
> >> On Wed, May 16, 2018 at 12:42 AM, Hans Åberg via Unicode 
> >> <unicode@unicode.org> wrote:
> >> 
> >> It would be best to encode the SMuFL symbols, which is rather 
> >> comprehensive and includes those:
> >>  https://www.smufl.org
> >>  http://www.smufl.org/version/latest/
> >> ...
> >> 
> >> These are otherwise originally the same, but have since drifted. So as to whether 
> >> to unify them or keep them separate, it might be best to see what SMuFL 
> >> does, as they are experts on the issue.
> >> 
> > SMuFL's standards on unification are not the same as Unicode's. For one 
> > thing, they re-encode Latin letters and Arabic digits multiple times for 
> > various different uses (such as numbers used in tuplets and those used in 
> > time signatures).
> 
> The reason is probably that it is intended for use with music engraving, 
> and they should then be rendered differently.
> 
> Exactly. But Unicode would consider these a matter for font switching in rich 
> text.

One original principle was to ensure different encodings, so if the practice in 
music engraving is to keep them different, they might be encoded differently.

> > There are duplicates all over the place, like how the half-sharp symbol is 
> > encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as 
> > "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as 
> > "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". 
> > They are graphically identical, and the first three even all mean the same 
> > thing, a quarter tone sharp!
> 
> But the tuning system is different, E24 and Pythagorean. Some Latin and Greek 
> uppercase letters are exactly the same but have different encodings.
> 
> Tuning systems are not scripts.

That seems obvious. As I pointed out above, the Arabic glyphs were originally 
taken from Western ones, but have a different musical meaning, even when played 
using E12, as some do.





Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-17 Thread Hans Åberg via Unicode

> On 17 May 2018, at 08:47, Garth Wallace via Unicode <unicode@unicode.org> 
> wrote:
> 
>> On Wed, May 16, 2018 at 12:42 AM, Hans Åberg via Unicode 
>> <unicode@unicode.org> wrote:
>> 
>> It would be best to encode the SMuFL symbols, which is rather comprehensive 
>> and includes those:
>>  https://www.smufl.org
>>  http://www.smufl.org/version/latest/
>> ...
>> 
>> These are otherwise originally the same, but have since drifted. So as to whether 
>> to unify them or keep them separate, it might be best to see what SMuFL does, 
>> as they are experts on the issue.
>> 
> SMuFL's standards on unification are not the same as Unicode's. For one 
> thing, they re-encode Latin letters and Arabic digits multiple times for 
> various different uses (such as numbers used in tuplets and those used in 
> time signatures).

The reason is probably that it is intended for use with music engraving, and 
they should then be rendered differently.

> There are duplicates all over the place, like how the half-sharp symbol is 
> encoded at U+E282 as "accidentalQuarterToneSharpStein", at U+E422 as 
> "accidentalWyschnegradsky3TwelfthsSharp", at U+ED35 as 
> "accidentalQuarterToneSharpArabic", and at U+E444 as "accidentalKomaSharp". 
> They are graphically identical, and the first three even all mean the same 
> thing, a quarter tone sharp!

But the tuning system is different, E24 and Pythagorean. Some Latin and Greek 
uppercase letters are exactly the same but have different encodings.

> The last, though meaning something different in Turkish context (Turkish 
> theory divides tones into 1/9-tones), is still clearly the same symbol. The 
> "Arabic accidentals" section even re-encodes all of the non-microtonal 
> accidentals (basic sharp, flat, natural, etc.) for no reason that I can 
> determine.

In Turkish AEU (Arel-Ezgi-Uzdilek) notation the sharp # is a microtonal symbol, 
not the ordinary sharp, so it should be different. In Arabic music, they are 
the same though, so they can be unified.

> There are definitely many things in SMuFL where you could make a claim that 
> they should be in Unicode proper. But not all, and the standard itself is a 
> bit of a mess.

You need to work through those little details to see what fits. Should it help 
with music engraving, or merely be used in plain text? Should symbols that 
look alike but have different musical meaning be unified?





Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-16 Thread Hans Åberg via Unicode

> On 16 May 2018, at 09:42, Hans Åberg via Unicode <unicode@unicode.org> wrote:
> 
>> On 16 May 2018, at 00:48, Ken Whistler via Unicode <unicode@unicode.org> 
>> wrote:
>> 
>>> A proposal should also show evidence of usage and glyph variations.
>> 
>> And should probably refer to the relationship between these signs and the 
>> existing:
> 
> It would be best to encode the SMuFL symbols, which is rather comprehensive 
> and includes those:
> https://www.smufl.org
> http://www.smufl.org/version/latest/
> 
>> U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
>> U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT
>> 
>> which are also half-sharp or half-flat accidentals.
>> 
>> The wiki on flat signs shows this flat with a crossbar, as well as a 
>> reversed flat symbol, to represent the half-flat.
>> 
>> And the wiki on sharp signs shows this sharp minus one vertical bar to 
>> represent the half-sharp.
>> 
>> So there may be some use of these signs in microtonal notation, outside of 
>> an Arabic context, as well. See:
>> 
>> https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation
> 
> These are otherwise originally the same, but have since drifted. So as to whether to 
> unify them or keep them separate, it might be best to see what SMuFL does, as 
> they are experts on the issue.

Clarification: The Arabic accidentals, listed here as separate entities
  http://www.smufl.org/version/latest/range/arabicAccidentals/
appear in LilyPond as ordinary microtonal accidentals:
  
http://lilypond.org/doc/v2.18/Documentation/notation/the-feta-font#accidental-glyphs

So what I meant above is that they were originally the same: when Arabic music 
started using them, some Western microtonal accidentals were simply taken over. 
Now they denote microtones in the style of Arabic music, and the musical 
interpretation varies.




Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-16 Thread Hans Åberg via Unicode

> On 16 May 2018, at 00:48, Ken Whistler via Unicode  
> wrote:
> 
> On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote:
>> I am proposing the addition of 2 new characters to the Musical Symbols table:
>> 
>> - the half-flat sign (lowers a note by a quarter tone) 
>> - the half-sharp sign (raises a note by a quarter tone)
>> 
>> In an actual proposal, I would expect a discussion of whether you are 
>> proposing to encode established symbols, or whether you are proposing new 
>> symbols to be adopted by the community (in which case Unicode would probably 
>> wait & see if they get established).
>> 
>> A proposal should also show evidence of usage and glyph variations.
> 
> And should probably refer to the relationship between these signs and the 
> existing:

It would be best to encode the SMuFL symbols, which is rather comprehensive and 
includes those:
 https://www.smufl.org
 http://www.smufl.org/version/latest/

> U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
> U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT
> 
> which are also half-sharp or half-flat accidentals.
> 
> The wiki on flat signs shows this flat with a crossbar, as well as a reversed 
> flat symbol, to represent the half-flat.
> 
> And the wiki on sharp signs shows this sharp minus one vertical bar to 
> represent the half-sharp.
> 
> So there may be some use of these signs in microtonal notation, outside of an 
> Arabic context, as well. See:
> 
> https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation

These are otherwise originally the same, but have since drifted. So as to whether to 
unify them or keep them separate, it might be best to see what SMuFL does, as 
they are experts on the issue.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-18 Thread Hans Åberg via Unicode

> On 16 May 2017, at 15:21, Richard Wordingham via Unicode 
> <unicode@unicode.org> wrote:
> 
> On Tue, 16 May 2017 14:44:44 +0200
> Hans Åberg via Unicode <unicode@unicode.org> wrote:
> 
>>> On 15 May 2017, at 12:21, Henri Sivonen via Unicode
>>> <unicode@unicode.org> wrote:  
>> ...
>>> I think Unicode should not adopt the proposed change.  
>> 
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original octet
>> sequence can be restored.
> 
> Escape sequences for the inappropriate bytes is the natural technique.
> Your problem is smoothly transitioning so that the escape character is
> always escaped when it means itself. Strictly, it can't be done.
> 
> Of course, some sequences of escaped characters should be prohibited.
> Checking could be fiddly.

One could write the bytes using \xnn escape codes, with sequences terminated using 
\& as in Haskell, and '\' translated into "\\". The result is then a C-style 
encoded string, not plain text.
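
A rough sketch of that round trip in Python (the function names are mine, 
nothing standard; only \xnn, \& and \\ are produced):

    def escape_bytes(raw: bytes) -> str:
        # Non-printable or non-ASCII bytes become \xnn escapes; \& terminates
        # an escape, as in Haskell, so a following hex digit is not absorbed.
        out = []
        for b in raw:
            if b == 0x5C:                     # the escape character itself
                out.append("\\\\")
            elif 0x20 <= b < 0x7F:            # printable ASCII passes through
                out.append(chr(b))
            else:
                out.append("\\x%02x\\&" % b)  # e.g. 0xC3 -> \xc3\&
        return "".join(out)

    def unescape_bytes(text: str) -> bytes:
        # Inverse of escape_bytes: restores the original octets exactly.
        out = bytearray()
        i = 0
        while i < len(text):
            if text[i] != "\\":
                out.append(ord(text[i]))
                i += 1
            elif text[i + 1] == "\\":         # doubled backslash
                out.append(0x5C)
                i += 2
            elif text[i + 1] == "&":          # bare terminator, emits nothing
                i += 2
            else:                             # \xnn
                out.append(int(text[i + 2:i + 4], 16))
                i += 4
        return bytes(out)

    assert unescape_bytes(escape_bytes(b"\xc3(\xffabc\\")) == b"\xc3(\xffabc\\"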





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Hans Åberg via Unicode

> On 17 May 2017, at 23:18, Doug Ewell <d...@ewellic.org> wrote:
> 
> Hans Åberg wrote:
> 
>>> Far from solving the stated problem, it would introduce a new one:
>>> conversion from the "bad data" Unicode code points, currently
>>> well-defined, would become ambiguous.
>> 
>> Actually not: just translate the invalid UTF-8 sequences into invalid
>> UTF-32.
> 
> Far from solving the stated problem, it would introduce TWO new ones...

There is no good solution to the problem of illegal UTF-8 sequences, as the 
intent of those is not known.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-17 Thread Hans Åberg via Unicode

> On 17 May 2017, at 22:36, Doug Ewell via Unicode <unicode@unicode.org> wrote:
> 
> Hans Åberg wrote:
> 
>> It would be useful, for use with filesystems, to have Unicode
>> codepoint markers that indicate how UTF-8, including non-valid
>> sequences, is translated into UTF-32 in a way that the original
>> octet sequence can be restored. 
> 
> I have always argued strongly against this idea, and always will.
> 
> Far from solving the stated problem, it would introduce a new one:
> conversion from the "bad data" Unicode code points, currently
> well-defined, would become ambiguous.

Actually not: just translate the invalid UTF-8 sequences into invalid UTF-32. 
No Unicode extensions are needed, as Unicode has no say about what happens to 
what it considers invalid.
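
Incidentally, Python does exactly this with its "surrogateescape" error 
handler (PEP 383): each undecodable byte 0xNN maps to the lone surrogate 
U+DCNN, which is not a valid Unicode scalar value, and the mapping 
round-trips:

    raw = b"abc\xc3(\xff"                  # ill-formed UTF-8
    name = raw.decode("utf-8", errors="surrogateescape")
    print([hex(ord(c)) for c in name])     # 0x61, 0x62, 0x63, 0xdcc3, 0x28, 0xdcff
    assert name.encode("utf-8", errors="surrogateescape") == raw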

> File systems cannot have it both ways: they must define file names
> either as unrestricted sequences of bytes, or as strings of characters
> in some defined encoding. If they choose the latter, they need to define
> conversion mechanisms with suitable fallback and adhere to them. They
> can use the PUA if they like. 

The latter is complicated, so I am told that is not what one does, with some 
exceptions. Also, one may end up with a file name in an unknown encoding, say 
imported remotely, and then the OS cannot deal with it.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 20:01, Philippe Verdy  wrote:
> 
> On Windows NTFS (and LFN extension of FAT32 and exFAT) at least, random 
> sequences of 16-bit code units are not permitted. There's visibly a 
> validation step that returns an error if you attempt to create files with 
> invalid sequences (including other restrictions such as forbidding U+ and 
> some other problematic controls).

For it to work the way I suggested, there would be low-level routines that 
handle the names raw and then, on top of that, interface routines doing what 
you describe. On the Austin Group List, they mentioned a filesystem doing it 
directly in UTF-16, and it could have been the one you describe.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 18:38, Alastair Houghton <alast...@alastairs-place.net> 
> wrote:
> 
> On 16 May 2017, at 17:23, Hans Åberg <haber...@telia.com> wrote:
>> 
>> HFS implements case insensitivity in a layer above the filesystem raw 
>> functions. So it is perfectly possible to have files that differ by case 
>> only in the same directory by using low level function calls. The Tenon 
>> MachTen did that on Mac OS 9 already.
> 
> You keep insisting on this, but it’s not true; I’m a disk utility developer, 
> and I can tell you for a fact that HFS+ uses a B+-Tree to hold its directory 
> data (a single one for the entire disk, not one per directory either), and 
> that that tree is sorted by (CNID, filename) pairs.  And since it’s 
> case-preserving *and* case-insensitive, the comparisons it does to order its 
> B+-Tree nodes *cannot* be raw.  I should know - I’ve actually written the 
> code for it!
> 
> Even for legacy HFS, which didn’t store UTF-16, but stored a specified Mac 
> legacy encoding (the encoding used is in the volume header), it’s case 
> sensitive, so the encoding matters.
> 
> I don’t know what tricks Tenon MachTen pulled on Mac OS 9, but I *do* know 
> how the filesystem works.

One could make files that differed only by case in the same directory, and Mac 
OS 9 did not bother. Legacy HFS tended to slow down with many files in the same 
directory, which is what gave the impression of a tree structure. The BSD 
filesystem at the time, perhaps the one that Mac OS X once supported, did not 
store files in a tree, but flat with redundancy. The rest of the info I got 
from the Austin Group List a decade ago.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 18:13, Alastair Houghton <alast...@alastairs-place.net> 
> wrote:
> 
> On 16 May 2017, at 17:07, Hans Åberg <haber...@telia.com> wrote:
>> 
>>>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
>>>>> UCS-2/UTF-16. ...
>>>> 
>>>> The filesystem directory is using octet sequences and does not bother 
>>>> passing over an encoding, I am told. Someone could remember one that to 
>>>> used UTF-16 directly, but I think it may not be current.
>>> 
>>> No, that’s not true.  All three of those systems store UTF-16 on the disk 
>>> (give or take).
>> 
>> I am not speaking about what they store, but how the filesystem identifies 
>> files.
> 
> Well, quite clearly none of those systems treat the UTF-16 strings as binary 
> either - they’re case insensitive, so how could they?  HFS+ even normalises 
> strings using a variant of a frozen version of the normalisation spec.

HFS implements case insensitivity in a layer above the filesystem raw 
functions. So it is perfectly possible to have files that differ by case only 
in the same directory by using low level function calls. The Tenon MachTen did 
that on Mac OS 9 already.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 17:52, Alastair Houghton <alast...@alastairs-place.net> 
> wrote:
> 
> On 16 May 2017, at 16:44, Hans Åberg <haber...@telia.com> wrote:
>> 
>> On 16 May 2017, at 17:30, Alastair Houghton via Unicode 
>> <unicode@unicode.org> wrote:
>>> 
>>> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
>>> UCS-2/UTF-16. ...
>> 
>> The filesystem directory is using octet sequences and does not bother 
>> passing over an encoding, I am told. Someone could remember one that used 
>> UTF-16 directly, but I think it may not be current.
> 
> No, that’s not true.  All three of those systems store UTF-16 on the disk 
> (give or take).

I am not speaking about what they store, but how the filesystem identifies 
files.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 17:30, Alastair Houghton via Unicode <unicode@unicode.org> 
> wrote:
> 
> On 16 May 2017, at 14:23, Hans Åberg via Unicode <unicode@unicode.org> wrote:
>> 
>> You don't. You have a filename, which is an octet sequence of unknown 
>> encoding, and want to deal with it. Therefore, valid Unicode transformations 
>> of the filename may result in it not being reachable.
>> 
>> It only matters that the correct octet sequence is handed back to the 
>> filesystem. All current filesystems, as far as experts could recall, use 
>> octet sequences at the lowest level; whatever encoding is used is built in a 
>> layer above. 
> 
> HFS(+), NTFS and VFAT long filenames are all encoded in some variation on 
> UCS-2/UTF-16. ...

The filesystem directory is using octet sequences and does not bother passing 
over an encoding, I am told. Someone could remember one that used UTF-16 
directly, but I think it may not be current.





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 16 May 2017, at 15:00, Philippe Verdy <verd...@wanadoo.fr> wrote:
> 
> 2017-05-16 14:44 GMT+02:00 Hans Åberg via Unicode <unicode@unicode.org>:
> 
> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode <unicode@unicode.org> 
> > wrote:
> ...
> > I think Unicode should not adopt the proposed change.
> 
> It would be useful, for use with filesystems, to have Unicode codepoint 
> markers that indicate how UTF-8, including non-valid sequences, is translated 
> into UTF-32 in a way that the original octet sequence can be restored.
> 
> Why just UTF-32 ?

Synonym for codepoint numbers. It would suffice to add markers for how it is 
translated. For example, codepoints meaning "overlong length …", "byte", or 
whatever is useful.

> How would you convert ill-formed UTF-8/UTF-16/UTF-32 to valid 
> UTF-8/UTF-16/UTF-32 ?

You don't. You have a filename, which is an octet sequence of unknown encoding, 
and want to deal with it. Therefore, valid Unicode transformations of the 
filename may result in it not being reachable.

It only matters that the correct octet sequence is handed back to the 
filesystem. All current filesystems, as far as experts could recall, use octet 
sequences at the lowest level; whatever encoding is used is built in a layer 
above. 





Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Hans Åberg via Unicode

> On 15 May 2017, at 12:21, Henri Sivonen via Unicode  
> wrote:
...
> I think Unicode should not adopt the proposed change.

It would be useful, for use with filesystems, to have Unicode codepoint markers 
that indicate how UTF-8, including non-valid sequences, is translated into 
UTF-32 in a way that the original octet sequence can be restored.





Re: How to Add Beams to Notes

2017-05-05 Thread Hans Åberg via Unicode

> On 1 May 2017, at 21:12, Michael Bear via Unicode  wrote:
> 
> I am trying to make a music notation font. It will use the Musical Symbols 
> block in Unicode (1D100-1D1FF), but, since that block has a bad rep for not 
> being very complete, I added some extra characters...

SMuFL has a rather comprehensive set of musical symbols.

http://www.smufl.org/
http://www.smufl.org/version/latest/
http://www.smufl.org/fonts/





Re: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?)

2017-03-24 Thread Hans Åberg

> On 24 Mar 2017, at 19:33, Doug Ewell  wrote:
> 
> Philippe Verdy wrote:
> 
>> But Unicode just prefered to keep the roundtrip compatiblity with
>> earlier 8-bit encodings (including existing ISO 8859 and DIN
>> standards) so that "ü" in German and French also have the same
>> canonical decomposition even if the diacritic is a diaeresis in French
>> and an umlaut in German, with different semantics and origins.
> 
> Was this only about compatibility, or perhaps also that the two signs
> look identical and that disunifying them would have caused endless
> confusion and misuse among users?

The Swedish letters ÅÄÖ are simplified ligatures, and not diacritic marks. For 
ÄÖ, the handwritten script style uses a tilde, the same as for Spanish Ñ, which 
is also a simplified ligature.





Re: Northern Khmer on iPhone

2017-02-28 Thread Hans Åberg

> On 28 Feb 2017, at 22:00, Richard Wordingham 
>  wrote:
> 
> On Tue, 28 Feb 2017 07:37:10 +
> Richard Wordingham  wrote:
> 
>> Does iPhone support the use of Northern Khmer in Thai script?  I would
>> count an interface in Thai as support.
> 
> It's been suggested to me that this is just a font issue.
> Unfortunately, it seems that one can't change the font without
> jailbreaking the phone.

A search for "iphone install fonts" gives hits for apps, and also a site that 
sells them and installs via Safari.





Re: Encoding West African Adinkra sysmbols

2017-01-22 Thread Hans Åberg

> On 22 Jan 2017, at 18:21, Michael Everson  wrote:
> 
> Are they used in plain text? How?

On textiles and walls in a similar fashion to emoji, it seems [1]. Known since 
the beginning of the 19th century.

1. https://en.wikipedia.org/wiki/Adinkra_symbols




Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Hans Åberg

> On 10 Oct 2016, at 23:39, Doug Ewell <d...@ewellic.org> wrote:
> 
> Hans Åberg wrote:
> 
>>>>> What do you mean? The IPA in narrow transcription is intended to
>>>>> provide as detailed a description as a human mind can manage of
>>>>> sounds.
>>>> 
>>>> It is designed for phonemic transcriptions, cf.,
>>>> https://en.wikipedia.org/wiki/History_of_the_International_Phonetic_Alphabet
>>> 
>>> It *was* designed, in 1870-something. Try reading the Handbook of the
>>> IPA. It contains many samples of languages transcribed both in a
>>> broad phonemic transcription appropriate for the language, and in a
>>> narrow phonetic transcription which should allow a competent
>>> phonetician to produce an understandable and reasonably accurate
>>> rendition of the passage.
>> 
>> But the alveolar clicks require an extension.
> 
> You've found ONE instance of non-distorted speech where IPA does not
> distinguish between two allophones. That is very different from saying
> that IPA is unsuitable for phonetic transcription.

There are others. For example, in Dutch, the letter "v", as in "van", is 
pronounced in dialects in continuous variation between [f] and [v], depending 
on the timing of the fricative and the following vowel. It has become popular 
in some dictionaries to use [d] in AmE where BrE uses [t], but when 
listening, it sounds more like a [t] drawn towards [d]. The Merriam-Webster 
dictionary has its own system trying to capture such variations.

One does not really speak separate consonants and vowels, but they slide over 
and adapt. Describing that is pretty tricky.





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Hans Åberg

> On 10 Oct 2016, at 23:01, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
> 
> On 2016-10-10, Hans Åberg <haber...@telia.com> wrote:
>>> On 10 Oct 2016, at 22:15, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
>>> What do you mean? The IPA in narrow transcription is intended to
>>> provide as detailed a description as a human mind can manage of
>>> sounds. It doesn't care whether you're describing differences between
>>> languages or differences within languages (a distinction that is not
>>> in any case well defined).
>> 
>> It is designed for phonemic transcriptions, cf.,
>> https://en.wikipedia.org/wiki/History_of_the_International_Phonetic_Alphabet
> 
> It *was* designed, in 1870-something. Try reading the Handbook of the
> IPA. It contains many samples of languages transcribed both in a broad
> phonemic transcription appropriate for the language, and in a narrow
> phonetic transcription which should allow a competent phonetician to
> produce an understandable and reasonably accurate rendition of the
> passage. Indeed, a couple of decades ago, I participated in a public
> engagement event in which a few of us attempted to do exactly that.

But the alveolar clicks require an extension.





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Hans Åberg

> On 10 Oct 2016, at 22:31, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
> 
> On 2016-10-10, Hans Åberg <haber...@telia.com> wrote:
>> It is possible to write math just using ASCII and TeX, which was the 
>> original idea of TeX. Is that what you want for linguistics?
> 
> I don't see the need to do everything in plain text. Long ago, I spent
> a great deal of time getting my editor to do semi-wysiwyg TeX maths
> (work later incorporated into x-symbol), but actually it's a waste of
> time and I've given up. 

A fast input method is using text substitutions together with a Unicode-capable 
editor generating UTF-8. Then use LuaTeX together with ConTeXt or 
LaTeX/unicode-math.

On MacOS, it works interactively: when a matching input string is detected, it 
is replaced. It does not take long to design such a text substitution 
set: I made one for all Unicode math letters, more than a thousand.





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Hans Åberg

> On 10 Oct 2016, at 22:15, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
> 
> On 2016-10-10, Hans Åberg <haber...@telia.com> wrote:
>>> On 10 Oct 2016, at 21:42, Doug Ewell <d...@ewellic.org> wrote:
>>> Hans Åberg wrote:
>>>> I think that IPA might be designed for broad phonetic transcriptions
>>>> [1], with a requirement to distinguish phonemes within each given
>>>> language.
> ...
>>> IPA can be used pretty much as broadly or as narrowly as one wishes.
>> 
>> Within each language, but it is not designed to capture differences between 
>> different languages or dialects.
> 
> What do you mean? The IPA in narrow transcription is intended to
> provide as detailed a description as a human mind can manage of
> sounds. It doesn't care whether you're describing differences between
> languages or differences within languages (a distinction that is not
> in any case well defined).

It is designed for phonemic transcriptions, cf.,
https://en.wikipedia.org/wiki/History_of_the_International_Phonetic_Alphabet





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Hans Åberg

> On 10 Oct 2016, at 21:43, Julian Bradfield  wrote:

> Linguists aren't stupid, and they have no need for plain text
> representations of all their symbology. Linguists write in Word or
> LaTeX (or sometimes HTML), all of which can produce a wide range of
> symbols beyond the wit of Unicode.
> 
> As I have remarked before, I have used "latin letter turned small
> capital K", for reasons that seemed good to me, and I was not one whit
> restrained by its absence from Unicode - nor was the journal.

It is possible to write math just using ASCII and TeX, which was the original 
idea of TeX. Is that what you want for linguistics?





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Hans Åberg

> On 10 Oct 2016, at 21:42, Doug Ewell <d...@ewellic.org> wrote:
> 
> Hans Åberg wrote:
> 
>> I think that IPA might be designed for broad phonetic transcriptions
>> [1], with a requirement to distinguish phonemes within each given
>> language.
> 
> From the Wikipedia article you cited:
> 
> "For example, one particular pronunciation of the English word little
> may be transcribed using the IPA as /ˈlɪtəl/ or [ˈlɪɾɫ̩]; the
> broad, phonemic transcription, placed between slashes, indicates merely
> that the word ends with phoneme /l/, but the narrow, allophonic
> transcription, placed between square brackets, indicates that this final
> /l/ ([ɫ]) is dark (velarized)."
> 
> IPA can be used pretty much as broadly or as narrowly as one wishes.

Within each language, but it is not designed to capture differences between 
different languages or dialects.





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Hans Åberg

> On 10 Oct 2016, at 15:24, Julian Bradfield <jcb+unic...@inf.ed.ac.uk> wrote:
> 
> On 2016-10-10, Hans Åberg <haber...@telia.com> wrote:
>> I think that IPA might be designed for broad phonetic transcriptions
>> [1], with a requirement to distinguish phonemes within each given
>> language. For example, the English /l/ is thicker than the Swedish,
>> but in IPA, there is only one symbol, as there is no phonemic
>> distinction within each language. The alveolar click /!/ may be
>> pronounced with or without the tongue hitting the floor of the
>> mouth, but as there is no phonemic distinction within any given
>> language, there is only one symbol [2]. 
> 
> But the IPA has many diacritics exactly for this purpose.
> The velarized English coda /l/ is usually described as [l̴]
> with U+0334 COMBINING TILDE OVERLAY, or can be notated [lˠ]
> with U+02E0 MODIFIER LETTER SMALL GAMMA.
> 
> The alveolar click with percussive flap hasn't made it into the
> standard IPA, but in ExtIPA it's [ǃ¡] (preferably kerned together).

There is ‼ DOUBLE EXCLAMATION MARK U+203C which perhaps might be used.

>> Thus, linguists wanting to describe pronunciation in more detail are left to 
>> improvise notation. The situation is then more like that of mathematics, 
>> where notation is somewhat in flux.
> 
> There is improvisation when you're studying something new, of course,
> but there's a lot of standardization.

The preceding discussion was dealing with additions to Unicode one by one; the 
question is what might be added so that linguists do not feel restrained.





Re: Why incomplete subscript/superscript alphabet ?

2016-10-10 Thread Hans Åberg

> On 10 Oct 2016, at 03:13, Doug Ewell  wrote:
> 
> Denis Jacquerye wrote:
> 
>> Regarding the superscript q, in some rare cases, it is used to
>> indicate pharyngealization or a pharyngeal consonant instead of the
>> Latin letter pharyngeal voiced fricative U+0295 ʕ, the modifier letter
>> reversed glottal stop U+02C1 ˁ or the modifier letter small reversed
>> glottal stop U+02E4 ˤ.
>> ...
> 
> Sounds like good material to include in a proposal.

I think that IPA might be designed for broad phonetic transcriptions [1], with 
a requirement to distinguish phonemes within each given language. For example, 
the English /l/ is thicker than the Swedish, but in IPA, there is only one 
symbol, as there is no phonemic distinction within each language. The alveolar 
click /!/ may be pronounced with or without the tongue hitting the floor of the 
mouth, but as there is no phonemic distinction within any given language, 
there is only one symbol [2].

Thus, linguists wanting to describe pronunciation in more detail are left to 
improvise notation. The situation is then more like that of mathematics, 
where notation is somewhat in flux.

1. https://en.wikipedia.org/wiki/Phonetic_transcription
2. https://en.wikipedia.org/wiki/Alveolar_clicks





Re: Bit arithmetic on Unicode characters?

2016-10-09 Thread Hans Åberg

> On 9 Oct 2016, at 13:00, Mark Davis ☕️  wrote:
> 
> Essentially all of the game pieces that are in Unicode were added for 
> compatibility with existing character sets. I'm guessing that there are 
> hundreds to thousands of possible other symbols associated with games in one 
> way or another, 

There is http://www.chessvariants.com/.






Re: Why incomplete subscript/superscript alphabet ?

2016-10-08 Thread Hans Åberg

> On 8 Oct 2016, at 12:03, Julian Bradfield  wrote:
> 
> I happen to think the whole math alphabet thing was a dumb
> mistake.

They are useful in mathematics, but other sciences may not use them.

> But even if it isn't - and incidentally in some communities
> there is or was a convention of using blackboard bold letters for
> matrices, which justifies all of them -:

The double-struck letters are popular among mathematicians.

>> I believe the same logic applies to the case of linguistics, where the use
>> of superscripts are a common convention.
> 
> Either superscripts are being used mathematically, in which case you
> can use mathematical markup, …

The principle for adding stuff to Unicode, I think, was that the semantics 
should be expressible in a text-only file, modulo what the technology is able 
to express.

For math, it is not known exactly what is required to express it semantically. 
TeX treats it as syntactic markup, for example, for superscripts and subscripts 
on the left hand side, and tensor component notation.
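
For example, a left superscript or a staggered tensor index is pure TeX 
syntax, positioned with empty groups, with no semantics attached:

    ${}^{14}\mathrm{C}$       % left superscript: carbon-14
    $T^{a}{}_{b}{}^{c}$       % tensor components with staggered indices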

Rendering technologies have evolved, though, so from that point of view, more 
would be possible today.





Re: Bit arithmetic on Unicode characters?

2016-10-07 Thread Hans Åberg

> On 7 Oct 2016, at 18:06, Doug Ewell  wrote:

> I can't find anything in the UCD that distinguishes one "font variant"
> from another (UnicodeData.txt shown as an example):
> 
> 1D400;MATHEMATICAL BOLD CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D434;MATHEMATICAL ITALIC CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D468;MATHEMATICAL BOLD ITALIC CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D49C;MATHEMATICAL SCRIPT CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D4D0;MATHEMATICAL BOLD SCRIPT CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D504;MATHEMATICAL FRAKTUR CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D538;MATHEMATICAL DOUBLE-STRUCK CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D56C;MATHEMATICAL BOLD FRAKTUR CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D5A0;MATHEMATICAL SANS-SERIF CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D5D4;MATHEMATICAL SANS-SERIF BOLD CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D608;MATHEMATICAL SANS-SERIF ITALIC CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D63C;MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 1D670;MATHEMATICAL MONOSPACE CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
> 
> And that's probably as it should be, because UTC never intended MAS to
> be readily transformed to and from "plain" characters. They're supposed
> to be used for mathematical expressions in which styled letters have
> special meaning.

I use them for input text files, and it is not particularly difficult. An 
efficient method is to use text substitutions, as available on MacOS. The 
resulting file is UTF-8 with the correct characters, and typesetting systems 
like LuaTeX with ConTeXt or LaTeX/unicode-math translate it into a PDF. It is 
usually easy to spot immediately if a math style is wrong. Using them in the 
input makes one more aware of new styles that were not available in the past.
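
As a small illustration of how mechanical this is for a style without holes 
(a helper of my own, nothing standard): Mathematical Bold is laid out as A..Z 
then a..z from U+1D400, so a plain offset works. Caution: some other styles do 
have holes, e.g. italic small h is the older U+210E PLANCK CONSTANT, so a pure 
offset fails there.

    def math_bold(s: str) -> str:
        # Map ASCII letters onto the Mathematical Bold block; everything
        # else passes through unchanged.
        out = []
        for c in s:
            if "A" <= c <= "Z":
                out.append(chr(0x1D400 + ord(c) - ord("A")))
            elif "a" <= c <= "z":
                out.append(chr(0x1D41A + ord(c) - ord("a")))
            else:
                out.append(c)
        return "".join(out)

    print(math_bold("Ax = b"))   # 𝐀𝐱 = 𝐛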





Re: Bit arithmetic on Unicode characters?

2016-10-07 Thread Hans Åberg

> On 7 Oct 2016, at 09:27, Garth Wallace  wrote:
> 
> Unicode doesn't really address chess piece properties like white/black beyond 
> naming conventions.

From the formal point of view, Unicode only assigns character numbers (code 
points), which get a binary representation first when encoded, as with 
UTF-8, which agrees with ASCII for small numbers. The math alphabetical 
letters are out of order because of legacy, but that is not a problem, as one 
will use an interface that sorts it out. These numbers are only for display to 
humans, and computers are nowadays fast enough to sort it out. A chess program 
has its own, optimized representation anyway.

So possibly you might add more properties.
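
To make the point about code points versus encodings concrete (plain Python, 
nothing more):

    # Code points are just numbers; bytes appear only on encoding, and
    # UTF-8 coincides with ASCII for small numbers.
    assert ord("A") == 0x41
    assert "A".encode("utf-8") == b"\x41"
    assert "\U0001D400".encode("utf-8") == b"\xf0\x9d\x90\x80"  # BOLD CAPITAL A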





Re: Why incomplete subscript/superscript alphabet ?

2016-10-01 Thread Hans Åberg

> On 1 Oct 2016, at 15:48, lorieul  wrote:

> Indeed Latex formulas are often not easy to
> decipher… 

One can improve readability by using more Unicode characters [1] and the 
unicode-math package [2], or switching to ConTeXt, which has built-in support.

1. http://milde.users.sourceforge.net/LUCR/Math/unimathsymbols.xhtml
2. https://www.ctan.org/pkg/unicode-math





Re: Math upright Latin and Greek styles

2016-05-16 Thread Hans Åberg

> On 16 May 2016, at 18:56, Philippe Verdy  wrote:
> 
> I do not advocate changing that, but these legacy *TeX variants have their 
> own builtin sets of supported fonts with their implicit style and use them 
> with the normal letters, just like what is done in HTML when you apply an 
> italic style. As these *TeX variants exist this way, they don't need these 
> additions that will be needed only on newer *TeX variants that will not use 
> explicit font variants in their encoding, but directly new distinguished code 
> points (without explicit font style tagging).

The ConTeXt macro package default engine is LuaTeX, which uses UTF-8 for text 
files and UTF-32 internally, and combines the effort of several of those other, 
older versions. Then one can use the STIX fonts (or XITS) which are Unicode.





Re: Math upright Latin and Greek styles

2016-05-16 Thread Hans Åberg

> On 16 May 2016, at 03:30, Philippe Verdy  wrote:
> 
> isn't it specified in TeX using a font selection package instead of the 
> default one? Also the only upright letters I saw was for inserting normal 
> text (not mathematical symbols) or comments/descriptions, or when using the 
> standardized "monospace", or "serif" font (which are not italic by default).

Most use a macro package like ConTeXt, which is more recent and modern than 
LaTeX, and it is not difficult to change it so that Basic Latin produces the 
math upright style. But the legacy is that it is used for math italic, and 
that is hard to change.





Re: Math upright Latin and Greek styles

2016-05-15 Thread Hans Åberg

> On 16 May 2016, at 00:05, Murray Sargent <murr...@exchange.microsoft.com> 
> wrote:
> 
> Hans Åberg mentioned "Changing Basic Latin and Greek to upright does not seem 
> practical, due to legacy and lack of efficient input methods."
> 
> Have to say that it's really easy for the user to switch between math 
> upright, italic, bold, and bold italic letters in Microsoft Word by just 
> using the usual hot keys as discussed in 
> 
> https://blogs.msdn.microsoft.com/murrays/2007/05/30/using-math-italic-and-bold-in-word-2007/.
>  
> 
> This capability has been shipping for over 10 years now. But admittedly 
> implementing such input functionality is a little tricky since the 
> alphanumerics need to be converted to the desired Unicode Math Alphanumerics.

I am not familiar with the product, so it is unclear to me whether it produces 
a UTF-8 text file with the correct Unicode code points, as is a requirement for 
the LuaTeX engine that ConTeXt defaults to. One can design a new key map on OS 
X that selects the correct Unicode code points, but that is a huge task, given 
the large number of math symbols.

The legacy issue is that there are already loads of TeX code that translates 
the Basic Latin into Unicode math italic style. So it is hard to break the 
habit, and old code cannot readily be reused.

And one can ignore the problem altogether, and use the traditional TeX 
backslash “\…” commands, but using Unicode helps the readability of the source 
code. This is even more so in the case of theorem proof assistants.




Re: Math upright Latin and Greek styles

2016-05-15 Thread Hans Åberg

> On 15 May 2016, at 23:19, Murray Sargent <murr...@exchange.microsoft.com> 
> wrote:
> 
> Hans Åberg asked, ”Are there any plans to add math upright Latin and Greek 
> styles, in order to distinguish them from regular (non-math) Latin and Greek? 
> —In programs like TeX, the latter are normally used for italics, so it means 
> that there is a conflict with using them for upright”.
>  
> Math upright Latin is unified with the ASCII alphabetics and math upright 
> Greek is unified with Unicode Greek letters in the U+0390 block. TeX and 
> MathML upright Latin and upright lower-case Greek letters are converted to 
> math italic by default. In the Linear Format, upright letters are enclosed in 
> quotes and marked as “ordinary text”. In Microsoft Word and other Microsoft 
> Office apps, you can control math italicization in math zones using the 
> italics hot key Ctrl+I and other italic formatting tools.
>  
> There is ambiguity as to whether a span of upright ASCII alphabetics is a 
> function name or a product or a combination of the two. Such ambiguities are 
> rare since spans of upright ASCII alphabetics are usually words or 
> abbreviations of some kind such as function names. Individual upright letters 
> can be distinguished as individual variables if desired by inserting 
> appropriate invisible times (U+2062) characters.
>  
> We are thinking about adding other math alphabets as discussed in the post 
> Unicode Math Calligraphic Alphabets. Comments are welcome.

The question arose on the ConTeXt mailing list [1]. Changing Basic Latin and 
Greek to upright does not seem practical, due to legacy and lack of efficient 
input methods. So the idea came up to have these reserved for text and computer 
input, while a specific math upright style would be used when wanting to 
indicate that.

1. https://mailman.ntg.nl/pipermail/ntg-context/2016/085523.html





Re: Unicode in passwords

2015-09-30 Thread Hans Åberg

> On 30 Sep 2015, at 18:33, John O'Conner  wrote:
> 
> Can you recommend any documents to help me understand potential issues (if 
> any) for password policies and validation methods that allow characters from 
> more "exotic" portions of the Unicode space?

On UNIX computers, one computes a hash (like SHA-256), which is then used to 
authenticate the password up to a high probability. The hash is stored in the 
open, but it is not known how to compute the password from the hash, so knowing 
the hash does not easily allow an attacker to authenticate.

So if the password is encoded in, say, UTF-8 and then hashed, that would seem to 
take care of most problems.
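
A minimal sketch of encode-then-hash; the NFC normalization is my own 
addition (not part of any standard) so that canonically equivalent spellings 
verify identically, and a real system would also add a per-user salt and a 
deliberately slow hash such as PBKDF2:

    import hashlib
    import unicodedata

    def password_hash(password: str) -> str:
        # Normalize, encode as UTF-8, then hash.
        normalized = unicodedata.normalize("NFC", password)
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    # Composed U+00E4 and decomposed a + U+0308 hash the same after NFC:
    assert password_hash("p\u00e4ssword") == password_hash("pa\u0308ssword")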





Re: Chess symbol glyphs in code charts

2015-08-14 Thread Hans Åberg

 On 14 Aug 2015, at 20:31, Garth Wallace gwa...@gmail.com wrote:
 
 Can anyone tell me what font is used for the chess symbols in the code
 chart for the Miscellaneous Symbols block? It looks a lot like Chess
 Merida but I can't be certain.

They are quite close to Apple Symbols, but not exactly the same.