Re: Is the binaryness/textness of a data format a property?

2020-03-22 Thread Martin J. Dürst via Unicode
On 23/03/2020 03:56, Markus Scherer via Unicode wrote:
> On Sat, Mar 21, 2020 at 12:35 PM Doug Ewell via Unicode 
> wrote:
> 
>> I thought the whole premise of GB18030 was that it was Unicode mapped into
>> a GB2312 framework. What characters exist in GB18030 that don't exist in
>> Unicode, and have they been proposed for Unicode yet, and why was none of
>> the PUA space considered appropriate for that in the meantime?
>>
> 
> My memory of GB18030 is that its code space has 1.6M code points, of which
> 1.1M are a permutation of Unicode. For the rest you would have to go beyond
> the Unicode code space for 1:1 round-trip mappings.

This matches my recollection. What's more, there are no characters 
allocated in the parts of the GB 18030 codespace that don't map to 
Unicode, and there is, as far as I understand, no plan to use that space. 
It's just there because that was the most straightforward way to extend 
GB 2312/GBK.
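[Editorial note: the numbers Markus cites can be checked with a quick back-of-the-envelope calculation. This is a sketch; the byte ranges below are the ones defined by GB 18030 itself.]

```python
# GB 18030 code space: one-byte (ASCII), two-byte (GBK-style), and
# four-byte sequences.
one_byte = 0x80                                        # lead byte 0x00..0x7F
two_byte = (0xFE - 0x81 + 1) * (0xFE - 0x40 + 1 - 1)   # lead 0x81..0xFE, trail 0x40..0xFE except 0x7F
four_byte = ((0xFE - 0x81 + 1) * 10) ** 2              # bytes 1/3: 0x81..0xFE, bytes 2/4: 0x30..0x39

print(four_byte)                        # 1,587,600 -- the ~1.6M four-byte space
print(one_byte + two_byte + four_byte)  # 1,611,668 code points in total
print(0x110000)                         # 1,114,112 Unicode code points, the ~1.1M that round-trips
```

The four-byte space alone is about 1.6M code points, of which only about 1.1M are needed to cover Unicode; the remainder is the unmapped, unallocated part discussed above.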

Regards,   Martin.



Re: Is the binaryness/textness of a data format a property?

2020-03-20 Thread Martin J. Dürst via Unicode
On 20/03/2020 23:41, Adam Borowski via Unicode wrote:

> Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF or
> U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has its uses
> but is not well-formed Unicode.

This would definitely no longer be UTF-8!   Martin.
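[Editorial note: for readers curious what carrying "more than Unicode" means concretely, the original pre-RFC 3629 UTF-8 bit packing can encode surrogates and values past U+10FFFF, but the resulting byte sequences are by definition not well-formed UTF-8 — which is exactly Martin's point. A sketch; the function name is invented for illustration.]

```python
def encode_generalized(cp):
    """Pre-RFC 3629 UTF-8 bit packing: accepts any value up to 2**31 - 1,
    including surrogates and code points beyond U+10FFFF."""
    if cp < 0x80:
        return bytes([cp])
    for nbytes, limit, lead in ((2, 1 << 11, 0xC0), (3, 1 << 16, 0xE0),
                                (4, 1 << 21, 0xF0), (5, 1 << 26, 0xF8),
                                (6, 1 << 31, 0xFC)):
        if cp < limit:
            trail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(nbytes - 1)]
            return bytes([lead | (cp >> (6 * (nbytes - 1)))] + trail[::-1])
    raise ValueError("beyond 31 bits")

lone_surrogate = encode_generalized(0xD800)   # b'\xed\xa0\x80'
try:
    lone_surrogate.decode('utf-8')            # well-formed UTF-8 rejects it
except UnicodeDecodeError as e:
    print('not UTF-8:', e.reason)
```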



Call for Papers: G21C Grapholinguistics in the 21st century, Paris June 2020

2020-01-06 Thread Martin J. Dürst via Unicode
Happy New Year to everybody on this list!

Apart from the Internationalization and Unicode Conference (see 
https://www.unicodeconference.org/; submission deadline March 6, 2020), 
this list very rarely sees calls for papers, but this one should 
definitely be of interest at least to a subset of people on this list 
(mostly those with academic/theoretic inclinations). The submission 
deadline is very close (January 13), but I have heard there may be an 
extension.


#CfP2 message#

[Apologies if you receive multiple copies of this message]



Second CALL FOR PAPERS

G21C Grapholinguistics in the 21st century—From graphemes to knowledge



June 17-18-19, 2020

Paris, France

Contact: Yannis Haralambous

grafematik2...@sciencesconf.org   or   grafematik2...@easychair.org





G21C (Grapholinguistics in the 21st Century) is a biennial conference
bringing together disciplines concerned with grapholinguistics and more
generally the study of writing systems and their representation in written
communication. The conference aims to reflect on the current state of
research in the area, and on the role that writing and writing systems play
in neighboring disciplines like computer science and information
technology, communication, typography, psychology, and pedagogy. In
particular it aims to study the effect of the growing importance of Unicode
with regard to the future of reading and writing in human societies.
Reflecting the richness of perspectives on writing systems, G21C is
actively interdisciplinary, and welcomes proposals from researchers from
the fields of computer science and information technology, linguistics,
communication, pedagogy, psychology, history, and the social sciences.



G21C aims to create a space for the discussion of the range of approaches
to writing systems, and specifically to bridge approaches in linguistics,
informatics, and other fields. It will provide a forum for explorations in
terminology, methodology, and theoretical approaches relating to the
delineation of an emerging interdisciplinary area of research that
intersects with intense activity in practical implementations of writing
systems.



The first edition of G21C was held in Brest, France, on June 14-15, 2018.
All presentations have been recorded and can be watched on
http://conferences.telecom-bretagne.eu/grafematik/



***

Keynote speakers

***

Jessica Coon, Associate Professor, Department of Linguistics, McGill
University, Montréal, Canada:

“The Linguistics of Arrival: What an alien writing system can teach us
about human language”



Martin Neef, Professor, Institut für Germanistik, TU Braunschweig,
Braunschweig, Germany:

“What is it that ends with a full stop?”



***

Main topics of interest

***



We welcome original proposals from all disciplines concerned with the study
of written language, writing systems, and their implementation in
information systems. Examples of topics include, but are not limited to:



Epistemology of grapholinguistics: history, onomastics, topics, interaction
with other disciplines

Foundations of grapholinguistics, graphemics and graphetics

History and typology of writing systems, comparative graphemics/graphetics

Semiotics of writing and of writing systems

Computational/formal graphemics/graphetics

Grapholinguistic theory of Unicode encoding

Orthographic reforms, theory and practice

Graphemics/graphetics and multiliteracy

Writing and Art / Writing in Art

Sinographemics

Typographemics, typographetics

Texting, latinization, new forms of written language

ASCII art, emoticons and other pictorial uses of graphemes

The future of writing, of writing systems and styles

Graphemics/graphetics and font technologies

Graphemics/graphetics in steganography and computer security (phishing,
typosquatting, etc.)

Graphemics/graphetics in art, media and communication / Aesthetics of
writing in the digital era

Graphemics/graphetics in experimental psychology and cognitive sciences

Teaching graphemics/graphetics, the five Ws and one H

Grapholinguistic applications in natural language processing and text mining

Grapholinguistic applications in optical character recognition and
information technologies





Program committee





Gabriel Altmann, formerly at Ruhr-Universität Bochum, Germany

Jannis Androutsopoulos, Universität Hamburg, Germany

Vlad Atanasiu, Université de Fribourg, Switzerland

Kristian Berg, Universität Oldenburg, Germany

Peter Bilak, Typothèque, The Hague, The Netherlands

Florian Coulmas, Universität Duisburg, Germany

Jacques David, Université de Cergy-Pontoise, France

Mark Davis, Unicode Consortium & Google Inc., Switzerland

Joseph Dichy, Université Lumière Lyon 2, France

Christa Dürscheid, 

Re: Grapheme clusters and backspace (was Re: Coding for Emoji: how to modify programs to work with emoji)

2019-10-22 Thread Martin J. Dürst via Unicode
Hello Richard, others,

On 2019/10/23 07:32, Richard Wordingham via Unicode wrote:
> On Tue, 22 Oct 2019 23:27:27 +0200
> Daniel Bünzli via Unicode  wrote:

>> Just to make things clear. When you say character in your message,
>> you consistently mean scalar value right ?
> 
> Yes.
> 
> I find it hard to imagine that having to type them doesn't endow them
> with some sort of reality in the users' minds, though some, such as
> invisible stackers, are probably envisaged as control characters.

I think this to some extent is a question of "reality in the users' 
minds". But to a very large extent, this is an issue of muscle memory. 
If a user works with a keyboard/input method that deletes a whole 
combination, their muscles will get used to that the same way they will 
get used to the other case.

Users are perfectly capable of talking about characters and, in the same 
sentence, using that word once for something like individual code points 
and later for a whole combination.

> One does come across some odd entry methods, such as typing an Indic
> akshara using the Latin script and then entering it as a whole.  That
> is no more conducive to seeing the constituents as characters than is
> typing wab- to get the hieroglyph ヂ.

The input of Japanese Kana is usually done from a Latin keyboard. As an 
example, to input the syllable "ka" (か), one presses the keys for 'k' 
and 'a'. In all the IMEs I have used, a backspace deletes the whole
"か", not only the 'a'. One has to get used to it (I still occasionally 
want to press two backspaces when realizing I made a typo), but one gets 
used to it.

There are also cases such as "kya" → "きゃ", where the three Latin 
keyboard presses cannot be allocated 2-1 or 1-2 to the two resulting 
Hiragana. In a sophisticated implementation, a backspace could go from
"きゃ" to "ky", but that would only work immediately after input.

Of course, for Japanese input, Latin → Kana is only the first layer, the 
second layer is Kana → Kanji.
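[Editorial note: the whole-unit deletion Martin describes can be sketched with a toy romaji-to-kana buffer. The ROMAJI table and function names are invented for illustration; real IMEs are far more elaborate and also handle the second, Kana-to-Kanji layer.]

```python
# Toy IME buffer: composed output is stored as (kana, romaji_chunk) pairs
# so that one backspace removes the whole kana a romaji chunk produced.
ROMAJI = {'ka': 'か', 'ki': 'き', 'kya': 'きゃ', 'a': 'あ'}
buffer = []

def type_chunk(romaji):
    buffer.append((ROMAJI[romaji], romaji))

def backspace():
    if buffer:
        buffer.pop()        # deletes "か" as a unit, not just the trailing 'a'

def text():
    return ''.join(kana for kana, _ in buffer)

type_chunk('ka')
type_chunk('kya')
backspace()                 # removes the whole きゃ
print(text())               # か
```

The sophisticated variant mentioned above (backspace from "きゃ" back to "ky") would need the buffer to keep the raw keystrokes as well, and would only apply immediately after input.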

Regards,   Martin.




Re: Unicode website glitches. (was The Most Frequent Emoji)

2019-10-11 Thread Martin J. Dürst via Unicode
Hello Mark,

On 2019/10/12 02:35, Mark Davis ☕️ wrote:
> There was a caching problem with WordPress, where you have to do a hard
> reload in some browsers. See if the problem still exists, and if the hard
> reload fixes it. If anyone else is having trouble with that, let us know.

I can confirm that a hard reload fixed the problem.


> BTW, if you want to comment on the format as opposed to glitches, please
> change the subject line.

I think it's less the format and much more the split personality of the 
Unicode Web site(s?) that I have problems with.

Regards,   Martin.

> Mark
> 
> 
> On Thu, Oct 10, 2019 at 11:50 PM Martin J. Dürst via Unicode <
> unicode@unicode.org> wrote:
> 
>> I had a look at the page with the frequencies. Many emoji didn't
>> display, but that's my browser's problem. What was worse was that the
>> sidebar and the stuff at the bottom was all looking weird. I hope this
>> can be fixed.
>>
>> Regards,   Martin.

>> The new Unicode Emoji Frequency
>> <https://home.unicode.org/emoji/emoji-frequency> page shows a list of
>> the Unicode v12.0 emoji ranked in order of how frequently they are used.



Fwd: The Most Frequent Emoji

2019-10-11 Thread Martin J. Dürst via Unicode
I had a look at the page with the frequencies. Many emoji didn't 
display, but that's my browser's problem. What was worse was that the 
sidebar and the stuff at the bottom was all looking weird. I hope this 
can be fixed.

Regards,   Martin.

 Forwarded Message 
Subject: The Most Frequent Emoji
Date: Wed, 09 Oct 2019 07:56:37 -0700
From: announceme...@unicode.org
Reply-To: r...@unicode.org
To: announceme...@unicode.org

How does the Unicode Consortium choose which new 
emoji to add? One important factor is data about how frequently the 
current emoji are used. Patterns of usage help to inform decisions about 
future emoji. The Consortium has been working to assemble this 
information and make it available to the public.

And the two most frequently used emoji in the world are...
 and ❤️
The new Unicode Emoji Frequency 
<https://home.unicode.org/emoji/emoji-frequency> page shows a list of 
the Unicode v12.0 emoji ranked in order of how frequently they are used.

“The forecasted frequency of use is a key factor in determining whether 
to encode new emoji, and for that it is important to know the frequency 
of use of existing emoji,” said Mark Davis, President of the Unicode 
Consortium. “Understanding how frequently emoji are used helps 
prioritize which categories to focus on and which emoji to add to the 
Standard.”


Over 136,000 characters are available for adoption, to help the 
Unicode Consortium’s work on digitally disadvantaged languages.


http://blog.unicode.org/2019/10/the-most-frequent-emoji.html




Re: Manipuri/Meitei customary writing system

2019-10-04 Thread Martin J. Dürst via Unicode
On 2019/10/04 15:35, Martin J. Dürst via Unicode wrote:
> Hello Markus,
> 
> On 2019/10/04 01:53, Markus Scherer via Unicode wrote:
>> Dear Unicoders,
>>
>> Is Manipuri/Meitei customarily written in Bangla/Bengali script or
>> in Meitei script?
>>
>> I am looking at
>> https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems
>> to describe writing practice in transition, and I can't quite tell where it
>> stands.
>>
>> Is the use of the Meitei script aspirational or customary?
>> Which script is being used for major newspapers, popular books, and video
>> captions?
> 
> This may give you some more information:
> https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906

Sorry, this should have been two separate URIs (about the same talk).

> https://www.youtube.com/watch?v=S8XxVZkfUkk
> 
> It's a recent talk at ATypI in Tokyo (sponsored by Google, among others).
> 
> Regards,   Martin.
> 



Re: Manipuri/Meitei customary writing system

2019-10-04 Thread Martin J. Dürst via Unicode
Hello Markus,

On 2019/10/04 01:53, Markus Scherer via Unicode wrote:
> Dear Unicoders,
> 
> Is Manipuri/Meitei customarily written in Bangla/Bengali script or
> in Meitei script?
> 
> I am looking at
> https://en.wikipedia.org/wiki/Meitei_language#Writing_systems which seems
> to describe writing practice in transition, and I can't quite tell where it
> stands.
> 
> Is the use of the Meitei script aspirational or customary?
> Which script is being used for major newspapers, popular books, and video
> captions?

This may give you some more information:
https://www.atypi.org/conferences/tokyo-2019/programme/activity?a=906https://www.youtube.com/watch?v=S8XxVZkfUkk

It's a recent talk at ATypI in Tokyo (sponsored by Google, among others).

Regards,   Martin.



Re: Emoji Haggadah

2019-04-16 Thread Martin J. Dürst via Unicode
Hello Mark, others,

On 2019/04/16 12:18, Mark E. Shoulson via Unicode wrote:
> Yes.  But the sentences aren't just symbolic representations of the 
> concepts or something.  They are frequently direct 
> transcriptions—usually by puns—for *English* sentences, so left-to-right 
> makes sense.  So for example, the phrase "️⌛️️" translates "The LORD 
> our God".  For whatever reason, the author decided to go with ️ for 
> "God" and such, and the hourglass in the middle is for "our", which 
> sounds like "hour".  See?  Ugh.  I think he uses  for "us" (U.S. = 
> us). In the story of the five Rabbis discussing the laws in Bnei Brak, 
> for one thing the word "Rabbi" is transcribed  ("rabbit" instead of 
> "rabbi"), and it says they were in "" (boy - boy - 
> cloud-with-lightning).  The two boys for "sons" (which translates the 
> word "Bnei" in the name of the city), and the lightning, "barak" in 
> Hebrew, is for "brak", the second part of the name. The front cover, 
> which you can see on the amazon page... That  (shell) in the title? 
> Because it's saying "Haggadah shel Pesach", the Hebrew word "shel" 
> meaning "of."  The author's name?  ♥♢♣♠ (or whatever the exact 
> ordering is): "Martin Bodek", that is martini-glass, bow, and the four 
> suits of a DECK of cards.  Sorry; see what I mean about getting carried 
> away by being able to read the silly thing?  Anyway.  The sentences are 
> definitely ENGLISH sentences, not Hebrew or any sort of language-neutral 
> semasiography or whatever, so LTR ordering makes sense (to the extent 
> any of this makes sense.)

All the examples you cite, where images stand for sounds, are typically 
used in some of the oldest "ideographic" scripts. Egyptian definitely 
has such concepts, and Han (CJK) does so, too, with most ideographs 
consisting of a semantic and a phonetic component.

There is a well-known thesis in linguistics that every script has to be 
at least in part phonetic, and the above are examples that add support 
to this. For deeper explanations (unfortunately not yet including 
emoji), see e.g. "Visible Speech - The Diverse Oneness of Writing 
Systems", by John DeFrancis, University of Hawaii Press, 1989.

Regards,   Martin.


> ~mark
> 
> On 4/15/19 10:56 PM, Beth Myre via Unicode wrote:
>> This is amazing.
>>
>> It's also really interesting that he decided to make the sentences 
>> read left-to-right.
>>
>> On Mon, Apr 15, 2019 at 10:05 PM Tex via Unicode > > wrote:
>>
>>     Oy veh!
>>
>>     *From:*Unicode [mailto:unicode-boun...@unicode.org
>>     ] *On Behalf Of *Mark E.
>>     Shoulson via Unicode
>>     *Sent:* Monday, April 15, 2019 5:27 PM
>>     *To:* unicode@unicode.org 
>>     *Subject:* Emoji Haggadah
>>
>>     The only thing more disturbing than the existence of The Emoji
>>     Haggadah
>>     (https://www.amazon.com/Emoji-Haggadah-Martin-Bodek/dp/1602803463/)
>>     is the fact that I'm starting to find that I can read it...
>>
>>     ~mark




Re: Encoding italic

2019-02-09 Thread Martin J. Dürst via Unicode
On 2019/02/09 19:58, Richard Wordingham via Unicode wrote:
> On Fri, 8 Feb 2019 18:08:34 -0800
> Asmus Freytag via Unicode  wrote:

>> Under the implicit assumptions bandied about here, the VS approach
>> thus reveals itself as a true rich-text solution (font switching)
>> albeit realized with pseudo coding rather than markup, markdown or
>> escape sequences.
> 
> Isn't that already the case if one uses variation sequences to choose
> between Chinese and Japanese glyphs?

Well, not necessarily. There's nothing prohibiting a font that includes 
both Chinese and Japanese glyph variants.

Regards,   Martin.



Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Martin J. Dürst via Unicode
On 2019/01/31 07:02, Richard Wordingham via Unicode wrote:
> On Wed, 30 Jan 2019 15:33:38 +0100
> Frédéric Grosshans via Unicode  wrote:
> 
>> Le 30/01/2019 à 14:36, Egmont Koblinger via Unicode a écrit :
>>> - It doesn't do Arabic shaping. In my recommendation I'm arguing
>>> that in this mode, where shuffling the characters is the task of
>>> the text editor and not the terminal, so should it be for Arabic
>>> shaping using presentation form characters.
>>
>> I guess Arabic shaping is doable through presentation form
>> characters, because the latter are character inherited from legacy
>> standards using them in such solutions.
> 
> So long as you don't care about local variants, e.g. U+0763 ARABIC
> LETTER KEHEH WITH THREE DOTS ABOVE.  It has no presentation form
> characters.

The same applies to characters used for languages other than Arabic.

> Basic Arabic shaping, at the level of a typewriter, is straightforward
> enough to leave to a terminal emulator, as Eli has suggested.  Lam-alif
> would be trickier - one cell or two?

Same for other characters. A medial Beh/Teh/Theh/... (ببب) in any 
reasonably decent rendering should take quite a bit less space than a 
Seen or Sheen (سسس). I remember that the multilingual Emacs version 
mostly written by Ken'ichi Handa (was it called mEmacs or nEmacs or 
something like that?) had different widths just for Arabic. In 
Thunderbird, which is what I'm using here, I get hopelessly 
stretched/squeezed glyph shapes, which definitely don't look good.
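[Editorial note: Richard's point about U+0763 can be checked programmatically. The compatibility decompositions in the Unicode Character Database link each Arabic presentation-form character back to its base letter, and KEHEH WITH THREE DOTS ABOVE has none. A sketch using Python's unicodedata module:]

```python
import unicodedata

def presentation_forms(base):
    """Collect Arabic presentation-form characters whose compatibility
    decomposition is exactly the given base letter (scans the Arabic
    Presentation Forms-A and -B blocks)."""
    forms = {}
    for cp in list(range(0xFB50, 0xFE00)) + list(range(0xFE70, 0xFF00)):
        parts = unicodedata.decomposition(chr(cp)).split()
        # a positional form decomposes like '<medial> 0628'
        if len(parts) == 2 and parts[0].startswith('<') \
                and parts[1] == '%04X' % ord(base):
            forms[parts[0].strip('<>')] = chr(cp)
    return forms

print(presentation_forms('\u0628'))  # BEH: isolated, final, initial, medial forms
print(presentation_forms('\u0763'))  # KEHEH WITH THREE DOTS ABOVE: {} -- none
```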

Regards,   Martin.



Re: Encoding italic

2019-01-29 Thread Martin J. Dürst via Unicode
On 2019/01/28 05:03, James Kass via Unicode wrote:
> 
> A new beta of BabelPad has been released which enables input, storing, 
> and display of italics, bold, strikethrough, and underline in plain-text 
> using the tag characters method described earlier in this thread.  This 
> enhancement is described in the release notes linked on this download page:
> 
> http://www.babelstone.co.uk/Software/index.html
>

I didn't say anything at the time this idea first came up, because I 
hoped people would understand that it was a bad idea.

Here's a little dirty secret about these tag characters: They were 
placed in one of the astral planes explicitly to make sure they'd use 4 
bytes per tag character, and thus quite a few bytes for any actual 
complete tags. See https://tools.ietf.org/html/rfc2482 for details. Note 
that RFC 2482 has been obsoleted by https://tools.ietf.org/html/rfc6082, 
in parallel with a similar motion on the Unicode side.
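[Editorial note: the byte cost is easy to see in practice. A sketch of the now-deprecated RFC 2482 scheme, which prefixed U+E0001 LANGUAGE TAG to tag-letter clones of ASCII:]

```python
# Spelling the language tag "en" with tag characters: U+E0001 followed by
# the tag-letter clones of 'e' and 'n' (U+E0065, U+E006E).
plain = 'en'
tagged = '\U000E0001' + ''.join(chr(0xE0000 + ord(c)) for c in plain)

print(len(plain.encode('utf-8')))    # 2 bytes
print(len(tagged.encode('utf-8')))   # 12 bytes -- 4 bytes per tag character
```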

These tag characters were born only to shoot down an even worse 
proposal, https://tools.ietf.org/html/draft-ietf-acap-mlsf-01. For some 
additional background, please see 
https://tools.ietf.org/html/draft-ietf-acap-langtag-00.

The overall tag proposal had the desired effect: The original proposal 
to hijack some unused bytes in UTF-8 was defeated, and the tags themselves 
were not actually used and therefore could be deprecated.

Bad ideas turn up once every 10 or 20 years. It usually takes some time 
for some of the people to realize that they are bad ideas. But that 
doesn't make them any better when they turn up again.

Regards,   Martin.



Re: Encoding italic

2019-01-29 Thread Martin J. Dürst via Unicode
On 2019/01/24 23:49, Andrew West via Unicode wrote:
> On Thu, 24 Jan 2019 at 13:59, James Kass via Unicode
>  wrote:

> We were told time and time again when emoji were first proposed that
> they were required for encoding for interoperability with Japanese
> telecoms whose usage had spilled over to the internet. At that time
> there was no suggestion that encoding emoji was anything other than a
> one-off solution to a specific problem with PUA usage by different
> vendors, and I at least had no idea that emoji encoding would become a
> constant stream with an annual quota of 60+ fast-tracked
> user-suggested novelties. Maybe that was the hidden agenda, and I was
> just naïve.

I don't think this was a hidden agenda. Nobody in the US or Europe 
thought that emoji would catch on like they did, with ordinary people 
and the press. Of course they had been popular in Japan; that's why they 
got into Unicode.


> The ESC and UTC do an appallingly bad job at regulating emoji, and I
> would like to see the Emoji Subcommittee disbanded, and decisions on
> new emoji taken away from the UTC, and handed over to a consortium or
> committee of vendors who would be given a dedicated vendor-use emoji
> plane to play with (kinda like a PUA plane with pre-assigned
> characters with algorithmic names [VENDOR-ASSIGNED EMOJI X] which
> the vendors can then associate with glyphs as they see fit; and as
> emoji seem to evolve over time they would be free to modify and
> reassign glyphs as they like because the Unicode Standard would not
> define the meaning or glyph for any characters in this plane).

To a small extent, that already happens. The example I'm thinking about 
is the transition from a (potentially bullet-carrying) pistol to a 
water pistol. The Unicode Consortium doesn't define the meaning of any of 
its characters, and doesn't define standard glyphs for characters, just 
example glyphs. Another example is a presenter at a conference who was 
using lots of emoji saying that he would need to redo his presentation 
because the vendor of his notebook's OS was in the process of changing 
their emoji designs.

Regards,   Martin.



Re: Encoding italic

2019-01-17 Thread Martin J. Dürst via Unicode
On 2019/01/17 17:51, James Kass via Unicode wrote:
> 
> On 2019-01-17 6:27 AM, Martin J. Dürst replied:

>  > ...
>  > Based on these data points, and knowing many of the people involved, my
>  > description would be that decisions about what to encode as characters
>  > (plain text) and what to deal with on a higher layer (rich text) were
>  > taken with a wide and deep background, in a gradually forming industry
>  > consensus.
> 
> (IMO) All of which had to deal with the existing font size limitations 
> of 256 characters and the need to reserve many of those for other 
> textual symbols as well as box drawing characters.  Cause and effect. 
> The computer fonts weren't designed that way *because* there was a 
> technical notion to create "layers".  It's the other way around.  (If 
> I'm not mistaken.)

Most probably not. I think Asmus has already alluded to it, but in good 
typography, roman and italic fonts are considered separate. They are 
often used in sets, but it's not impossible e.g. to cut a new italic to 
an existing roman or the other way round. This predates any 8-bit/256 
characters limitations. Also, Unicode from the start knew that it had to 
deal with more than 256 characters, not only for East Asia, and so I 
don't think such size limits were a major issue when designing Unicode.

On the other hand, the idea that all Unicode characters (or a 
significant and as yet undetermined subset of them) would need 
italic, ... variants will definitely have led to the shooting down of such 
ideas, in particular because Unicode started as strictly 16-bit.

Regards,   Martin.



Re: Encoding italic

2019-01-16 Thread Martin J. Dürst via Unicode



On 2019/01/17 12:38, James Kass via Unicode wrote:

> ( http://www.unicode.org/versions/Unicode11.0.0/ch02.pdf )
> 
> "Plain text must contain enough information to permit the text to be 
> rendered legibly, and nothing more."
> 
> The argument is that italic information can be stripped yet still be 
> read.  A persuasive argument towards encoding would need to negate that; 
> it would have to be shown that removing italic information results in a 
> loss of meaning.

Well, yes. But please be aware of the fact that characters and text are 
human inventions grown and developed in many cultures over many 
centuries. It's not something where a single sentence will make all the 
subsequent decisions easy.

So even if you can find examples where the presence or absence of 
styling clearly makes a semantic difference, this may or will not be 
enough. It's only when it's often or overwhelmingly (as opposed to 
occasionally) the case that a styling difference makes a semantic 
difference that this would start to become a real argument for plain 
text encoding of italics (or other styling information).

To give a similar example, books about typography may discuss the 
different shapes of 'a' and 'g' in various fonts (often, the roman 
variant uses one shape (e.g. the 'g' with two circles), and the italic 
uses the other (e.g. the 'g' with a hook towards the bottom right)). But 
just because in this context, these shapes are semantically different, 
doesn't mean that they need to be distinguished at the plain text level.
(There are variants for IPA that are restricted to specific shapes, 
namely 'ɑ' and 'ɡ', but that's a separate issue.)


> The decision makers at Unicode are familiar with italic use conventions 
> such as those shown in "The Chicago Manual of Style" (first published in 
> 1906).  The question of plain-text italics has arisen before on this 
> list and has been quickly dismissed.
> 
> Unicode began with the idea of standardizing existing code pages for the 
> exchange of computer text using a unique double-byte encoding rather 
> than relying on code page switching.  Latin was "grandfathered" into the 
> standard.  Nobody ever submitted a formal proposal for Basic Latin. 
> There was no outreach to establish contact with the user community -- 
> the actual people who used the script as opposed to the "computer nerds" 
> who grew up with ANSI limitations and subsequent ISO code pages. Because 
> that's how Unicode rolled back then.  Unicode did what it was supposed 
> to do WRT Basic Latin.

I think most Unicode specialists have chosen to ignore this thread by 
this point. In their defense, I would like to point out that among the 
people who started Unicode, there were definitely many people who were 
very familiar with styling needs. As a simple example, Apple was 
interested in styled text from the very early beginning. Others were 
very familiar with electronic publishing systems. There were also 
members from the library community, who had their own requirements and 
character encoding standards. And many must have known TeX and other 
kinds of typesetting and publishing software. GML and then SGML were 
developed by IBM.

Based on these data points, and knowing many of the people involved, my 
description would be that decisions about what to encode as characters 
(plain text) and what to deal with on a higher layer (rich text) were 
taken with a wide and deep background, in a gradually forming industry 
consensus.

That doesn't mean that for all these decisions, explicit proposals were 
made. But it means that even where these decisions were made implicitly 
(at least on the level of the Consortium and the ISO/IEC and national 
standards body committees), they were made with a full and rich 
understanding of user needs and technology choices.

This led to the layering we have now: case distinctions at the 
character level, but style distinctions at the rich text level. Any good 
technology has layers, and it makes a lot of sense to keep established 
layers unless some serious problem is discovered. The fact that Twitter 
(currently) doesn't allow styled text and that there is a small number 
of people who (mis)use Math alphabets for writing italics,... on Twitter 
doesn't look like a serious problem to me.
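[Editorial note: for readers unfamiliar with the (mis)use in question, it maps ordinary Latin letters onto the Mathematical Alphanumeric Symbols block. A sketch; note the well-known gap, where the math-italic 'h' slot U+1D455 is reserved because U+210E PLANCK CONSTANT was already encoded:]

```python
def to_math_italic(s):
    """Map ASCII letters to Mathematical Italic code points (a sketch)."""
    out = []
    for ch in s:
        if 'A' <= ch <= 'Z':
            out.append(chr(0x1D434 + ord(ch) - ord('A')))
        elif ch == 'h':
            out.append('\u210E')   # U+1D455 is reserved; PLANCK CONSTANT fills the gap
        elif 'a' <= ch <= 'z':
            out.append(chr(0x1D44E + ord(ch) - ord('a')))
        else:
            out.append(ch)
    return ''.join(out)

# prints the same words in math-italic code points -- which look "cute"
# but break search, spell-checking, and screen readers
print(to_math_italic('looks italic'))
```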


> When someone points out that italics are used for disambiguation as well 
> as stress, the replies are consistent.
> 
> "That's not what plain-text is for."  "That's not how plain-text 
> works."  "That's just styling and so should be done in rich-text." 
> "Since we do that in rich-text already, there's no reason to provide for 
> it in plain-text."  "You can already hack it in plain-text by enclosing 
> the string with slashes."  And so it goes.

As such, these answers might indeed not look very convincing. But they 
are given in the overall framework of text representation in today's 
technology (see above). And please note that the end user doesn't ask 
for "italics in plain 

Re: A last missing link for interoperable representation

2019-01-14 Thread Martin J. Dürst via Unicode
On 2019/01/15 07:58, David Starner via Unicode wrote:
> On Mon, Jan 14, 2019 at 2:09 AM Tex via Unicode  wrote:

>> · Plain text still has tremendous utility and rich text is not always 
>> an option.
> 
> Where? Twitter has the option of doing rich text, as does any closed
> system. In fact, Twitter is rich text, in that it hyperlinks web
> addresses. That Twitter has chosen not to support italics is a choice.
> If users don't like this, they could go another system, or use
> third-party tools to transmit rich text over Twitter. The use of
> underscores or   markings for italics would be mostly
> compatible with human twitterers using the normal interface.

Yes indeed. Some similar services allow styling. One example is Slack, 
see e.g. 
https://get.slack.help/hc/en-us/articles/202288908-Format-your-messages.

Markdown has been mentioned as an example of how some basic styling 
options (bold, italic,...) can be implemented. Another choice is using 
an user interface component (menu,...). The user then doesn't have to 
care about any 'weird' conventions, even the simplest ones, nor about 
what happens in the background (most probably HTML), and already is 
familiar with it from other applications.

As for implementation complexity, it's not trivial, but there are quite 
a lot of components available, in particular for Web technology. It's 
not rocket science.
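[Editorial note: as a toy illustration of how little machinery such a convention needs — a generic sketch, not the actual parser of Slack, Markdown, or any other service:]

```python
import re

def emphasize(text):
    """Convert *emphasis* and **strong** markers to HTML (toy sketch)."""
    text = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', text)  # ** before *
    text = re.sub(r'\*(.+?)\*', r'<em>\1</em>', text)
    return text

print(emphasize('this is *italic* and **bold**'))
# this is <em>italic</em> and <strong>bold</strong>
```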

Actually, in some cases, it is even difficult to get rid of styling on 
the Web. I recently wanted to print out a map of how to get to a 
restaurant for a party. The restaurant's Web site was all black 
background. I copied the address to Google Maps and then tried to print 
it. Google Maps insists that the first page is just information about 
the location, so I copied the name of the restaurant from the Web page. 
What happened was that it still had the black background. So copy-paste 
on your average Web browser these days doesn't lose styles, even in 
cases where that would be desirable (because more legible).

So rich text technology is already way ahead when it comes to styled 
text. Do we want to encode background-color variant selectors in 
Unicode? If yes, how many?

[Hint: The last two questions are rhetorical.]

Regards,   Martin.



Re: A last missing link for interoperable representation

2019-01-14 Thread Martin J. Dürst via Unicode



On 2019/01/15 10:48, Mark E. Shoulson via Unicode wrote:
> On 1/14/19 4:21 PM, Asmus Freytag via Unicode wrote:

>> Short of that, I'm extremely leery of "leading" standardization; that 
>> is, encoding things that "might" be used.
>>
> It is certainly true that Unicode should not be (and wasn't, before 
> emoji)

Just to be precise, as already has been mentioned in this thread, the 
first batch of 'emoji' was in Unicode from the start (e.g. U+2603 
SNOWMAN, there since Unicode 1.1), I think from Zapf Dingbats. The 
second batch came from Japanese phones. So for the first two batches of 
emoji, Unicode did not do any "leading" standardization. It was only 
after that, for later batches, that this happened.

> in the business of encoding things that "could be used", but 
> rather, was for encoding things that *were* used.  This, naturally, 
> poses a chicken-and-egg problem which has been complained about by 
> several people in the past (including me).  Still, there are ways to 
> show that things that haven't been encoded are still being "used", as 
> people make shift to do what they can to use the script/notation, like 
> using PUA or characters that aren't QUITE right, but close...  And in 
> fairness, I'd have to say that the use of mathematical italics would 
> count in that regard.  It's hard to dispute that there is a demand for 
> it, just by looking at how people have been trying to do it!

"a demand" doesn't quantify the demand at all. My guess is that given 
the overall volume of Twitter or Facebook communication, the percentage 
of Math italics (ab)use is really, really low. It's impossible to say 
that there's no demand, but use cases like "look, I found these 
characters, aren't they cute" in some corners of some social services is 
not the same as "we urgently need this, otherwise we can't communicate 
in our language".

Regards,Martin.



Re: A last missing link for interoperable representation

2019-01-14 Thread Martin J . Dürst via Unicode
Hello James, others,

On 2019/01/14 15:24, James Kass via Unicode wrote:
> 
> Martin J. Dürst wrote,
> 
>  > I'd say it should be conservative. As the meaning of that word
>  > (similar to others such as progressive and regressive) may be
>  > interpreted in various way, here's what I mean by that.
>  >
>  > It should not take up and extend every little fad at the blink of an
>  > eye. It should wait to see what the real needs are, and what may be
>  > just a temporary fad. As the Mathematical style variants show, once
>  > characters are encoded, it's difficult to get people off using them,
>  > even in ways not intended.
> 
> A conservative approach to progress is a sensible position for computer 
> character encoders.  Taking a conservative approach doesn't necessarily 
> mean being anti-progress.
> 
> Trying to "get people off" using already encoded characters, whether or 
> not the encoded characters are used as intended, might give an 
> impression of being anti-progress.

Using the expression "get people off" was indeed somewhat ambiguous. Of 
course we cannot forbid people from using Mathematical alphanumerics. 
There's no standards police, neither for Unicode nor most other standards.


> Unicode doesn't enforce any spelling or punctuation rules.  Unicode 
> doesn't tell human beings how to pronounce strings of text or how to 
> interpret them.  Unicode doesn't push any rules about splitting 
> infinitives or conjugating verbs.
> 
> Unicode should not tell people how any written symbol must be 
> interpreted.  Unicode should not tell people how or where to deploy 
> their own written symbols.

Yes. But Unicode can very well say: These characters are for Math, and 
if you use them for anything else, that's your problem, and because they 
are used for Math, they support what's used in Math, and we won't add 
copies of accented characters or variant characters for style or [your 
proposal goes here] because that's not what Unicode is about. If you 
want real styling, then use applications that can do that, or try to 
convince your application provider to provide that.

(Well, Unicode is more or less saying just exactly that currently.)

And that's what I meant with "getting people off". If that then leads to 
less people (mis)using these characters, all the better.


> Perhaps fraktur is frivolous in English text.  Perhaps its use would 
> result in a new convention for written English which would enhance the 
> literary experience.  Italics conventions which have only been around a 
> hundred years or so may well turn out to be just a passing fad, so we 
> should probably give it a bit more time.

There's no need to give italic conventions more time. Of course they may 
die out, but they are very active now. And they are very actively 
supported in rich text, where they belong.


> Telling people they mustn't use Latin italics letter forms in computer 
> text while we wait to see if the practice catches on seems flawed in 
> concept.

The practice is already there. Lots of people use italics in rich text. 
That's just fine because that's the right thing to do. We don't need to 
muddy the waters.

Regards,   Martin.



Re: A last missing link for interoperable representation

2019-01-14 Thread Martin J . Dürst via Unicode
Hello James, others,

 From the examples below, it looks like a feature request for Twitter 
(and/or Facebook). Blaming the problem on Unicode doesn't seem to be 
appropriate.

Regards,   Martin.

On 2019/01/14 18:06, James Kass via Unicode wrote:
> 
> Not a twitter user, don't know how popular the practice is, but here's a 
> couple of links concerned with how to use bold or italics in Twitter 
> plain text messages.
> 
> https://www.simplehelp.net/2018/03/13/how-to-use-bold-and-italicized-text-on-twitter/
>  
> 
> https://mothereff.in/twitalics
> 
> Both pages include a form of caveat.  But the caveat isn't about the 
> intended use of the math alphanumerics.
> 
> The first page includes the following text as part of a "tweet":
> Just because you 헰헮헻 doesn’t mean you 혴혩혰혶혭혥 :)
> 
> And, as before, I have no idea how /popular/ the practice is.  But 
> here's some more links:
> 
> (web page from 2013)
> How To Write In Italics, Tweet Backwards And Use Lots Of Different ...
> https://www.adweek.com/digital/twitter-font-italics-backwards/
> 
> (This is copy/pasted *as-is* from the web page to plain-text)
> Bold and Italic Unicode Text Tool - 퐁퐨퐥퐝 풂풏풅 푖푡푎푙푖푐푠 - 
> YayText
> https://yaytext.com/bold-italic/
> Super cool unicode text magic. Write 퐛퐨퐥퐝 and/or 푖푡푎푙푖푐 
> updates on Facebook, Twitter, and elsewhere. Bold (serif) preview copy 
> tweet.
> 
> Michael Maurino [emoji redacted-JK] on Twitter: "Can I make italics on 
> twitter? 'cause ...
> https://twitter.com/iron_stylus/status/281991180064022528?lang=en
> 
> Charlie Brooker on Twitter: "How do you do italics on this thing again?"
> https://twitter.com/charltonbrooker/status/484623185862983680?lang=en
> 
> How to make your Facebook and Twitter text bold or italic, and other ...
> https://boingboing.net/2016/04/10/yaytext-unicode-text-styling.html
> Apr 10, 2016 - For years I've been using the Panix Unicode Text 
> Converter to create ironic, weird or simply annoying text effects for 
> use on Twitter, Facebook ...
> 
> How to change your Twitter font | Digital Trends
> https://www.digitaltrends.com/.../now-you-can-use-bold-italics-and-other-fancy-fonts-...
>  
> 
> Aug 14, 2013 - now you can use bold italics and other fancy fonts on 
> twitter isaac ... or phrase into your Twitter text box, and there you 
> have it: fancy tweets.
> 
> Twitter Fonts Generator (퓬퓸퓹픂 퓪퓷퓭 퓹퓪퓼퓽퓮) ― LingoJam
> https://lingojam.com/TwitterFonts
> You might have noticed that some users on Twitter are able to change the 
> font ... them to seemingly make their tweet font bold, italic, or just 
> completely different.
> 




Re: A last missing link for interoperable representation

2019-01-13 Thread Martin J . Dürst via Unicode
On 2019/01/14 01:46, Julian Bradfield via Unicode wrote:
> On 2019-01-12, Richard Wordingham via Unicode  wrote:
>> On Sat, 12 Jan 2019 10:57:26 + (GMT)

>> And what happens when you capitalise a word for emphasis or to begin a
>> sentence?  Is it no longer the same word?
> 
> Indeed. As has been observed up-thread, the casing idea is a dumb one!
> We are, however, stuck with it because of legacy encoding transported
> into Unicode. We aren't stuck with encoding fonts into Unicode.

No, the casing idea isn't actually a dumb one. As Asmus has shown, one 
of the best ways to understand what Unicode does with respect to text 
variants is that style works on spans of characters (words,...) and is 
rich text, but things that work on single characters are handled in 
plain text. Upper-case is definitely for the most part a single-character 
phenomenon (the recent Georgian MTAVRULI additions being the exception).

UPPER CASE can be used on whole spans of text, but that's not the main 
use case. And if UPPER CASE is used for emphasis, one way to do it (and 
the best way if this is actually a styling issue) is to use rich text 
and mark it up according to semantics, and then use some styling 
directive (e.g. CSS text-transform: uppercase) to get the desired look.


Another criterion is orthography. Schoolchildren learn when to 
capitalize a word and when not. Teachers check and correct it all the 
time. Grammar books and books for second language learners discuss 
capitalization, because it's part of orthography, the rules differ by 
language, and not getting it right will make the writer look bad.

But even most adults won't know the rules for what to italicize that 
have been brought up in this thread. Even if they have read books that 
use italic and bold in ways that have been brought up in this thread, 
most readers won't be able to tell you what the rules are. That's left 
to copy editors and similar specialist jobs.

There was a time when computers (and printers in particular) were 
single-case. There was some discussion about having to abolish case 
distinctions to adapt to computers, but fortunately, that wasn't necessary.

Regards,   Martin.



Re: A last missing link for interoperable representation

2019-01-13 Thread Martin J . Dürst via Unicode
On 2019/01/13 13:24, James Kass via Unicode wrote:
> 
> Mark E. Shoulson wrote,
> 
>  > This discussion has been very interesting, really.  I've heard what I
>  > thought were very good points and relevant arguments from both/all
>  > sides, and I confess to not being sure which I actually prefer.
> 
> It's subjective, really.  It depends on how one views plain-text and 
> one's expectations for its future.  Should plain-text be progressive, 
> regressive, or stagnant?  Because those are really the only choices. And 
> opinions differ.

I'd say it should be conservative. As the meaning of that word (similar 
to others such as progressive and regressive) may be interpreted in 
various ways, here's what I mean by that.

It should not take up and extend every little fad at the blink of an 
eye. It should wait to see what the real needs are, and what may be just 
a temporary fad. As the Mathematical style variants show, once 
characters are encoded, it's difficult to get people off using them, 
even in ways not intended.

Emoji have often been cited in this thread. But there are some 
important observations:

1) Emoji were added to Unicode only after it turned out that they were
widely used in Japanese character encodings, and dripping into
Unicode-based systems in large numbers but without any clearly
assigned code points. The Unicode Consortium didn't start encoding
them because they thought emoji were cute or progressive or anything
like that.

2) The Unicode Consortium is continuing to hold down the number of newly
encoded emoji by using an approximate limit for each year and a
strict process.

3) The Unicode Consortium is somewhat motivated to encode new emoji
because of the publicity surrounding them. That publicity might
subside sooner or later. It's difficult to imagine the same kind
of publicity for italics and friends.

> Most of us involved with Unicode probably expect plain-text to be around 
> for quite a while.  The figure bandied about in the past on this list is 
> "a thousand years".  Only a society of mindless drones would cling to 
> the past for a millennium.  So, many of us probably figure that 
> strictures laid down now will be overridden as a matter of course, over 
> time.
> 
> Unicode will probably be around for awhile, but the barrier between 
> plain- and rich-text has already morphed significantly in the relatively 
> short period of time it's been around.

Because whatever is encoded can't be "unencoded", it's clear that we can 
only move in one direction, and not back. But because we want Unicode to 
work for a long, long time, it's very important to be conservative.

> I became attracted to Unicode about twenty years ago.  Because Unicode 
> opened up entire /realms/ of new vistas relating to what could be done 
> with computer plain text.  I hope this trend continues.

I hope this trend only continues very slowly, if at all.

Regards,Martin.



Re: A last missing link for interoperable representation

2019-01-11 Thread Martin J . Dürst via Unicode
On 2019/01/11 16:13, James Kass via Unicode wrote:

> Styled Latin text is being simulated with math alphanumerics now, which 
> means that data is being interchanged and archived.  That's the user 
> demand illustrated.

Almost by definition, styled text isn't plain text, even if it's 
simulated by something else. And the simulation is highly limited, as 
the voicing examples and the fact that the math alphanumerics only cover 
basic Latin have shown.

Regards,   Martin.



Re: A last missing link for interoperable representation

2019-01-10 Thread Martin J . Dürst via Unicode
On 2019/01/11 10:48, James Kass via Unicode wrote:

> Is it true that many of the CJK variants now covered were previously 
> considered by the Consortium to be merely stylistic variants?

What is a stylistic variant or not is quite a bit more complicated for 
CJK than for scripts such as Latin. In some contexts, something may be 
just a stylistic variant, whereas in other contexts (e.g. person 
registries,...), it may be more than a stylistic distinction.

Also, in contrast to the issue discussed in the current thread, there's 
no consistent or widely deployed solution for such CJK variants in rich 
text scenarios such as HTML.

Regards,Martin.



Re: A sign/abbreviation for "magister"

2018-10-31 Thread Martin J . Dürst via Unicode
On 2018/11/01 03:10, Marcel Schneider via Unicode wrote:
> On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote:

>> When one does question the Académie about the fact, this is their
>> reply:
>>
>> Le fait de placer en exposant ces mentions est de convention
>> typographique ; il convient donc de le faire. Les seules exceptions
>> sont pour Mme et Mlle.
> Translation:
> “Superscripting these mentions is typographical convention;
> consequently it is convenient to do so. The only exceptions are
> for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", 
> Ms].”
>>
>> which, if my understanding of "convient" is correct, carefully does
>> quite say that it is *wrong* not to superscript, but that one should
>> superscript when one can because that is the convention in typography.

As for translation of "il convient", I think Julian is closer to the 
intended meaning. The verb "convenir" has several meanings (see e.g. 
https://www.collinsdictionary.com/dictionary/french-english/convenir), 
but especially in this impersonal usage, the meaning "it is advisable, 
it is right to, it is proper to" seems to be most appropriate in this 
context.

It may not at all be convenient (=practical) to use the superscripts, 
e.g. if they are not easily available on a keyboard.

Regards,   Martin.

(French isn't my native language, and nor is English)



Re: A sign/abbreviation for "magister"

2018-10-31 Thread Martin J . Dürst via Unicode
On 2018/10/31 03:51, Marcel Schneider via Unicode wrote:
> On 30/10/2018 at 18:59, Doug Ewell via Unicode wrote:
>>
>> Marcel Schneider wrote:
>>
>>> This use case is different from the use case that led to submit
>>> the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29:
>>
>> I guess this is intended as a compliment.
> 
> Right.
> 
>> While many of the people you
>> quoted do have doctoral degrees, many others of us do not.

And even those who have such degrees don't expect them to be used on a 
mailing list.

> Making a safe distinction is beyond my knowledge, safest is not to 
> discriminate.

Yes. The easiest way to not discriminate is to not use titles in mailing 
list discussions. That's what everybody else does, and what I highly 
recommend.

Regards,Martin.



Re: A sign/abbreviation for "magister"

2018-10-29 Thread Martin J . Dürst via Unicode
On 2018/10/29 05:42, Michael Everson via Unicode wrote:
> This is no different the Irish name McCoy which can be written MᶜCoy where 
> the raising of the c is actually just decorative, though perhaps it was once 
> an abbreviation for Mac. In some styles you can see a line or a dot under the 
> raised c. This is purely decorative.
> 
> I would encode this as Mʳ if you wanted to make sure your data contained the 
> abbreviation mark. It would not make sense to encode it as M=ͬ or anything 
> else like that, because the “r” is not modifying a dot or a squiggle or an 
> equals sign. The dot or squiggle or equals sign has no meaning at all. And I 
> would not encode it as Mr͇, firstly because it would never render properly 
> and you might as well encode it as Mr. or M:r, and second because in the IPA 
> at least that character indicates an alveolar realization in disordered 
> speech. (Of course it could be used for anything.)

I think this may depend on actual writing practice. In German at least, 
it is customary to have dots (periods) at the end of abbreviations, and 
using any other symbol, or not using the dot, would be considered an error.

The question of how to encode that dot is fortunately an easy one, but 
even if it were not, German-writing people would find a sentence such as 
"The dot or ... has no meaning at all." extremely weird. The dot is 
there (and in German, has to be there) because it's an abbreviation.

Regards,   Martin.



Re: Fallback for Sinhala Consonant Clusters

2018-10-14 Thread Martin J. Dürst via Unicode

Hello Richard,

On 2018/10/14 09:02, Richard Wordingham via Unicode wrote:

Are there fallback rules for Sinhala consonant clusters?  There are
fallback rules for Devanagari, but I'm not sure if they read across.

The problem I am seeing is that the Pali syllable 'ndhe' න්‍ධෙ 


Let's label this as (1)


is being rendered identically to a hypothetical Sinhalese
'nēdha' නේධ ,


It (2) doesn't look identical to (1) here (Thunderbird on Win 8.1).

Your mail is written as if you are speaking about a general phenomenon, 
but I guess there are differences depending on the font and rendering stack.



which in NFD is
, when I use a font that lacks the
conjunct.  (Most fonts lack the conjunct.)  The Devanagari rules and my
preference would lead to a fallback rendering as න්ධෙ  (Sinhalese
'ndhe'),


Here, this (3) looks like it has the same three components as (2), but 
the first two are exchanged, so that the piece that looks like @ is now 
in the middle (it was at the left in (1) and (2)).


Hope this helps.  Regards,Martin.


which is encoded as .  Is the rendering I am getting
technically wrong, or is it merely undesirable?

The ambiguity arises in part because, like the Brahmi script, the
Sinhala script uses its virama character as a vowel length indicator.

Missing touching consonants are being rendered almost as though there
were no ZWJ, but the combination of consonant and al-lakuna is being
rendered badly.

Richard.




--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan


Re: Dealing with Georgian capitalization in programming languages

2018-10-09 Thread Martin J. Dürst via Unicode

Hello Ken, others,

On 2018/10/03 06:43, Ken Whistler wrote:

But it seems to me that the problem you are citing can be avoided if you 
simply rethink what your "capitalize" means. It really should be 
conceived of as first lowercasing the *entire* string, and then 
titlecasing the *eligible* letters -- i.e., usually the first letter. 
(Note that this allows for the concept that titlecasing might then be 
localized on a per-writing-system basis -- the issue would devolve to 
determining what the rules are for "eligible" letters.) But the simple 
default would just be to titlecase the initial letter of each "word" 
segment of a string.


Note that conceived this way, for the Georgian mappings, where the 
titlecase mapping for Mkhedruli is simply the letter itself, this 
approach ends up with:


capitalize(mkhedrulistring) --> mkhedrulistring

capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> 
mkhedrulistring


Thus avoiding any mixed case.


I have been thinking through this. It seems quite appealing.

But I'm concerned there may be some edge cases. I have been able to come 
up with two so far:


- Applying this to a string starting with upper-case ẞ (U+1E9E).
  This may change ẞ → ß → Ss.
- Using the 'capitalize' method to (try to) get the titlecase
  property of a MTAVRULI character. (There's no other way
  currently in Ruby to get the titlecase property.)

There may be others. If you have some ideas, I'd appreciate to know 
about them.


This lets me wonder why the UTC didn't simply declare the titlecase 
property of MTAVRULI to be mkhedruli. Was this considered or not? The 
way things are currently set up, there seems to be no benefit of 
MTAVRULI being its own titlecase, because in actual use, that requires 
additional processing.


Regards,   Martin.
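Ken's recipe, and the ẞ edge case raised above, can be sketched in a few lines of Python. This is a hypothetical illustration, not the Ruby implementation; it assumes Python 3.8+, whose str.capitalize titlecases the first character (using full case mappings) and lowercases the rest, much like Ruby's 'capitalize'.

```python
# Ken's recipe: lowercase the *entire* string first, then titlecase
# the first (eligible) letter. str.capitalize already titlecases the
# first character and lowercases the rest, so prepending .lower()
# gives the desired behavior for all-uppercase input too.
def capitalize(s: str) -> str:
    return s.lower().capitalize()

# Georgian: both all-MTAVRULI and all-mkhedruli input come out as
# plain mkhedruli (titlecase of a mkhedruli letter is itself),
# avoiding any mixed case.
print(capitalize("ᲐᲑᲒ"))   # -> 'აბგ' (MTAVRULI lowered to mkhedruli)
print(capitalize("აბგ"))   # -> 'აბგ' (mkhedruli unchanged)

# The edge case mentioned above: U+1E9E LATIN CAPITAL LETTER SHARP S
# first lowercases to ß, whose full *titlecase* mapping is "Ss".
print(capitalize("\u1E9Eb"))  # -> 'Ssb', i.e. ẞ → ß → Ss
```

So the recipe indeed turns a leading ẞ into "Ss", matching the first edge case listed above.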


Re: Dealing with Georgian capitalization in programming languages

2018-10-04 Thread Martin J. Dürst via Unicode

Ken, Markus,

Many thanks for your ideas, which I noted at
https://bugs.ruby-lang.org/issues/14839.

Regards,   Martin.

On 2018/10/03 06:43, Ken Whistler wrote:


On 10/2/2018 12:45 AM, Martin J. Dürst via Unicode wrote:



My questions here are:
- Has this been considered when Georgian Mtavruli was discussed in the
  UTC?

Not explicitly, that I recall. The whole issue of titlecasing came up 
very late in the preparation of case mapping tables for Mtavruli and 
Mkhedruli for 11.0.


But it seems to me that the problem you are citing can be avoided if you 
simply rethink what your "capitalize" means. It really should be 
conceived of as first lowercasing the *entire* string, and then 
titlecasing the *eligible* letters -- i.e., usually the first letter. 
(Note that this allows for the concept that titlecasing might then be 
localized on a per-writing-system basis -- the issue would devolve to 
determining what the rules are for "eligible" letters.) But the simple 
default would just be to titlecase the initial letter of each "word" 
segment of a string.


Note that conceived this way, for the Georgian mappings, where the 
titlecase mapping for Mkhedruli is simply the letter itself, this 
approach ends up with:


capitalize(mkhedrulistring) --> mkhedrulistring

capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> 
mkhedrulistring


Thus avoiding any mixed case.




Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Martin J. Dürst via Unicode
Since the last discussion on Georgian (Mtavruli) on this mailing list, I 
have been looking into how to implement it in the Programming language Ruby.


Ruby has four case-conversion operations for its class String:

upcase:   convert all characters to upper case
downcase: convert all characters to lower case
swapcase: switch upper to lower and lower to upper case
capitalize:  uppercase (or title-case) the first character of the 
string, lowercase the rest


'upcase' and 'downcase' don't pose problems. 'swapcase' doesn't cause 
problems assuming the input doesn't have any problems. The only 
operation that can cause problems is 'capitalize'.


When I say "cause problems", I mean producing mixed-case output. I 
originally thought that 'capitalize' would be fine. It is fine for 
lowercase input: I stays lowercase because Unicode Data indicates that 
titlecase for lowercase Georgian letters is the letter itself. But it 
will produce the apparently undesirable Mixed Case for ALL UPPERCASE input.


My questions here are:
- Has this been considered when Georgian Mtavruli was discussed in the
  UTC?
- How have any other implementers (ICU,...) addressed this, in
  particular the operation that's called 'capitalize' in Ruby?

Many thanks in advance for your input,

Regards,   Martin.
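The mixed-case problem described above is easy to reproduce outside Ruby. A small Python sketch (assuming Python 3.8+, whose str.capitalize titlecases the first character, like Ruby's 'capitalize'):

```python
import unicodedata

MTAVRULI = "ᲐᲑᲒ"   # GEORGIAN MTAVRULI CAPITAL LETTERS AN, BAN, GAN

cap = MTAVRULI.capitalize()
# The titlecase of a Mtavruli letter is the letter itself, but the
# rest of the string is lowercased to mkhedruli -- mixed case results:
print(cap)  # 'Აბგ'
print(unicodedata.category(cap[0]), unicodedata.category(cap[1]))  # Lu Ll
```

The first letter stays Mtavruli (Lu) while the rest becomes mkhedruli (Ll), which is exactly the apparently undesirable output for ALL UPPERCASE input.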


Re: Shortcuts question

2018-09-16 Thread Martin J. Dürst via Unicode

On 2018/09/16 21:08, Marcel Schneider via Unicode wrote:


An additional level of complexity is induced by ergonomics, so that most 
non-Latin layouts may wish to stick
with QWERTY, and even ergonomic layouts in the footprints of August Dvorak 
rather than Shai Coleman are
likely to offer variants with legacy Virtual Key mapping instead of staying in 
congruency with graphics optimized
for text input.


From my personal experience: A few years ago, installing a Dvorak 
keyboard (which is what I use every day for typing) didn't remap the 
control keys, so that Ctrl-C was still on the bottom row of the left 
hand, and so on. For me, it was really terrible.


It may not be the same for everybody, but my experience suggests that it 
may be similar for some others, and that therefore such a mapping should 
only be voluntary, not default.


Regards,   Martin.



Re: UCD in XML or in CSV? (is: UCD in YAML)

2018-09-07 Thread Martin J. Dürst via Unicode

On 2018/09/08 04:47, Rebecca Bettencourt via Unicode wrote:

On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:


That version has been announced in the Windows 10 Hub several weeks ago.



And it only took them 33 years. :)


I used to joke that Notepad would add one single feature for each new 
version of Windows. I think that was when the Save-As feature was added.


For a long time, I have set up Notepad++ to come up when Notepad is invoked.

Regards,Martin.


Re: Diacritic marks in parentheses

2018-07-27 Thread Martin J. Dürst via Unicode

On 2018/07/27 01:27, Markus Scherer via Unicode wrote:

I would not expect for Ä+combining () above = Ä᪻ to look right except with
specialized fonts.
http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB==0

Even if it worked widely, I think it would be confusing.


Yes, for the moment. We don't know how this will develop.

(Famous German (grammatically incorrect) saying:
Man gewöhnt sich an allem, auch am Dativ. — "You get used to everything, even the dative.")


I think you are best off writing Arzt/Ärztin.


Regards,   Martin.
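For reference, the combining sequence Markus mentions is straightforward to construct in code, whatever the font support; a minimal Python sketch (how it displays is entirely up to the font, as noted above):

```python
import unicodedata

# Ä (U+00C4) followed by U+1ABB COMBINING PARENTHESES ABOVE
seq = "\u00C4\u1ABB"

print(seq)                         # Ä᪻ (font permitting)
print(unicodedata.name("\u1ABB"))  # COMBINING PARENTHESES ABOVE
print(len(seq))                    # 2 code points, one grapheme cluster
```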


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-05 Thread Martin J. Dürst via Unicode

Hello Rebecca,

On 2018/06/05 12:43, Rebecca T via Unicode wrote:


Something I’d love to see is translated keywords; shouldn’t be hard with a
line in the cargo.toml for a rudimentary lookup. Again, I’m of the opinion
that an imperfect implementation is better than no attempt. I remember
reading an article about a professor who translated the keywords in...
maybe it was Python? And found their students were much more engaged with
the material. Anecdotal, of course, but it’s stuck with me.


It would be good to have a reference for this. I can certainly see the 
point. But on the other hand, I have also heard that using keywords in a 
foreign language makes it clear that there may be a difference between 
the everyday use of the word and the specific formal meaning in the 
programming language. Then, there's also the problem that just 
translating keywords may work for languages with the same sentence 
structure, but not for languages with a completely different sentence 
structure. On top of that, keywords are just a start; 
class/function/method names in libraries would have to be translated, 
too, which would be much more work (especially if one wants to do a good 
job).


Regards,Martin.


Re: Hyphenation Markup

2018-06-02 Thread Martin J. Dürst via Unicode

Hello Richard,

On 2018/06/02 20:37, Richard Wordingham via Unicode wrote:


On 2018-06-02 at 06:44, Richard Wordingham via Unicode wrote:

In Latin text, one can indicate permissible line break opportunities
between grapheme clusters by inserting U+00AD SOFT HYPHEN.  What
low-end schemes, if any, exist for such mark-up within grapheme
clusters?



1) In the sequence



realisation of the break should definitely result in  on one line and in  on the next
line, whereas in visual order, character-2 should precede character-1.


My question goes a bit further than to Doug's: Why would you want to do 
such a thing? Are there actual scripts/languages where line breaks 
within grapheme clusters occur? If yes, what are there? Can you show 
actual examples, e.g. scans of documents,...?


In writing systems, there are almost always exceptions to simple rules, 
but in general, breaking a line *within* a grapheme cluster seems to be 
a bad idea.


Regards,   Martin.


Re: Uppercase ß

2018-05-29 Thread Martin J. Dürst via Unicode

On 2018/05/29 17:15, Hans Åberg via Unicode wrote:



On 29 May 2018, at 07:30, Asmus Freytag via Unicode  wrote:



An uppercase exists and it has formally been ruled as acceptable way to write 
this letter (mostly an issue for ALL CAPS as ß does not occur in word-initial 
position).
A./


Duden used one in 1957, but stated in 1984 that there is no uppercase version 
[1]. So it would be interesting with a reference to an official version.

1. https://en.wikipedia.org/wiki/ß


The English wikipedia may not be fully up to date.
See https://de.wikipedia.org/wiki/Großes_ß (second paragraph):

"Seit dem 29. Juni 2017 ist das ẞ Bestandteil der amtlichen deutschen 
Rechtschreibung.[2][3]"


Translated to English: "Since June 29, 2017, the ẞ is part of the 
official German orthography."


(As far as I remember the discussion (on this list?) last year, the ẞ 
(uppercase ß) is allowed, but not required.)


Regards,   Martin.
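The current case mappings can be checked directly; a quick Python sketch. Note that the *default* full uppercase mapping of ß is still SS — ẞ exists and is permitted, but is not the default mapping, which matches "allowed, but not required" above.

```python
# U+00DF LATIN SMALL LETTER SHARP S, U+1E9E LATIN CAPITAL LETTER SHARP S
print("straße".upper())       # 'STRASSE' -- default full mapping ß -> SS
print("\u1E9E")               # 'ẞ' -- the uppercase form exists...
print("\u1E9E".lower())       # 'ß' -- ...and lowercases back to ß
print("STRA\u1E9EE".lower())  # 'straße'
```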



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Martin J. Dürst via Unicode

Hello Sundar,

On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:

Hi,

In languages like Ruby or Java
(https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)),
functions to check if a character is alphabetic do that by looking for
the 'Alphabetic'  property (defined true if it's in one of the L
categories, or Nl, or has 'Other_Alphabetic' property). When parsing
Tamil text, this works out well for independent vowels and consonants
(which are in Lo), and for most dependent signs (which are in Mc or Mn
but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA)
is neither in Lo nor has 'Other_Alphabetic', and so leads to
concluding any string containing it to be non-alphabetic.

This doesn't make sense to me since the Virama  “◌்” as much of an
alphabetic character as any of the "Dependent Vowel" characters which
have been given the 'Other_Alphabetic' property. Is there a rationale
behind this difference, or is it an oversight to be corrected?


I suggest submitting an error report via 
https://www.unicode.org/reporting.html. I haven't studied the issue in 
detail (sorry, just no time this week), but it sounds reasonable to give 
the VIRAMA the 'Other_Alphabetic' property.


I'd recommend to mention examples other than Tamil in your report 
(assuming they exist).


BTW, what's the method you are using in Ruby? If there's a problem in 
Ruby (which I don't think; it's just using Unicode data), then please 
make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
should be able to follow up on that.


Regards,   Martin.
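The General_Category side of this is easy to inspect with Python's stdlib (which, note, does not expose Other_Alphabetic; Python's own str.isalpha() counts only the L* categories, so it shows an even stricter version of the gap Sundar describes):

```python
import unicodedata

KA    = "\u0B95"  # TAMIL LETTER KA           -> Lo (Alphabetic)
AA    = "\u0BBE"  # TAMIL VOWEL SIGN AA       -> Mc (has Other_Alphabetic)
PULLI = "\u0BCD"  # TAMIL SIGN VIRAMA (pulli) -> Mn (no Other_Alphabetic)

for ch in (KA, AA, PULLI):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
# U+0B95 TAMIL LETTER KA: Lo
# U+0BBE TAMIL VOWEL SIGN AA: Mc
# U+0BCD TAMIL SIGN VIRAMA: Mn
```

Both the vowel sign and the pulli are combining marks (Mc/Mn); only the Other_Alphabetic property separates them for Alphabetic-based checks, which is exactly the asymmetry the report would address.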


Re: Major vendors changing U+1F52B PISTOL  depiction from firearm to squirt gun

2018-05-23 Thread Martin J. Dürst via Unicode

On 2018/05/24 03:00, Michael Everson via Unicode wrote:

I consider it a significant semantic shift from the intended meaning of the 
character in the source Japanese character set.


Yes and no. I'd consider the semantic shift from a real pistol in a 
Japanese message to a real pistol in a message in the US quite significant.


The former, except for some extremely small and marginal segment of 
Japanese society, essentially has no "I might shoot you" implications at 
all. In the later case, that may be quite a bit different.


I'm not saying the (glyph or whatever you call it) change was okay. But 
when talking about semantics, it's important to not only consider 
surface semantics, but also the overall context.


Regards,   Martin.


Re: Is the Editor's Draft public?

2018-04-20 Thread Martin J. Dürst via Unicode

On 2018/04/20 18:12, Martin J. Dürst wrote:

There was an announcement for a public review period just recently. The 
review period is up to the 23rd of April. I'm not sure whether the 
announcement is up somewhere on the Web, but I'll forward it to you 
directly.


Sorry, found the Web address of the announcement at the very bottom of 
the mail: 
http://blog.unicode.org/2018/04/last-call-on-unicode-110-review.html


Regards,   Martin.


Re: Is the Editor's Draft public?

2018-04-20 Thread Martin J. Dürst via Unicode

Hello Henri,

On 2018/04/20 17:15, Henri Sivonen via Unicode wrote:

Is the Editor's Draft of the Unicode Standard visible publicly?

Use case: Checking if things that I might send feedback about have
already been addressed since the publication of Unicode 10.0.


There was an announcement for a public review period just recently. The 
review period is up to the 23rd of April. I'm not sure whether the 
announcement is up somewhere on the Web, but I'll forward it to you 
directly.


Regards,   Martin.




Re: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-02 Thread Martin J. Dürst via Unicode

On 2018/04/03 10:56, Mark E. Shoulson via Unicode wrote:
Whew!  Thanks for explaining the joke! Everyone here really thought they 
were serious.  Maybe you should write to the authors of the RFC and 
explain to them that their growth-function is incorrect.  I'm sure 
they'd be glad of the correction.


I'm sure they know they exaggerated quite a bit. I'm also sure they 
trust the Unicode Consortium to know when they would have to enlarge the 
code space, if ever.


Regards,   Martin.


Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-01 Thread Martin J. Dürst via Unicode
Please enjoy. Sorry for being late with forwarding, at least in some 
parts of the world.


Regards,   Martin.


 Forwarded Message 
Subject: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode
Date: Sun,  1 Apr 2018 08:29:00 -0700 (PDT)
From: rfc-edi...@rfc-editor.org
Reply-To: i...@ietf.org
To: ietf-annou...@ietf.org, rfc-d...@rfc-editor.org
CC: drafts-update-...@iana.org, rfc-edi...@rfc-editor.org

A new Request for Comments is now available in online RFC libraries.

RFC 8369

Title:          Internationalizing IPv6 Using 128-Bit Unicode
Author:         H. Kaplan
Status:         Informational
Stream:         Independent
Date:           1 April 2018
Mailbox:        hadr...@128technology.com
Pages:          11
Characters:     24429
Updates/Obsoletes/SeeAlso: None
I-D Tag:        draft-kaplan-unicode-ipv6-00.txt
URL:            https://www.rfc-editor.org/info/rfc8369
DOI:            10.17487/RFC8369

It is clear that Unicode will eventually exhaust its supply of code
points, and more will be needed.  Assuming ISO and the Unicode
Consortium follow the practices of the IETF, the next Unicode code
point size will be 128 bits.  This document describes how this future
128-bit Unicode can be leveraged to improve IPv6 adoption and finally
bring internationalization support to IPv6.


INFORMATIONAL: This memo provides information for the Internet community.
It does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.

This announcement is sent to the IETF-Announce and rfc-dist lists.
To subscribe or unsubscribe, see
  https://www.ietf.org/mailman/listinfo/ietf-announce
  https://mailman.rfc-editor.org/mailman/listinfo/rfc-dist

For searching the RFC series, see https://www.rfc-editor.org/search
For downloading RFCs, see https://www.rfc-editor.org/retrieve/bulk

Requests for special distribution should be addressed to either the
author of the RFC in question, or to rfc-edi...@rfc-editor.org.  Unless
specifically noted otherwise on the RFC itself, all RFCs are for
unlimited distribution.


The RFC Editor Team
Association Management Solutions, LLC





Re: A sketch with the best-known Swiss tongue twister

2018-03-13 Thread Martin J. Dürst via Unicode

On 2018/03/09 21:24, Mark Davis ☕️ wrote:

There are definitely many dialects across Switzerland. I think that for
*this* phrase it would be roughly the same for most of the population, with
minor differences (eg 'het' vs 'hät'). But a native speaker like Martin
would be able to say for sure.


Yes indeed. The differences would be in the vowels (not necessarily 
minor, but your mileage may vary), and the difficulty of this tongue 
twister is very much on the consonants.


Regards,   Martin.


Re: A sketch with the best-known Swiss tongue twister

2018-03-13 Thread Martin J. Dürst via Unicode

On 2018/03/10 20:26, philip chastney via Unicode wrote:


I would make the following observations on terminology in practice:



-- the newspapers in Zurich advertised courses in "Hoch Deutsch", for those who 
needed to deal with foreigners


This should probably be written 'the newspapers in Zurich advertised 
courses in "Hochdeutsch", for foreigners'. Hochdeutsch (Standard German) 
is the language used in school, and in writing, and while there may be 
some specialized courses for Swiss people who didn't do well throughout 
grade school and want to catch up, that's not what the advertisements 
are about.




-- in Luxemburg, the same language was referred to as Luxemburgish (or Letzeburgesch, 
which is Luxemburgish for "Luxemburgish ")
 (I forget what the Belgians called the language spoken in Ostbelgien)

-- I was assured by a Luxemburgish-speaking car mechanic, with a Swiss German 
speaking wife, that the two languages (dialects?) were practically identical, 
except for the names of some household items


I can't comment on this, because I don't remember to ever have listened 
to somebody speaking Letzeburgesch.



in short, there seems little point in making distinctions which cannot be 
precisely identified in practice

there appear to be significant differences between High German and 
(what the natives call) Swiss German

there are far fewer significant differences between Swiss German and the other 
spoken Germanic languages found on the borders of Germany


In terms of linguistic analysis, that may be true. But virtually every 
native Swiss German speaker would draw a clear line between Swiss German 
(including the dialect(s) spoken in the upper Valais (Oberwallis), which 
are classified differently by linguists) and other varieties such as 
Swabian, Alsatian, Vorarlbergian, or even Letzeburgesch (which I have 
never seen classified as Alemannic).


The reason for this is not so much basic linguistics, but much more a) 
vocabulary differences ranging from food to administrative terms, and b) 
the fact that people hear many different Swiss dialects on Swiss Radio 
and Television, while that's not the case for the dialects from outside 
the borders. So in practice, Swiss German can be delineated quite 
precisely, but more from a sociolinguistic and vocabulary perspective 
than from a purely evolutionary/historic linguistic perspective.


[Disclaimer: I'm not a linguist.]

Regards,   Martin.


Re: base1024 encoding using Unicode emojis

2018-03-12 Thread Martin J. Dürst via Unicode

On 2018/03/12 02:07, Keith Turner via Unicode wrote:


Yeah, it certainly results in larger utf8 strings.  For example a sha256
hash is 112 bytes when encoded as Ecoji utf8.  For base64, sha256 is 44
bytes.

Even though it's more bytes, Ecoji has fewer visible characters than base64
for sha256.  Ecoji has 28 visible characters and base64 44.  So that makes
me wonder which one would be quicker for a human to verify on average?
Also, which one is more accurate for a human to verify? I have no idea. For
accuracy, it seems like a lot of thought was put into the visual uniqueness
of Unicode emojis.


Using emoji to help people verify security information is an interesting 
idea. What I'm afraid is that even if emoji are designed with 
distinctiveness in mind, some people may have difficulties distinguish 
all the various face variants. Also, while emoji get designed so that 
in-font distinguishability is high, the same may not apply across fonts 
(e.g. if one has to compare a printed version with a version on-screen).
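The byte and character counts quoted above can be checked with a quick sketch (assuming, as in Ecoji's design, that 5 input bytes are packed into 4 emojis of 10 bits each, and that every emoji in its alphabet occupies 4 bytes in UTF-8, since they all lie in the supplementary planes):

```python
import base64, hashlib, math

digest = hashlib.sha256(b'example').digest()
assert len(digest) == 32                  # sha256 is 32 bytes

b64 = base64.b64encode(digest)
print(len(b64))                           # 44 visible characters

# Ecoji packs 5 bytes into 4 emojis; ceil(32/5) = 7 groups of 4 emojis.
emojis = math.ceil(len(digest) / 5) * 4
print(emojis)                             # 28 visible characters
print(emojis * 4)                         # 112 bytes in UTF-8
```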


Regards,   Martin.



2018-03-11 6:04 GMT+01:00 Keith Turner via Unicode :


I created a neat little project based on Unicode emojis.  I thought
some on this list may find it interesting.  It encodes arbitrary data
as 1024 emojis.  The project is called Ecoji and is hosted on github
at https://github.com/keith-turner/ecoji

Below are some examples of encoding and decoding.

$ echo 'Unicode emojis are awesome!!' | ecoji
卵駱

$ echo 卵駱   | ecoji -d
Unicode emojis are awesome!!

I would eventually like to create a base4096 version when there are more
emojis.




Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-09 Thread Martin J. Dürst via Unicode

On 2018/03/09 10:22, Philippe Verdy via Unicode wrote:

As well how Chinese/Japanese post offices handle addresses written with
sinograms for personal names ? Is the expanded IDS form acceptable for
them, or do they require using Romanized addresses, or phonetic
approximations (Bopomofo in China, Kanas in Japan, Hangul in Korea) ?


They just see the printed form, not an encoding, and therefore no IDS. 
Many addresses use handwriting, which has its own variability. 
Variations such as those covered by IDSes are easily recognizable by 
people as being the same as the 'base' character, and OCR systems, if 
they are good enough to decipher handwriting, can handle such cases, 
too. Romanized addresses will be delivered because otherwise it would be 
difficult for foreigners to send anything. Pure Kana should work in 
Japan, although the postal employee will have a second look because it's 
extremely unusual. For Korea, these days, it will be mostly Hangul; I'm 
not sure whether addresses with Hanja would incur a delay. My guess 
would be that Bopomofo wouldn't work in mainland China (might work in 
Taiwan, not sure).


Regards,   Martin.


Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-09 Thread Martin J. Dürst via Unicode

On 2018/03/09 10:17, Philippe Verdy via Unicode wrote:

This still leaves the question about how to write personal names !
IDS alone cannot represent them without enabling some "reasonable"
ligaturing (they don't have to match the exact strokes variants for optimal
placement, or with all possible simplifications).
I'm curious to know how China, Taiwan, Singapore or Japan handle this (for
official records or in banks): like our personal signatures (as digital
images), and then using a simplified official record (including the
registration of romanized names)?


This question seems to assume more of a difference between alphabetic 
and ideographic traditions than there actually is. A name in ideographs, in the same way as a 
name in alphabetic characters, is defined by the characters that are 
used, not by stuff like stroke variants, etc. And virtually all names, 
even before the introduction of computers, and even more after that, use 
reasonably frequent characters.


The difference, at least in Japan, is that some people keep the 
ideograph before simplification in their official records, but they may 
or may not insist on its use in everyday practice. In most cases, both a 
traditional and a simplified variant are available. Examples are 広/廣, 
高/髙, 崎/﨑, and so on. I regularly hit such cases when grading, because 
our university database uses the formal (old) one, where students may 
not care about it and enter the new one on some system where they have 
to enter their name by themselves.


Apart from that, at least in Japan, signatures are used extremely 
rarely; it's mostly stamped seals, which are also kept as images by 
banks,...


Regards,   Martin.



Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-04 Thread Martin J. Dürst via Unicode

Hello John,

On 2018/03/01 12:31, via Unicode wrote:

Pen, or brush and paper is much more flexible. With thousands of names 
of people and places still not encoded I am not sure if I would describe 
hans (simplified Chinese characters) as well supported. nor with current 
policy which limits China with over one billion people to submitting 
less than 500 Chinese characters a year on average, and names not being 
all to be added, it is hard to say which decade hans will be well 
supported.


I think this contains several misunderstandings. First, of course 
pen/brush and paper are more flexible than character encoding, but 
that's true for the Latin script, too.


Second, while I have heard that people create new characters for naming 
a baby in a traditional Han context, I haven't heard about this in a 
simplified Han context. And it's not frequent at all; naming a baby John 
in the US is way more frequent than, let's say, Qvtwzx. 
I'd also assume that China has regulations on what characters can be 
used to name a baby, and that the parents in this age of smartphone 
communication will think at least twice before giving their baby a name 
that they cannot send to their relatives via some chat app.


Third, I cannot confirm or deny the "500 characters a year" limit, but 
I'm quite sure that if China (or Hong Kong, Taiwan,...) had a real need 
to encode more characters, everybody would find a way to handle these.


Due to the nature of your claims, it's difficult to falsify many of 
them. It would be easier to prove them (assuming they were true), so if 
you have any supporting evidence, please provide it.


Regards,   Martin.


John Knightley




Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-02-28 Thread Martin J. Dürst via Unicode

On 2018/02/28 19:38, Janusz S. Bień via Unicode wrote:

On Tue, Feb 27 2018 at 13:45 -0800, announceme...@unicode.org writes:


The 157 new Emoji are now available for adoption, to help the Unicode
Consortium’s work on digitally disadvantaged languages.


I'm quite curious what the relation is between the new emojis and the
digitally disadvantaged languages. I see none.


I think this was mentioned before on this list, in particular by Mark:
The money collected from character adoptions (where emoji are a 
prominent target) is (mostly?) used to support work on not-yet-encoded 
(thus digitally disadvantaged) scripts. See e.g. the recent announcement 
at 
http://blog.unicode.org/2018/02/adopt-character-grant-to-support-three.html.


Regards,   Martin.


Re: 0027, 02BC, 2019, or a new character?

2018-02-22 Thread Martin J. Dürst via Unicode

On 2018/02/21 12:15, Michael Everson via Unicode wrote:

I absolutely disagree. There’s a whole lot of related languages out there, and 
the speakers share some things in common. Orthographic harmonization between 
these languages can ONLY help any speaker of one to access information in any 
of the others. That expands people’s worlds. That would be a good goal.


It's definitely a good goal. But it's not rocket science to learn the 
different orthographies. If the languages are similar, then different 
orthographies are just a minor nuisance. As an example, German and Dutch 
also have different orthographies, but that's really a very minor issue 
when learning one language from the other even though these languages 
are very close.


Regards,   Martin.


Re: IDC's versus Egyptian format controls

2018-02-21 Thread Martin J. Dürst via Unicode

On 2018/02/17 08:25, James Kass via Unicode wrote:


Some people studying Han characters use the IDCs to illustrate the
ideographs and their components for various purposes.


Well, as far as I understand, this was their original (and is still 
their main) purpose.



For example:

U-0002A8B8 ꢸ ⿰土土
U-0002A8B9 ꢹ ⿰土凡
U-0002A8BA ꢺ ⿱夂土
U-0002A8BB ꢻ ⿰土亡
U-0002A8BC ꢼ ⿰土无
U-0002A8BD ꢽ ⿰土冇
U-0002A8BE ꢾ ⿰土攴
U-0002A8BF ꢿ ⿰土月
U-0002A8C0 ꣀ ⿰土化
U-0002A8C1 ꣁ ⿰土丰


Is it only me or did you get some of this data wrong?

For me, it looks definitely like
U-0002A8BC ꢼ ⿰土化
rather than U-0002A8BC ꢼ ⿰土无,
and U-0002A8BF ꢿ ⿰土水
rather than U-0002A8BF ꢿ ⿰土月,
and changes seem to be needed for all the others, too. (The descriptions 
seem to appear four lines below the characters they actually belong to.)



It would be probably be disconcerting if the display of those
sequences changed into their respective characters overnight.


Yes indeed.

Regards,   Martin.


Re: Why so much emoji nonsense?

2018-02-14 Thread Martin J. Dürst via Unicode

On 2018/02/15 10:49, James Kass via Unicode wrote:


Yes, except that Unicode "supported" all manner of things being
interchanged by setting aside a range of code points for private use.
Which enabled certain cell phone companies to save some bandwidth by
assigning various popular in-line graphics to PUA code points.


The original Japanese cell phone carrier emoji where defined in the 
unassigned area of Shift_JIS, not Unicode. Shift_JIS doesn't have an 
official private area, but using the empty area by companies had already 
happened for Kanji (by IBM, NEC, Microsoft). Also, there was some 
transcoding software initially that mapped some of the emoji to areas in 
Unicode besides the PUA, based on very simplistic conversion.



The
"problem" was that these phone companies failed to get together on
those PUA code point assignments, so they could not exchange their
icons in a standard fashion between competing phone systems.  [Image
of the world's smallest violin playing.]


Emoji were originally a competitive device. As an example, NTT Docomo 
allowed the ticket service PIA to have an emoji for their service, most 
probably in order to entice them to sign up to participate in the 
original I-mode (first case of Web on mobile phones) service. Of course, 
that specific emoji (or was it several) wasn't encoded in Unicode 
because of trademark issues.


Regards,Martin.


Re: Keyboard layouts and CLDR

2018-01-30 Thread Martin J. Dürst via Unicode

On 2018/01/30 16:18, Philippe Verdy via Unicode wrote:


  - Adding Y to the list of allowed letters after the dieresis deadkey to
produce "Ÿ" : the most frequent case is L'HAŸE-LÈS-ROSES, the official name
of a French municipality when written with full capitalisation, almost all
spell checkers often forget to correct capitalized names such as this one.


Wikipedia has this as L'Haÿ-les-Roses (see 
https://fr.wikipedia.org/wiki/L'Haÿ-les-Roses). It surely would be 
L'HAŸ-LES-ROSES, and not L'HAŸE-LÈS-ROSES, when capitalized. I of course 
know of the phenomenon that in French, sometimes the accents on 
upper-case letters are left out, but I haven't heard of a reverse 
phenomenon yet.
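For what it's worth, the uppercasing itself is unproblematic, since ÿ (U+00FF) has had Ÿ (U+0178) as its uppercase mapping all along (a quick Python check):

```python
name = "L'Haÿ-les-Roses"
print(name.upper())                    # L'HAŸ-LES-ROSES
assert '\u00FF'.upper() == '\u0178'    # ÿ -> Ÿ
```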


Regards,   Martin.


Re: 0027, 02BC, 2019, or a new character?

2018-01-22 Thread Martin J. Dürst via Unicode

On 2018/01/23 09:55, James Kass via Unicode wrote:


Any Kazakh/Qazaq student ambitious enough to study a foreign language
such as English is already sophisticated enough to easily distinguish
differing digraph values between the two languages.  English speakers
face distinctions such as the difference between the "ch" in "chigger"
versus "chiffon" daily without any apparent danger of confusion.


Well, there are many many easier orthographies than English, so I'd 
understand if the Kazakh don't want to take English as an example.



With
so much push-back, along with technical objections, hopefully the
government will reconsider the apostrophe situation and go with
digraphs or diacritics.


I very much hope so too. One way to avoid confusion is to use one 
specific letter only as the second letter in digraphs. With the current 
orthography, they don't use w and x, so they could use one of these. But 
personally, I'd find accents more visually pleasing.


Regards,   Martin.


Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2017-12-21 Thread Martin J. Dürst via Unicode



On 2017/12/15 07:40, Richard Wordingham via Unicode wrote:

On Mon, 11 Dec 2017 21:45:23 +
Cibu Johny (സിബു)  wrote:



Malayalam could be a similar story. In case of Malayalam, it can be
font specific because of the existence of traditional and reformed
writing styles. A conjunct might be a ligature in traditional; and it
might get displayed with explicit virama in the reformed style. For
example see the poster with word ഉസ്താദ് broken as [u, sa-virama,
ta-aa, da-virama] - as it is written in the reformed style. As per
the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama].
These breaks would be used by the traditional style of writing.


Working round that seems to be tricky.  The best I can think of is to
have two different locales, traditional and reformed, and hope that the
right font is selected.  It doesn't seem at all straightforward to
work out what the font is doing even from a character to glyph map
without knowing what the glyphs are.  I'm not sure how one should have
the difference designated - language variants, or two scripts?


I'm not at all familiar with Malayalam, but from my experience with 
typing Japanese (where the average kana character requires two 
keystrokes for input, but only one for deleting) would lead to different 
advice. When typing, it is very helpful to know how many times one has 
to hit backspace when making an error. This kind of knowledge is usually 
assimilated into what one calls muscle memory, i.e. it is done without 
thinking about it. I would guess that it would be very difficult to 
maintain two different kinds of muscle memory for typing Malayalam. (My 
assumption is that the populations typing traditional and reformed 
writing styles are not disjoint.)


Regards,   Martin.


Re: Word_Break for Hieroglyphs

2017-12-20 Thread Martin J. Dürst via Unicode

On 2017/12/20 17:46, Richard Wordingham via Unicode wrote:


In an implementation that offered genuine whole word selection, and
thus tackled with the challenges of Chinese, Japanese, Korean and
Vietnamese (both scripts, not just CJKV) as well as Thai, I would
expect the selections to be bounded by word boundaries.  Thus, if the
cited line break (labelled by '6') were not in the text, I would expect
double-clicking on the quadrat G37:Aa13:Aa13 to select all three words.


This may be common knowledge to some, but I just had a Japanese document 
open in MS Word, and tried what happened on double-clicking. What it 
does is select same-script runs. This means that a run of kanji, a run 
of hiragana, or a run of katakana (interestingly, the (kata)kana length 
mark is treated as a forth script) is selected. This is of course not 
the same as words, but it can match, and it comes close in terms of 
offering something for editorial convenience while being easy to implement.
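The behavior described above can be approximated in a few lines of Python. This is only a sketch that classifies by code-point block, which is cruder than real script properties, and it deliberately does not reproduce Word's special treatment of the length mark U+30FC (which falls inside the katakana block here):

```python
import itertools

def block(ch):
    # Classify a character by the Unicode block it falls in.
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F: return 'hiragana'
    if 0x30A0 <= cp <= 0x30FF: return 'katakana'  # includes U+30FC here
    if 0x4E00 <= cp <= 0x9FFF: return 'kanji'
    return 'other'

text = '漢字とカタカナを選択'
runs = [''.join(g) for _, g in itertools.groupby(text, key=block)]
print(runs)   # ['漢字', 'と', 'カタカナ', 'を', '選択']
```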


Regards,   Martin.


Interesting UTF-8 decoder

2017-10-09 Thread Martin J. Dürst via Unicode

A friend of mine sent me a pointer to
http://nullprogram.com/blog/2017/10/06/, a branchless UTF-8 decoder.

Regards,   Martin.


Re: IBM 1620 invalid character symbol

2017-09-26 Thread Martin J. Dürst via Unicode

On 2017/09/26 22:03, John W Kennedy via Unicode wrote:

I don’t know what your snippet is from, but the normally authoritative IBM 
manual, A26-5706-3, IBM 1620 CPU Model 1 (July, 1965) displays what is clearly 
the Cyrillic letter. Whether it should be regarded as that, or as a distinct 
character, is another question. See 
http://www.bitsavers.org/pdf/ibm/1620/A26-5706-3_IBM_1620_CPU_Model_1_Jul65.pdf


What page?

Regards,   Martin.


Re: Assamese and Unicode.

2017-09-05 Thread Martin J. Dürst via Unicode

Sorry for the long delay of this answer.

On 2017/08/24 07:35, David Faulks via Unicode wrote:

It appears that the Indian government will submit an 'Assamese' proposal.

http://silchar.com/unicode-standard-for-assamese-in-the-offing/

Since everything I know about Assamese Script indicates that it is basically 
the same as Bengali and the Unicode Assamese controversy is derived entirely 
from a sub-nationalistic fit over character and script names, I expect that 
this proposal will not be accepted.


The best thing to do is to have lots of content in Assamese in Unicode. 
This will show that things just work.


This reminds me of the 1990s, when many "experts" in Japan were 
complaining that Han Unification would destroy Japanese culture, but 
were writing this using software that used Unicode inside, thus 
providing a proof to the contrary.


So the best thing to happen is to have this discussion in Assamese 
rather than in English, because then people eventually will see that 
there's no problem.


Regards,Martin.


However, 'popular nationalism' will probably be used to attack Unicode then.

David Faulks




Inadvertent copies of test data in L2/17-197 ?

2017-08-07 Thread Martin J. Dürst via Unicode

Hello Henry,

I just had a look at 
http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf to use the test 
data in there for Ruby.


I was under the impression from previous looks at it that it contained a 
lot of test data. However, when I looked at the test data more carefully 
(I had read the text before the test data carefully at least two times 
before, but not looked at the test data in that much detail), I 
discovered that there might be up to 7 copies of the same data. The 
first one starts on page 9, and then there's a new one about every 4 or 
5 pages.


Can you check/confirm? Any idea what might have caused this?

Regards,   Martin.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-05 Thread Martin J. Dürst via Unicode

Hello Mark,

On 2017/08/04 09:34, Mark Davis ☕️ wrote:

FYI, the UTC retracted the following.


Thanks for letting us know!

Regards,   Martin.


[151-C19] Consensus: Modify the section on "Best Practices for Using 
FFFD" in section "3.9 Encoding Forms" of TUS per the recommendation in 
L2/17-168, for Unicode version 11.0.

Mark


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-04 Thread Martin J. Dürst via Unicode

On 2017/06/02 04:54, Doug Ewell via Unicode wrote:

Richard Wordingham wrote:


even supporting 6-byte patterns just in case 20.1 bits eventually turn
out not to be enough,


Sorry to be late with this, but if 20.1 bits turn out to not be enough, 
what about 21 bits?


That would still limit UTF-8 to four bytes, but would almost double the 
code space. Assuming (conservatively) that it will take about a century 
to fill up all 17 (well, actually 15, because two are private) planes, 
this would give us another century.
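The arithmetic behind "almost double" is simple: a four-byte UTF-8 sequence carries 3 + 6 + 6 + 6 = 21 payload bits (a quick check):

```python
four_byte_max = 2 ** 21          # 2,097,152 code points expressible in 4 bytes
current_space = 17 * 0x10000     # 1,114,112 code points (planes 0-16)
print(four_byte_max / current_space)   # about 1.88, i.e. almost double
```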


Just one more crazy idea :-(.

Regards,   Martin.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode

Hello Karl, others,

On 2017/05/27 06:15, Karl Williamson via Unicode wrote:

On 05/26/2017 12:22 PM, Ken Whistler wrote:


On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:

The link provided about the PRI doesn't lead to the comments.



PRI #121 (August, 2008) pre-dated the practice of keeping all the 
feedback comments together with the PRI itself in a numbered directory 
with the name "feedback.html". But the comments were collected 
together at the time and are accessible here:


http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm

The minutes simply capture the consensus to adopt Option #2 from PRI 
#121, and the relevant action items.


I now return the floor to the distinguished disputants to continue 
litigating history. ;-)


--Ken




The reason this discussion got started was that in December, someone 
came to me and said the code I support does not follow Unicode best 
practices, and suggested I need to change, though no ticket (yet) has 
been filed.  I was surprised, and posted a query to this list about what 
the advantages of the new approach are.


Can you provide a reference to that discussion? I might have missed it 
in December.


There were a number of replies, 
but I did not see anything that seemed definitive.  After a month, I 
created a ticket in Unicode and Markus was assigned to research it, and 
came up with the proposal currently being debated.


Which is to completely reverse the current recommendation in Unicode 
9.0. While I agree that this might help you fend off a bug report, it 
would create chances for bug reports for Ruby, Python3, many if not all 
Web browsers,...



Looking at the PRI, it seems to me that treating an overlong as a single 
maximal unit is in the spirit of the wording, if not the fine print.


In standards, the "fine print" matters.

That seems to be borne out by Markus, even with his stake in ICU, 
supporting option #2.


Well, at http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121, I 
also supported option 2, with code behind it.


Looking at the comments, I don't see any discussion of the effect of 
this on overlong treatments.  My guess is that the effect change was 
unintentional.


I agree that it was probably not considered explicitly. But overlongs 
were disallowed for security reasons, and once the definition of UTF-8 
was tightened, "overlongs" essentially did not exist anymore. 
Essentially, "overlong" is a word like "dragon" or "ghost": Everybody 
knows what it means, but everybody knows they don't exist.


[Just to be sure, by the above, I don't mean that a sequence such as
C0 B0 cannot appear somewhere in some input. But C0 is not UTF-8 all by 
itself, and there is no need to see C0 B0 as a (ghost) sequence.]
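As an illustration of how a decoder following the Unicode 9.0 recommendation treats such byte sequences, Python 3's built-in decoder (which implements one U+FFFD per maximal subpart) behaves as follows:

```python
# C0 can never start a well-formed sequence, so C0 B0 is treated as two
# separate ill-formed subsequences: two U+FFFDs, not one.
assert b'\xC0\xB0'.decode('utf-8', errors='replace') == '\uFFFD\uFFFD'

# E3 81 could start a well-formed three-byte sequence but lacks its
# continuation; the maximal subpart E3 81 becomes a single U+FFFD.
assert b'\xE3\x81A'.decode('utf-8', errors='replace') == '\uFFFDA'
```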



So I have code that handled overlongs in the only correct way possible 
when they were acceptable,


No. As long as they were acceptable, they wouldn't have been replaced by 
an FFFD.



and in the obvious way after they became illegal,


Why? A change was necessary from producing an actual character to 
producing some number of FFFDs. It may have been easier to produce just 
a single FFFD, but that depends on how the code was organized.


and now without apparent discussion (which is very much akin to 
"flimsy reasons"), it suddenly was no longer "best practice".


Not 'now', but almost 9 years ago. And not "without apparent 
discussion", but with an explicit PRI.


And that 
change came "rather late in the game".  That this escaped notice for 
years indicates that the specifics of REPLACEMENT CHAR handling don't 
matter all that much.


I agree. You haven't even received a ticket yet.


To cut to the chase, I think Unicode should issue a Corrigendum to the 
effect that it was never the intent of this change to say that treating 
overlongs as a single unit isn't best practice.  I'm not sure this 
warrants a full-fledge Corrigendum, though.  But I believe the text of 
the best practices should indicate that treating overlongs as a single 
unit is just as acceptable as Martin's interpretation.


I'd essentially be fine with that, under the condition that the current 
recommendation is maintained as a clearly identified recommendation, so 
that Python3, Ruby, Web standards and browsers, and so on can easily 
refer to it.


Regards,   Martin.

I believe this is pretty much in line with Shawn's position.  Certainly, 
a discussion of the reasons one might choose one interpretation over 
another should be included in TUS.  That would likely have satisfied my 
original query, which hence would never have been posted.




Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode

Hello Markus, others,

On 2017/05/27 00:41, Markus Scherer wrote:

On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst <due...@it.aoyama.ac.jp>
wrote:


But there's plenty in the text that makes it absolutely clear that some
things cannot be included. In particular, it says




The term “maximal subpart of an ill-formed subsequence” refers to the code
units that were collected in this manner. They could be the start of a
well-formed sequence, except that the sequence lacks the proper
continuation. Alternatively, the converter may have found a continuation
code unit, which cannot be the start of a well-formed sequence.




And the "in this manner" refers to:



A sequence of code units will be processed up to the point where the
sequence either can be unambiguously interpreted as a particular Unicode
code point or where the converter recognizes that the code units collected
so far constitute an ill-formed subsequence.




So we have the same thing twice: Bail out as soon as something is
ill-formed.



The UTF-8 conversion code that I wrote for ICU, and apparently the code
that various other people have written, collects sequences starting from
lead bytes, according to the original spec, and at the end looks at whether
the assembled code point is too low for the lead byte, or is a surrogate,
or is above 10. Stopping at a non-trail byte is quite natural,


I think nobody is debating that this is *one way* to do things, and that 
some code does it.



and
reading the PRI text accordingly is quite natural too.


So you are claiming that you're covered because you produce an FFFD 
"where the converter recognizes that the code units collected so far 
constitute an ill-formed subsequence", except that your converter is a 
bit slow in doing that recognition?


Well, I guess I could come up with another converter that would be even 
slower at recognizing that the code units collected so far constitute an 
ill-formed subsequence. Would that still be okay in your view?


And please note that your "just a bit slow" interpretation might somehow 
work for Unicode 5.2, but it doesn't work for Unicode 9.0, because over 
the years, things have been tightened up, and the standard now makes it 
perfectly clear that C0 by itself is a maximal subpart of an ill-formed 
subsequence. From Section 3.9 of 
http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf:


>>>>
Applying the definition of maximal subparts
for these ill-formed subsequences, in the first case <C0> is a maximal 
subpart, because that byte value can never be the first byte of a 
well-formed UTF-8 sequence.

>>>>



Aside from UTF-8 history, there is a reason for preferring a more
"structural" definition for UTF-8 over one purely along valid sequences.


There may be all kinds of reasons for doing things one way or another. 
But there are good reasons why the current recommendation is in place, 
and there are even better reasons for not suddenly reversing it to 
something completely different.




This applies to code that *works* on UTF-8 strings rather than just
converting them. For UTF-8 *processing* you need to be able to iterate both
forward and backward, and sometimes you need not collect code points while
skipping over n units in either direction -- but your iteration needs to be
consistent in all cases. This is easier to implement (especially in fast,
short, inline code) if you have to look only at how many trail bytes follow
a lead byte, without having to look whether the first trail byte is in a
certain range for some specific lead bytes.

(And don't say that everyone can validate all strings once and then all
code can assume they are valid: That just does not work for library code,
you cannot assume anything about your input strings, and you cannot crash
when they are ill-formed.)
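The "structural" iteration described in the quoted paragraph can be sketched roughly as follows (an illustration, not ICU's actual code; `next_index` and `prev_index` are hypothetical helper names):

```ruby
# Step over one UTF-8 code unit sequence using only the byte structure:
# the lead byte announces how many trail bytes follow, and we stop early
# at any non-trail byte. No range validation of trail bytes is done.
def next_index(bytes, i)
  b = bytes[i]
  announced = if    b < 0xC0 then 1   # ASCII, or a stray trail byte: one unit
              elsif b < 0xE0 then 2
              elsif b < 0xF0 then 3
              else                4
              end
  j = i + 1
  j += 1 while j < i + announced && j < bytes.size &&
               (bytes[j] & 0xC0) == 0x80
  j
end

def prev_index(bytes, i)
  j = i - 1
  j -= 1 while j > 0 && (bytes[j] & 0xC0) == 0x80  # skip trail bytes backwards
  j
end

bytes = "aé€".bytes              # 1-, 2- and 3-byte sequences
steps = []
i = 0
while i < bytes.size
  n = next_index(bytes, i)
  steps << (n - i)
  i = n
end
p steps                          # => [1, 2, 3]
p prev_index(bytes, bytes.size)  # => 3 (start of the € sequence)
```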


[rest of mail mostly OT]

Well, different libraries may make different choices. As an example, the 
Ruby programming language does essentially that: Whenever it finds an 
invalid string, it raises an exception.


Not all processing on all kinds of invalid strings immediately raises an 
exception (because of efficiency considerations). But there are quite 
strong expectations that this happens soon. As an example, when I 
extended case conversion from ASCII only to Unicode (see e.g. 
http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/, 
http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/), I had to go back 
and fix some things because there were explicit tests checking that 
invalid inputs would raise exceptions.


At least for Ruby, this policy of catching problems early rather than 
allowing garbage-in-garbage-out has worked well.
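A minimal illustration of that fail-fast policy (a sketch, not taken from the Ruby test suite):

```ruby
s = "abc\xFF"                  # FF can never appear in well-formed UTF-8
p s.valid_encoding?            # => false
begin
  s.encode("UTF-16BE")         # strict conversion, no :replace option
rescue Encoding::InvalidByteSequenceError => e
  puts "raised #{e.class}"     # the invalid input is caught, not passed through
end
```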




markus


Regards,   Martin.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Martin J. Dürst via Unicode

On 2017/05/25 09:22, Markus Scherer wrote:

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson <pub...@khwilliamson.com>
wrote:


On 05/24/2017 12:46 AM, Martin J. Dürst wrote:


That's wrong. There was a public review issue with various options and
with feedback, and the recommendation has been implemented and in use
widely (among else, in major programming language and browsers) without
problems for quite some time.



Could you supply a reference to the PRI and its feedback?



http://www.unicode.org/review/resolved-pri-100.html#pri121

The PRI did not discuss possible different versions of "maximal subpart",
and the examples there yield the same results either way. (No non-shortest
forms.)


It is correct that it didn't give any of the *examples* that are under 
discussion now. On the other hand, the PRI is very clear about what it 
means by "maximal subpart":


Citing directly from the PRI:

>>>>
The term "maximal subpart of the ill-formed subsequence" refers to the 
longest potentially valid initial subsequence or, if none, then to the 
next single code unit.

>>>>

At the time of the PRI, so-called "overlongs" were already ill-formed.

That change goes back to 2003 or earlier (RFC 3629 
(https://tools.ietf.org/html/rfc3629) was published in 2003 to reflect 
the tightening of the UTF-8 definition in Unicode/ISO 10646).



The recommendation in TUS 5.2 is "Replace each maximal subpart of an

ill-formed subsequence by a single U+FFFD."



You are right.

http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf shows a slightly
expanded example compared with the PRI.

The text simply talked about a "conversion process" stopping as soon as it
encounters something that does not fit, so these edge cases would depend on
whether the conversion process treats original-UTF-8 sequences as single
units.


No, the text, both in the PRI and in Unicode 5.2, is quite clear. The 
"does not fit" (which I haven't found in either text) is clearly 
grounded by "ill-formed UTF-8". And there's no question about what 
"ill-formed UTF-8" means, in particular in Unicode 5.2, where you just 
have to go two pages back to find byte sequences such as <C0 AF>, <E0 9F 
80>, and <ED A0 80> all called out explicitly as ill-formed.


Any kind of claim, as in the L2/17-168 document, about there being an 
option 2a, is just not substantiated. It's true that there are no 
explicit examples in the PRI that would allow one to distinguish between 
converting e.g.

FC BF BF BF BF 80
to a single FFFD or to six of these. But there's no need to have 
examples for every corner case if the text is clear enough. In the above 
six-byte sequence, there's not a single potentially valid (initial) 
subsequence, so it's all single code units.
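Ruby's transcoder, which follows the recommendation, indeed produces six replacement characters for that sequence (an illustrative check, in the style of the one-liners elsewhere in this thread):

```ruby
# <FC BF BF BF BF 80>: FC is not a valid lead byte (UTF-8 lead bytes stop
# at F4), so there is no potentially valid initial subsequence at all, and
# each of the six bytes is replaced on its own.
out = "\xFC\xBF\xBF\xBF\xBF\x80".encode("UTF-16BE", invalid: :replace)
p out == ("\uFFFD" * 6).encode("UTF-16BE")   # => true
```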




And I agree with that.  And I view an overlong sequence as a maximal
ill-formed subsequence


Can you point to any definition that would include or allow such an 
interpretation? I just haven't found any yet, neither in the PRI nor in 
Unicode 5.2.



that should be replaced by a single FFFD. There's
nothing in the text of 5.2 that immediately follows that recommendation
that indicates to me that my view is incorrect.


I have to agree that the text in Unicode 5.2 could be clearer. It's a 
hodgepodge of attempts at justifications and definitions. And the word 
"maximal" itself may also contribute to pushing the interpretation in 
one direction.


But there's plenty in the text that makes it absolutely clear that some 
things cannot be included. In particular, it says


>>>>
The term “maximal subpart of an ill-formed subsequence” refers to the 
code units that were collected in this manner. They could be the start 
of a well-formed sequence, except that the sequence lacks the proper 
continuation. Alternatively, the converter may have found a 
continuation code unit, which cannot be the start of a well-formed sequence.

>>>>

And the "in this manner" refers to:
>>>>
A sequence of code units will be processed up to the point where the 
sequence either can be unambiguously interpreted as a particular Unicode 
code point or where the converter recognizes that the code units 
collected so far constitute an ill-formed subsequence.

>>>>

So we have the same thing twice: Bail out as soon as something is 
ill-formed.




Perhaps my view is colored by the fact that I now maintain code that was
written to parse UTF-8 back when overlongs were still considered legal
input.


Thanks for providing this information. That's a lot more useful than 
"feels right", which was given as a reason on this list before.




An overlong was a single unit.  When they became illegal, the code
still considered them a single unit.


That's fine for your code. I might do the same (or not) if I were you, 
because one indeed never knows in which situation some code is used, 

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-24 Thread Martin J. Dürst via Unicode

On 2017/05/24 05:57, Karl Williamson via Unicode wrote:

On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:



Adding a "recommendation" this late in the game is just bad standards
policy.



Unless I misunderstand, you are missing the point.  There is already a
recommendation listed in TUS,


That's indeed correct.



and that recommendation appears to have
been added without much thought.


That's wrong. There was a public review issue with various options and 
with feedback, and the recommendation has been implemented and in use 
widely (among else, in major programming language and browsers) without 
problems for quite some time.




There is no proposal to add a
recommendation "this late in the game".


True. The proposal isn't for an addition, it's for a change. The "late 
in the game" however, still applies.


Regards,   Martin.


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-23 Thread Martin J. Dürst via Unicode

Hello Mark,

On 2017/05/22 01:37, Mark Davis ☕️ via Unicode wrote:

I actually didn't see any of this discussion until today.


Many thanks for chiming in.


(unicode@unicode.org mail was going into my spam folder...) I started
reading the thread, but it looks like a lot of it is OT,


As is quite usual on mailing list :-(.


so just scanned
some of them.

A few brief points:

   1. There is plenty of time for public comment, since it was
targeted at *Unicode
   11*, the release for about a year from now, *not* *Unicode 10*, due this
   year.
   2. When the UTC "approves a change", that change is subject to comment,
   and the UTC can always reverse or modify its approval up until the meeting
   before release date. *So there are ca. 9 months in which to comment.*


This is good to hear. What's the best way to submit such comments?


   3. The modified text is a set of guidelines, not requirements. So no
   conformance clause is being changed.
   - If people really believed that the guidelines in that section should
  have been conformance clauses, they should have proposed that at
some point.


I may have missed something, but I think nobody actually proposed to 
change the recommendations into requirements. I think everybody 
understands that there are several ways to do things, and situations 
where one or the other is preferred. The only advantage of changing the 
current recommendations to requirements would be to make it more 
difficult for them to be changed.


I think the situation at hand is somewhat special: Recommendations are 
okay. But there's a strong wish from downstream communities such as Web 
browser implementers and programming language/library implementers to 
not change these recommendations. Some of these communities have 
stricter requirements for alignment, and some have followed longstanding 
recommendations in the absence of specific arguments for something 
different.


Regards,   Martin.


  - And still can propose that — as I said, there is plenty of time.


Mark

On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode <
unicode@unicode.org> wrote:


Henri Sivonen wrote:


I find it shocking that the Unicode Consortium would change a
widely-implemented part of the standard (regardless of whether Unicode
itself officially designates it as a requirement or suggestion) on
such flimsy grounds.

I'd like to register my feedback that I believe changing the best
practices is wrong.


Perhaps surprisingly, it's already too late. UTC approved this change
the day after the proposal was written.

http://www.unicode.org/L2/L2017/17103.htm#151-C19

--
Doug Ewell | Thornton, CO, US | ewellic.org







--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-16 Thread Martin J. Dürst via Unicode

Hello everybody,

[using this mail to in effect reply to different mails in the thread]

On 2017/05/16 17:31, Henri Sivonen via Unicode wrote:

On Tue, May 16, 2017 at 10:22 AM, Asmus Freytag  wrote:



Under what circumstance would it matter how many U+FFFDs you see?


Maybe it doesn't, but I don't think the burden of proof should be on
the person advocating keeping the spec and major implementations as
they are. If anything, I think those arguing for a change of the spec
in face of browsers, OpenJDK, Python 3 (and, likely, "etc.") agreeing
with the current spec should show why it's important to have a
different number of U+FFFDs than the spec's "best practice" calls for
now.


I have just checked (the programming language) Ruby. Some background:

As you might know, Ruby is (at least in theory) pretty 
encoding-independent, meaning you can run scripts in iso-8859-1, in 
Shift_JIS, in UTF-8, or in any of quite a few other encodings directly, 
without any conversion.


However, in practice, incl. Ruby on Rails, Ruby is very much using UTF-8 
internally, and is optimized to work well that way. Character encoding 
conversion also works with UTF-8 as the pivot encoding.


As far as I understand, Ruby does the same as all of the above software, 
based (among else) on the fact that we followed the recommendation in 
the standard. Here are a few examples (sorry for the linebreaks 
introduced by mail software):


$ ruby -e 'puts "\xF0\xaf".encode("UTF-16BE", invalid: :replace).inspect'
#=>"\uFFFD"

$ ruby -e 'puts "\xe0\x80\x80".encode("UTF-16BE", invalid: 
:replace).inspect'

#=>"\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xF4\x90\x80\x80".encode("UTF-16BE", invalid: 
:replace).inspect'

#=>"\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\xfd\x81\x82\x83\x84\x85".encode("UTF-16BE", invalid: 
:replace).inspect'

#=>"\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD"

$ ruby -e 'puts "\x41\xc0\xaf\x41\xf4\x80\x80\x41".encode("UTF-16BE", 
invalid: :replace).inspect'

#=>"A\uFFFD\uFFFDA\uFFFDA"

This is based on http://www.unicode.org/review/pr-121.html as noted at
https://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?revision=56516&view=markup
(for those having a look at these tests, in Ruby's version of 
assert_equal, the expected value comes first (not sure whether this is 
called little-endian or big-endian :-), but this is a decision where the 
various test frameworks are virtually split 50/50 :-(. ))


Even if the above examples and the tests use conversion to UTF-16 (in 
particular the BE variant for better readability), what happens 
internally is that the input is analyzed byte-by-byte. In this case, it 
is easiest to just stop as soon as something is found that is clearly 
invalid (be this a single byte or something longer). This makes a 
data-driven implementation (such as the Ruby transcoder) or one based on 
a state machine (such as http://bjoern.hoehrmann.de/utf-8/decoder/dfa/) 
more compact.


In other words, because we never know whether the next byte is a valid 
one such as 0x41, it's easier to just handle one byte at a time if this 
way we can avoid lookahead (which is always a good idea when parsing).
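To make the byte-at-a-time approach concrete, here is a rough sketch (illustrative only, not Ruby's actual transcoder; `scrub_utf8` is a hypothetical name, and the byte ranges follow Table 3-7 of the standard):

```ruby
# Maximal-subpart replacement: collect trail bytes only while they are
# valid for the current lead byte; a complete sequence is decoded, and
# anything else becomes exactly one U+FFFD per maximal subpart.
def scrub_utf8(bytes)
  out = +""
  i = 0
  while i < bytes.size
    b = bytes[i]
    if b < 0x80
      out << b.chr(Encoding::UTF_8)
      i += 1
      next
    end
    ntrail, first_range =
      case b
      when 0xC2..0xDF             then [1, 0x80..0xBF]
      when 0xE0                   then [2, 0xA0..0xBF]  # excludes overlongs
      when 0xE1..0xEC, 0xEE..0xEF then [2, 0x80..0xBF]
      when 0xED                   then [2, 0x80..0x9F]  # excludes surrogates
      when 0xF0                   then [3, 0x90..0xBF]  # excludes overlongs
      when 0xF1..0xF3             then [3, 0x80..0xBF]
      when 0xF4                   then [3, 0x80..0x8F]  # <= U+10FFFF
      else [0, nil]  # C0, C1, F5..FF, or a stray trail byte
      end
    if ntrail.zero?
      out << "\uFFFD"
      i += 1
      next
    end
    seq = [b]
    ntrail.times do |k|
      nxt = bytes[i + 1 + k]
      range = k.zero? ? first_range : (0x80..0xBF)
      break unless nxt && range.cover?(nxt)
      seq << nxt
    end
    if seq.size == 1 + ntrail
      out << seq.pack("C*").force_encoding(Encoding::UTF_8)
    else
      out << "\uFFFD"   # the whole maximal subpart collapses to one U+FFFD
    end
    i += seq.size
  end
  out
end

p scrub_utf8("A\xC0\xAFB\xF0\xAF".bytes)  # => "A\uFFFD\uFFFDB\uFFFD"
```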


I agree with Henri and others that there is no need at all to change the 
recommendation in the standard that has been stable for so long (close 
to 9 years).


Because the original was done on a PR 
(http://www.unicode.org/review/pr-121.html), I think this should at 
least also be handled as PR (if it's not dropped based on the discussion 
here).


I think changing the current definition of "maximal subsequence" is a 
bad idea, because it would mean that one wouldn't know what one was 
speaking about over the years. If necessary, new definitions should be 
introduced for other variants.


I agree with others that ICU should not be considered to have a special 
status, it should be just one implementation among others.


[The next point is a side issue, please don't spend too much time on 
it.] I find it particularly strange that at a time when UTF-8 is firmly 
defined as up to 4 bytes, never including any bytes above 0xF4, the 
Unicode consortium would want to consider recommending that <FD 81 82 83 
84 85> be converted to a single U+FFFD. I note with agreement that 
Markus seems to have thoughts in the same direction, because the 
proposal (17168-utf-8-recommend.pdf) says "(I suppose that lead bytes 
above F4 could be somewhat debatable.)".



Regards,Martin.


Re: Proposal to add standardized variation sequences for chess notation

2017-04-12 Thread Martin J. Dürst via Unicode

On 2017/04/12 00:44, Philippe Verdy via Unicode wrote:


Some Asian chess boards include also diagonal lines or dots on top of their
crossing (notably 9x9 boards are subdivided into nine 3x3 subgroups by such
dots). These chess boards do not alternate white and black "squares" ;
beside this, the cells may also be rectangular (longer vertically than
horizontally)


[mostly OT]

On Go boards, the grid cells are definitely rectangular, not square. The 
reason for this is that boards are usually looked at at an angle, and 
having the cells be higher than wide makes them appear (close to) 
square. However, because diagrams are usually viewed at close to a right 
angle, Go diagrams use squares, not rectangles.


Regards,   Martin.


Re: Unicode vs. Unikod

2017-04-10 Thread Martin J. Dürst via Unicode

Hello Janusz,

I think you should report this problem to 
http://www.unicode.org/reporting.html. That way, it gets tracked 
appropriately. This list is for discussion, not for bug fixes.


Regards,   Martin.

On 2017/04/10 18:54, Janusz S. Bień wrote:


This is a long overdue issue, but better late than never.

To make a long story short, I think that the word "Unikod" should not
be used in the Polish translation of "What is Unicode":

http://www.unicode.org/standard/translations/polish.html

The word "Unikod", to the best of my knowledge, has been coined long
ago by Piotr Trzcionkowski, who registered also the domain
www.unikod.pl used to advocate Unicode in Poland.

"Unicode" as a trademark should not be translated, for me this is
quite obvious.

This is actually the case in almost all language versions of "What is
Unicode" using the Latin script (with the exception of Esperanto and
Lithuanian, where it can be probably justified by some grammar rules,
and Polish). This is also seems to be the case in various language
versions of Wikipedia (I've checked only some of them) with the
exception of the Polish one which uses "Unikod" as the primary entry.

The occurrence of "Unikod" on the Unicode site may be interpreted as an
official acceptance of this equivalent. I hope this is not the case.  I
would like to clarify the matter before engaging myself in a
discussion about introducing the "Unicode" primary entry in Polish
Wikipedia.

You can check the usage of "Unicode" and "Unikod" in Polish not only
with Google but also in the National Corpus of Polish:

http://nkjp.pl/

There are 786 occurrences of "Unicode" coming mainly from published
books and 102 occurrences of "Unikod", mainly in Usenet postings and
Wikipedia discussions.

Grammatical Dictionary of Polish contains only "unicode":

http://sgjp.pl/leksemy/#73537/unicode

Polish versions of Windows use "Unicode". A localization dictionary

http://www.btinfodictionary.com/

also preserves "Unicode".

Actually I don't mind using "Unikod" (or better "unikod") informally,
as e.g. spelling "Unikodem" is simpler than "Unicode'em" (instrumental
singular) and "Unikodzie" looks better in Polish than "Unicodzie"
(locative singular), moreover there is no doubt how to pronounce it.
This is probably the reason why, to my surprise, the word was
introduced also in some other Slavonic languages, e.g.
https://en.wiktionary.org/wiki/Unikod.

My point is only that both "What is Unicode" and Polish Wikipedia
primary entry should use the original spelling.

Best regards

Janusz



--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan


Re: Standaridized variation sequences for the Desert alphabet?

2017-04-06 Thread Martin J. Dürst

Hello Michael,

[I started to write this mail quite some time ago. I decided to try to 
let things cool down a bit by waiting a day or two, but it has become 
more than a week now.]


On 2017/03/29 22:08, Michael Everson wrote:

Martin,

It’s as though you’d not participated in this work for many years, really.


Well, looking back, my time commitment to Unicode has definitely varied 
over the years. But that might be true for everybody.


What's more important is that Unicode covers such a wide range of areas, 
and not everybody has the same experience or knowledge. If we did, we 
wouldn't need to work together; it would be okay to just have one of us. 
Indeed, what's really very valuable and interesting in this work is the 
many very varied backgrounds and experiences everybody has.


In addition to variations in background, we also have a wide variety of 
ways of thinking, e.g. ranging from abstract to concrete, and so on.




On 29 Mar 2017, at 11:12, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:



- That suggests that IF this script is in current use,


You don’t even know? You’re kidding, right?


Everything is relative. And without being part of the user community, 
it's difficult to make any guesses.




- As far as we have heard (in the course of the discussion, after questioning 
claims made without such information), it seems that:


Yeah, it doesn’t “seem” anything but a whole lot of special pleading to bolster 
your rigid view that the glyphs in question can be interchangeable because of 
the sounds they may represent.


I don't remember ever claiming that the glyphs must be used 
interchangeably, only that we should carefully examine whether they are 
or not, and that because they represent the same sound (in a phonetic 
alphabet, as it is) and are shown in the same position in alphabet 
tables, we shouldn't a priori exclude such a possibility.




 - There may not be enough information to understand how the creators and early 
users of the script saw this issue,


Um, yeah. As if there were for Phoenician, or Luwian hieroglyphs, right?


Well, there's well over an order of magnitude difference in the time 
scales involved. The language that Deseret is used to write is still in 
active use, including in this very discussion. Quite different from 
Phoenician or Luwian hieroglyphs.


In addition, we have meta-information such as alphabet tables, which we 
may not have for the scripts you mention, as well as the fact that 
printing technology may have forced a better identification of what's a 
character and what not than inscriptions and other older technologies.




 - Similarly, there seem to be not enough modern practitioners of the script 
using the ligatures that could shed any light on the question asked in the 
previous item in a historical context,


Completely irrelevant. Nobody worried about the number of modern users of the 
Insular letters we encoded. Why put such a constraints on users of Deseret? Ꝺꝺ 
Ꝼꝼ Ᵹᵹ Ꝿ Ꞃꞃ Ꞅꞅ Ꞇꞇ.


Because it's modern users, and future users, not users some hundred 
years or so ago, that will use the encoding. In the case of Insular 
letters, my guess is that nobody wants to translate/transcribe xkcd, for 
example, whereas there is such a transcription for Deseret:

http://www.deseretalphabet.info/XKCD/


first apparently because there are not that many modern practitioners at all, 
and second because modern practitioners seem to prefer spelling with individual 
letters rather than using the ligatures.


This is equally ridiculous. John Jenkins chooses not write the digraphs in the 
works which he transcribed, because that’s what *he* chooses. He doesn’t speak 
for anyone else who may choose to write in Deseret, and your assumption that 
“modern practitioners” do this is groundless.


You wrote:



Most readers and writers of Deseret today use the shapes that are in 
their fonts, which are those in the Unicode charts, and most texts 
published today don’t use the EW and OI ligatures at all, because that’s 
John Jenkins’ editorial practice.




So I was wrong to write "modern practitioners", and should have written 
"modern publishers" or "modern published texts". Or is the impression 
that I get from what you wrote above wrong that most texts published 
these days are edited by John, or by people following his practice?




It also ignores the fact that the script had a reform and that the value of 
separate encodings for the various characters is of value to those studying the 
provenance and orthographic practices of those who wrote Deseret when it was in 
active use.


I don't remember denying the value of separate encodings for historic 
research. I only wanted to make sure that present-day use isn't 
inconvenienced to make historic research easier. If the claims are 
correct that present-day usage is mostly a reconstruction based on the 
Unicode encoding and the Unicode sample glyphs, then I'm 

Re: Proposal to add standardized variation sequences for chess notation

2017-04-05 Thread Martin J. Dürst

On 2017/04/05 23:49, Michael Everson wrote:


Oh, here is the answer to your question. It took me 15 seconds to change the 
background and text colour in Quark XPress. It has nothing to do with the 
proposal for variation sequences.

http://evertype.com/standards/unicode-list/looking-glass-yellow-blue.png


[OT]
It looks neat. But I noticed three very small gaps in each of the top 
and bottom borders. Also, it's probably not the best choice of colors, 
because my eyes tend to associate the yellow figures with white, and the 
blue ones with black, but thinking it through makes it clear that it's 
the other way round.


Regards,   Martin.


Re: Proposal to add standardized variation sequences for chess notation

2017-04-04 Thread Martin J. Dürst

On 2017/04/03 23:41, Kent Karlsson wrote:


Hence the chess board lines should be displayed in a strong left-to-right
context (either via bidi markup characters, or via some higher order
bidi markup mechanism, such as the "bidi" attribute in HTML). Though in
most cases (not Arabic/Hebrew/... document), the bidi context will default
to left-to right...


There never was a "bidi" attribute in HTML. You probably mean the "dir" 
attribute.


Regards,   Martin.


Re: Proposal to add standardized variation sequences for chess notation

2017-04-03 Thread Martin J. Dürst

On 2017/04/03 01:27, Richard Wordingham wrote:


We seem to agree that it should be a graphic modification, rather than
a semantic modification.  The question I pose is, "Is it just a
graphic modification in this case?".  I'm not convinced that it is.  A
player starts with two non-interchangeable bishops.  <WHITE CHESS BISHOP, VS1>
could only refer to the white bishop that is restricted to black squares.
That's a semantic difference.


That applies only to the bishop, and only in standard chess and those 
chess variants that keep the same restrictions. It's easily possible to 
imagine or invent variants where bishops can move differently, and it 
would be weird to use a semantic difference (e.g. different characters) 
for bishops, but a variant selector for other pieces. Also it would be 
weird to try e.g. to "semantically" distinguish the two rooks, even if 
they are two different actual chess pieces on an actual board.




The immediate parallel that comes to mind is the ideographic square.  A
sequence of CJK ideographs should be a monospace sequence - and that is
the major point of most of the ASCII clones with 'IDEOGRAPHIC' or
'FULLWIDTH' in their names.  The uniform width is a key part of the
semantics of the sequences being discussed.


The full width/half width distinction mostly is a legacy (roundtrip) issue.

Regards,   Martin.


Re: Unicode Emoji 5.0 characters now final

2017-03-30 Thread Martin J. Dürst

On 2017/03/30 06:17, Christoph Päper wrote:

Mark Davis ☕️ :



That isn't really the case. In particular, vendors can propose adding
additional subdivisions to the recommended list.


Awesome, "vendors" can do that. (._.m)

If I made an open-source emoji font that contained flags for all of the 5000ish
ISO 3166-2 codes that actually map to one, would I automatically be considered a
vendor?


I don't think so. But if you want to get more flags listed, then 
creating actual flags, with suitable licenses, and telling others to use 
them and tell others, and so on, may easily reach vendors sooner or later.




- 
- 
- 
-  <-


The last one currently already has support for UK countries, US states and
Canadian provinces. Go figure.


And most if not all of these flags are from Wikimedia. So that shows 
that open source has some influence, even without money.


Regards,   Martin.


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-29 Thread Martin J. Dürst

Hello everybody,

Let me start with a short summary of where I think we are at, and how we 
got there.


- The discussion started out with two letters,
  with two letter forms each. There is explicit talk of the
  40-letter alphabet and glyphs in the Wikipedia page, not
  of two different letters.
- That suggests that IF this script is in current use, and the
  shapes for these diphthongs are interchangeable (for those
  who use the script day-to-day, not for meta-purposes such
  as historic and typographic texts), keeping things unified
  is preferable.
- As far as we have heard (in the course of the discussion,
  after questioning claims made without such information),
  it seems that:
  - There may not be enough information to understand how the
creators and early users of the script saw this issue,
on a scale that may range between "everybody knows these
are the same, and nobody cares too much who uses which,
even if individual people may have their preferences in
their handwriting" to something like "these are different
choices, and people wouldn't want their texts be changed
in any way when published".
  - Similarly, there seem to be not enough modern practitioners
of the script using the ligatures that could shed any
light on the question asked in the previous item in a
historical context, first apparently because there are not
that many modern practitioners at all, and second because
modern practitioners seem to prefer spelling with
individual letters rather than using the ligatures.
- IF the above is true, then it may be that these ligatures
  are mostly used for historic purposes only, in which case
  it wouldn't do any harm to present-day users if they were separated.

If the above is roughly correct, then it's important that we reached 
that conclusion after explicitly considering the potential of a split to 
create inconvenience and confusion for modern practitioners, not after 
just looking at the shapes only, coming up with separate historical 
derivations for each of them, and deciding to split because history is 
way more important than modern practice.


In that light, some more comments lower down.

On 2017/03/28 22:56, Michael Everson wrote:

On 28 Mar 2017, at 11:39, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:



An æ ligature is a ligature of a and of e. It is not some sort of pretzel.


Yes. But it's important that we know that because we have been faced 
with many cases where "æ" and "ae" were used interchangeably. For 
somebody not knowing the (extended) Latin alphabet and its usages, they 
might easily see more of a pretzel and less of 'a' and 'e'. I might try 
some experiments with some of my students (although I'm using "formulæ" 
in my lecture notes, and so they might already be too familiar with the 
"æ").


Also, if it were the case that shapes like "æ" and "œ" were used 
interchangeably across all uses of the Latin alphabet, I'm quite sure we 
would encode it with one code point rather than two, even if some 
researchers might claim that the later was derived from an "o" rather 
than an "ɑ", or even if we knew it was derived from an "o" (as we know 
for the ß).




What Deseret has is this:

10426 DESERET CAPITAL LETTER LONG OO WITH STROKE
* officially named “ew” in the code chart
* used for ew in earlier texts
10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE
* officially named “oi” in the code chart
* used for oi in earlier texts
1 DESERET CAPITAL LETTER LONG AH WITH STROKE
* used for oi in later texts
1 DESERET CAPITAL LETTER SHORT OO WITH STROKE
* used for ew in later texts


Currently, it has this:

10426 Ц DESERET CAPITAL LETTER OI

10427 Ч DESERET CAPITAL LETTER EW

My personal opinion is that names are mostly hints, and not too much 
should be read into them, but if anything, the names in the current 
charts would suggest that the encoding is for the 39th/40th letter of 
the Deseret alphabet, whatever its shape, not for some particular shape.


And you know as well as I do that we can't change names. So if we split, 
we might end up with something like:


10426 Ц DESERET CAPITAL LETTER OI

10427 Ч DESERET CAPITAL LETTER EW

1 <ЃІ> DESERET CAPITAL LETTER VARIANT OI

1 <ІЋ> DESERET CAPITAL LETTER VARIANT EW
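For reference, the two existing assignments can be checked against the 
Unicode Character Database. A minimal sketch using Python's bundled data 
(the code points are the ones named in the listing above):

```python
import unicodedata

# The two Deseret ligature letters as currently encoded: the names
# refer to the diphthongs, not to any particular (1855 or 1859) shape.
for cp in (0x10426, 0x10427):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
# U+10426 DESERET CAPITAL LETTER OI
# U+10427 DESERET CAPITAL LETTER EW
```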



Don’t go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH STROKE 
are glyph variants of the same character.

Don’t go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH STROKE 
are glyph variants of the same character.


We have just established that there are no characters with such names in 
the standard. It's not the names or the history that I'm arguing.




To do so is to show no understanding of the history of writing systems at all.


What I'd agree to is that cases where shapes with diff

Re: Standaridized variation sequences for the Desert alphabet?

2017-03-28 Thread Martin J. Dürst

On 2017/03/29 01:47, Philippe Verdy wrote:

2017-03-28 18:30 GMT+02:00 Asmus Freytag :


On 3/28/2017 6:56 AM, Michael Everson wrote:


An æ ligature is a ligature of a and of e. It is not some sort of pretzel.


We need a pretzel emoji.


We need a broken tooth emoji too !


I prefer soft pretzels!

Regards,   Martin.


Re: Unicode Emoji 5.0 characters now final

2017-03-28 Thread Martin J. Dürst

Hello Doug,

On 2017/03/29 03:41, Doug Ewell wrote:


If this story sounds vaguely familiar to old-timers, it's exactly the
path that was followed the last time Plane 14 tag characters were under
discussion, between 1998 and 2000: someone wrote an RFC to embed
language tags in plain text using invalid UTF-8 sequences; Unicode
responded by introducing a proper, conformant mechanism to use Plane 14
characters instead; then the conformant replacement mechanism itself was
deprecated and users were told to use out-of-band tagging, exactly what
the original RFC sought to avoid.


I think there is some missing information here. First, the original 
proposal that used invalid UTF-8 sequences was never an RFC, only an 
Internet Draft. More importantly, the protocol that motivated all this 
work (ACAP) never went anywhere, nor did any other use of the plane 14 
language tag characters gain any significant traction. That led to 
deprecation, because it would have been a bad idea to let people think 
that the information in these tags would actually be used.


For some people (including me), that was always seen as the likely 
outcome; the language tag characters were mostly introduced as a 
defensive mechanism (way better than invalid UTF-8) rather than 
something we hoped everybody would jump on. Putting them on plane 14 
(which meant that it would be four bytes for each character, and 
therefore quite a lot of bytes for each tag) was part of that message.




"Not recommended," "not standard," "not interoperable," or any other
term ESC settles on for the 5000+ valid flag sequences that are not
England, Scotland, and Wales is just a short, easy step away from
deprecation for these as well.


I think the situation is vastly different here. First, the Consortium 
never officially 'activated' any subdivision flags, so it would be 
impossible to deprecate them. Second, we already see some pressure (on 
this list) to 'recommend' more of these, and I guess the vendors and the 
Consortium will give in to this pressure, even if slowly and to some 
extent quite reluctantly. It's anyone's bet in what time frame and order 
e.g. the flags of California and Texas will be 'recommended'. But I have 
personally no doubt that these (and quite a few others) will eventually 
make it, even if I have mixed feelings about that.


Regards,   Martin.


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-28 Thread Martin J. Dürst

On 2017/03/27 21:59, Michael Everson wrote:

On 27 Mar 2017, at 08:05, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:


Consider 2EBC ⺼ CJK RADICAL MEAT and 2E9D ⺝ CJK RADICAL MOON which are 
apparently really supposed to have identical glyphs, though we use an 
old-fashioned style in the charts for the former. (Yes, I am of course aware 
that there are other reasons for distinguishing these, but as far as glyphs go, 
even our standard distinguishes them artificially.)


"apparently", maybe. Let's for a moment leave aside the radicals themselves, 
which are to a large extent artificial constructs.


I do stipulate not being a CJK expert. But those are indeed different due to 
their origins, however similar their shapes are.


Except for the radicals themselves, I haven't found a contrasting pair. 
What I think we would need to find to influence the current 
argumentation (beyond the general point that history is important, on 
which I think we all agree) is a case of a character that originally 
existed both with a MEAT radical and a MOON radical, but has only a 
single usage. Then whether there were one or two code points would 
provide an analog for the situation we have at hand.


Also note that there is a difference in meaning. The characters with 
MEAT radicals mostly refer to body parts and organs. The characters with 
MOON radicals are mostly time-related.




Let's look at the actual characters with these radicals (e.g. U+6709,... for MOON and 
U+808A,... for MEAT), in the multi-column code charts of ISO 10646. There are some 
exceptions, but in most cases, the G/J/K columns show no difference (i.e. always the ⺝ 
shape, with two horizontal bars), whereas the H/T/V columns show the ⺼ shape (two 
downwards slanted bars) for the "MEAT" radical and the ⺝ shape for the moon 
radical. So whether these radicals have identical glyphs depends on typographic 
tradition/font/…


They are still always very similar, right?


Similarity is in the eye of the beholder (or the script).

Sometimes, a little dot or hook is irrelevant. Sometimes it's the single 
difference that makes it a totally different character.




In Japan, many people may be rather unaware of the difference, whereas in 
Taiwan, it may be that school children get drilled on the difference.


That’s interesting.


Not necessarily for the poor Taiwanese students, and not necessarily for 
the Japanese who try to find a character in a dictionary ordered by 
radical :-(.




Changing to a different font in order to change one or two glyphs is a 
mechanism that we have actually rejected many times in the past. We have 
encoded variant and alternate characters for many scripts.


Well, yes, rejected many times in cases where that was appropriate. But also 
accepted many times, in cases that we may not even remember, because they may 
not even have been made explicitly.


Do come up with examples if you have any.


I had the following in mind:


The roman/italic a/ɑ and g/ɡ distinctions (the latter code points only used to 
show the distinction in plain text, which could as well be done descriptively),


Aa and Ɑɑ are used contrastively for different sounds in some languages and in the IPA. Ɡɡ 
is not, to my knowledge, used contrastively with Gg (except that ɡ can only mean /ɡ/, while 
orthographic g can mean /ɡ/, /dʒ/, /x/ etc. But g vs ɡ is reasonably analogous to Ц and 
ЃІ being used for /juː/.


The contrastive use *in some languages or notations* (IPA) is the reason 
these are separately encoded. The fact that these are not contrastively 
used in most major languages is responsible for the fact that they don't 
use different code points when used in these languages. It would be a 
real hassle to have to change from g to ɡ when switching e.g. from Times 
Roman to Times Italic.


In Deseret, we are still missing any contrastive usage, so that suggests 
to be careful with encoding.




as well as a large number of distinctions in Han fonts, come to my mind.


It's difficult to show these distinctions, because they are NOT 
separately encoded, but three-stroke and four-stroke grass radical is 
the most well known.




And the same goes for the /juː/ ligatures. The word tube /tjuːb/ can be written TYŪB 
ГЏЅВ or ГЧВ or Г<ІЋ>В. But the unligated sequences would be pronounced 
differently: ГЏЅВ /tjuːb/ and ГІЅВ /tɪuːb/ and ГІЋВ /tɪʊb/.


Ah, I see. So we seem to have five different ways (counting the two 
ligature variants) of writing the same word, with three different 
pronunciations. The important question is whether the two ligatures do 
imply any difference in pronunciation (as opposed to time of writing or 
author/printer preference), i.e. whether the ligated sequences ГЧВ or 
Г<ІЋ>В are pronounced differently (not by a phonologist but by an 
average user).




Is the choice of variant up to the author (for which variants), or is it the 
editor or printer who makes the choice (for which variants)?



Re: Standaridized variation sequences for the Desert alphabet?

2017-03-28 Thread Martin J. Dürst

On 2017/03/28 01:20, Michael Everson wrote:


Ken transcribes into modern type a letter by Shelton dated 1859, in which “boy” is written В<ЃІ>, 
“few” as Й<ІЋ>, “truefully” [sic] as ГС<ІЋ>ЙЋТІ, and “you” as Џ<ІЋ>.


These are all 1859 variants, yes? That would just show that these 
variants existed (which I think nobody in this discussion has doubted), 
but not that there was contrasting use. And is that letter hand-written 
or printed?


Regards,   Martin.


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-28 Thread Martin J. Dürst

On 2017/03/28 01:49, Michael Everson wrote:


Sorry, but typographic control of that sort is grand for typesetting, where you 
can select ranges of text and language-tag it (assuming your program accepts 
and supports all the language tags you might need (which they don’t)) and you 
can select fonts which have all the trickery baked into them (hardly any do) 
and then… can you use this in file names? In your plain-text databases? In your 
text messages?


Do you think that the 1855/1859 distinction is needed in file names? In 
text messages? It may help in some kinds of databases, but it may also 
be possible to just tag each piece of text in the database with "1855" 
or "1859" if that distinction is important (e.g. for historical 
documents). As far as I understand, we are still looking for actual 
texts that use both shapes of the same ligature concurrently.


Regards,   Martin.


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-28 Thread Martin J. Dürst

I agree with Alastair.

The list of font technology options was mostly to show that there are 
already a lot of options (some might even say too many), so font 
technology doesn't really limit our choices.


Regards,   Martin.

On 2017/03/27 23:04, Alastair Houghton wrote:

On 27 Mar 2017, at 10:14, Julian Bradfield  wrote:


I contend, therefore, that no decision about Unicode should take into
account any ephemeral considerations such as this year's electronic
font technology, and that therefore it's not even useful to mention
them.


I’d disagree with that, for two reasons:

1. Unicode has to be usable *today*; it’s no good designing for some kind of 
hyper-intelligent AI-based font technology a thousand years hence, because we 
don’t have that now.  If it isn’t usable today for any given purpose, people 
won’t use it for that, and will adopt alternative solutions (like using images 
to represent text).

2. “This year’s electronic font technology” is actually quite powerful, and is 
unlikely to be supplanted by something *less* powerful in future.  There is an 
argument about exactly how widespread support for it is (for instance, simple 
text editors are clearly lacking in support for stylistic alternates, except 
possibly on the Mac where there’s built-in support in the standard text edit 
control), but again I think it’s reasonable to expect support to grow over 
time, rather than being removed.

I don’t think it’s unreasonable, then, to point out that mechanisms like 
stylistic or contextual alternates exist, or indeed for that knowledge to 
affect a decision about whether or not a character should be encoded, *bearing 
in mind* the likely direction of travel of font and text rendering support in 
widely available operating systems.

All that said, I’d definitely defer to others on the subject of whether or not 
Unicode needs the Deseret characters being discussed here.  That’s very much 
not my field.

Kind regards,

Alastair.

--
http://alastairs-place.net


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-28 Thread Martin J. Dürst

Hello Michael, others,

On 2017/03/27 21:07, Michael Everson wrote:

On 27 Mar 2017, at 06:42, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:


The characters in question have different and undisputed origins, undisputed.


If you change that to the somewhat more neutral "the shapes in question have 
different and undisputed origins", then I'm with you. I actually have said as much 
(in different words) in an earlier post.


And what would the value of this be? Why should I (who have been doing this for 
two decades) not be able to use the word “character” when I believe it correct? 
Sometimes you people who have been here for a long time behave as though we had 
no precedent, as though every time a character were proposed for encoding it’s 
as thought nothing had ever been encoded before.


I didn't say that you have to change words. I just said that I could 
agree to a slightly differently worded phrase.


And as for precedent, the fact that we have encoded a lot of characters 
in Unicode doesn't mean that we can encode more characters without 
checking each and every single case very carefully, as we are doing in 
this discussion.




The sharp s analogy wasn’t useful because whether ſs or ſz users can’t tell 
either and don’t care.


Sorry, but that was exactly the point of this analogy. As to "can't 
tell", it's easy to ask somebody to look at an actual ß letter and say 
whether the right part looks more like an s or like a z. On the other 
hand, users of Deseret may or may not ignore the difference between the 
1855 and 1859 shapes when they read. Of course they will easily see 
different shapes, but what's important isn't the shapes, it's what they 
associate it with. If for them, it's just two shapes for one and the 
same 40th letter of the Deseret alphabet, then that is a strong 
suggestion for not encoding separately, even if the shapes look really 
different.




No Fraktur fonts, for instance, offer a shape for U+00DF that looks like an ſs. 
And what Antiqua fonts do, well, you get this:

https://en.wikipedia.org/wiki/%C3%9F#/media/File:Sz_modern.svg


Yes. And we are just starting to collect evidence for Deseret fonts.



And there’s nothing unrecognizable about the ſɜ (< ſꝫ (= ſz)) ligature there.


Well, not to somebody used to it. But non-German users quite often use a 
Greek β where they should use a ß, so it's no surprise people don't 
distinguish the ſs and ſz derived glyphs.




The situation in Deseret is different.


The graphic difference is definitely bigger, so to an outsider, it's 
definitely quite impossible to identify the pairs of shapes. But that 
does in no way mean that these have to be seen as different characters 
(rather than just different glyphs) by insiders (actual users).


To use another analogy, many people these days (me included) would have 
difficulties identifying Fraktur letters, in particular if they show up 
just as individual letters. Similar for many fantasy fonts, and for 
people not very familiar with the Latin script.




Underlying ligature difference is indicative of character identity. 
Particularly when two resulting ligatures are SO different from one another as 
to be unrecognizable. And that is the case with EW on the left and OI on the 
right here:
https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg

The lower two letterforms are in no way “glyph variants” of the upper two 
letterforms. Apart from the stroke of the SHORT I І they share nothing in 
common — because they come from different sources and are therefore different 
characters.


The range of what can be a glyph variant is quite wide across scripts 
and font styles. Just that the shapes differ widely, or that the origin 
is different, doesn't make this conclusive.




Character origin is intimately related to character identity.


In most cases, yes. But it's not a given conclusion.



I don’t think that ANY user of Deseret is all that “average”. Certainly some 
users of Deseret are experts interested in the script origin, dating, 
variation, and so on — just as we have medievalists who do the same kind of 
work. I’m about to publish a volume full of characters from Latin Extended-D. 
My work would have been impossible had we not encoded those characters.


No, your work wouldn't be impossible. It might be quite a bit more 
difficult, but not impossible. I have written papers about Han 
ideographs and Japanese text processing where I had to create my own 
fonts (8-bit, with mostly random assignments of characters because these 
were one-off jobs), or fake things with inline bitmap images (trying to 
get information on the final printer resolution and how many black 
pixels wide a stem or crossbar would have to be to avoid dropouts, and 
not being very successful).


I have heard the argument that some character variant is needed because 
of research, history,... quite a few times. If

Re: Standaridized variation sequences for the Desert alphabet?

2017-03-28 Thread Martin J. Dürst

On 2017/03/28 01:03, Michael Everson wrote:

On 27 Mar 2017, at 16:56, John H. Jenkins  wrote:



The 1857 St Louis punches definitely included both the 1855 EW Ч and the 1859 OI 
<ЃІ>. Ken Beesley shows them in smoke proofs in his 2004 paper on Metafont.


Good to have some actual examples. However, the example at hand does, as 
far as I understand it, not necessarily support separate encoding.


While it mixes 1855 and 1859, it contains only one of the ligature 
variants each. Indeed, it could be taken as support for the theory that 
the top and bottom row ligatures in 
https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg 
were used interchangeably, and that the 1857 St Louis punches just made 
one particular choice of glyph selection.


What would give a strong argument would be the *concurrent* existence of 
*corresponding* ligatures in the same font, or the concurrent (even 
better, contrasting) use of corresponding ligatures in the same text.


Regards,   Martin.

What's interesting (weird?) is that the "1859" OI <ЃІ> appears in 1857 
punches. Time travel? Or is the label "1859" a misnomer or just a 
convention?


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-27 Thread Martin J. Dürst

On 2017/03/24 23:37, Michael Everson wrote:

On 24 Mar 2017, at 11:34, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:


On 2017/03/23 22:48, Michael Everson wrote:


Indeed I would say to John Jenkins and Ken Beesley that the richness of the 
history of the Deseret alphabet would be impoverished by treating the 1859 
letters as identical to the 1855 letters.


Well, I might be completely wrong, but John Jenkins may be the person on this 
list closest to an actual user of Deseret (John, please correct me if I'm wrong 
one way or another).


He is. He transcribes texts into Deseret. I’ve published three of them (Alice, 
Looking-Glass, and Snark).


Great to know. Given that, I'd assume that you'd take his input a bit 
more serious. Here's what he wrote:


>>>>
My own take on this is "absolutely not." This is a font issue, pure and 
simple. There is no dispute as to the identity of the characters in 
question, just their appearance.


In any event, these two letters were never part of the "standard" 
Deseret Alphabet used in printed materials. To the extent they were 
used, it was in hand-written material only, where you're going to see a 
fair amount of variation anyway. There were also two recensions of the 
DA used in printed materials which are materially different, and those 
would best be handled via fonts.


It isn't unreasonable to suggest we change the glyphs we use in the 
Standard. Ken Beesley and I have discussed the possibility, and we 
both feel that it's very much on the table.

>>>>



It may be that actual users of Deseret read these character variants the same 
way most of us would read serif vs. sans-serif variants: I.e. unless we are 
designers or typographers, we don't actually consciously notice the difference.


I am a designer and typographer, and I’ve worked rather extensively with a 
variety of Deseret fonts for my publications. They have been well-received.


That's fine, and not disputed at all. That's exactly why I'm looking for 
input from other people.


As an analogy, assume we had a famous type designer coming to this list 
and request that we encode old-style digits separately from roman 
digits, e.g. arguing that this might simplify the production of fonts.


We would understand this request, but we would still deny it because, 
based on our day-to-day use of digits, we would understand that at large 
(i.e. for the average user) the convenience of having only one code 
point for a given digit outweighs the convenience of separate code 
points for the type designer.


We are looking for similar input from "average users" for Deseret.



If that's the case, it would be utterly annoying to these actual users to have 
to make a distinction between two characters where there actually is none.


Actually neither of the ligature-letters are used in our Carrollian Deseret 
volumes.


Ok. That means that these don't provide any information on the 
discussion at hand (whether to unify or disunify the ligature shapes).




The richness of the history of the Deseret alphabet can still be preserved e.g. 
with different fonts the same way we have thousands of different fonts for 
Latin and many other scripts that show a lot of rich history.


You know, Martin, I *have* been doing this for the last two decades. I’m well 
aware of what a font is and can do.


Great. So you know that present-day font technology would allow us to 
handle the different shapes in at least any of the following ways:


1) Separate characters for separate shapes, both shapes in same font
2) Variant selectors, one or both shapes in same font
3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font
4) Font selection, different fonts for different shapes

Does that knowledge in any way suggest one particular solution?
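As an illustration of option 2 only: no variation sequence is actually 
defined for these letters, but if one were, it would be encoded as the 
base character followed by a variation selector, e.g.:

```python
# Purely hypothetical: U+10426 followed by VARIATION SELECTOR-1 as a
# way of requesting one specific (say, 1855) glyph. No such sequence
# is registered in the Unicode standard.
base = "\U00010426"   # DESERET CAPITAL LETTER OI
vs1 = "\uFE00"        # VARIATION SELECTOR-1
sequence = base + vs1
print([f"U+{ord(c):04X}" for c in sequence])  # ['U+10426', 'U+FE00']
```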



I’m also aware of what principles we have used for determining character 
identity.


Which, as we have been working out in other mails, are indeed a 
collection of principles, one of which is history of shape derivation.




I saw your note about CJK. Unification there typically has something to do with 
character origin and similarity. The Deseret diphthong letters are clearly 
based on ligatures of *different* characters.


One of the principles of CJK unification is that minor differences are 
ignored if they are not semantically relevant. For CJK, 'minor' is 
important, because otherwise, many users wouldn't be able to recognize 
the shapes as having the same semantics/usage.


The qualification 'minor' is less important for an alphabet. In general, 
the more established and well-known an alphabet is, the wider the 
variations of glyph shapes that may be tolerated. The question I'm 
trying to get an answer for for Deseret is whether current actual script 
users see the shape variation as just substitutable glyphs of the same 
letter, or inherently different letters.


The answer to this question is not 

Re: Standaridized variation sequences for the Desert alphabet?

2017-03-27 Thread Martin J. Dürst

On 2017/03/27 01:20, Michael Everson wrote:

On 26 Mar 2017, at 16:45, Asmus Freytag  wrote:



Consider 2EBC ⺼ CJK RADICAL MEAT and 2E9D ⺝ CJK RADICAL MOON which are 
apparently really supposed to have identical glyphs, though we use an 
old-fashioned style in the charts for the former. (Yes, I am of course aware 
that there are other reasons for distinguishing these, but as far as glyphs go, 
even our standard distinguishes them artificially.)


"apparently", maybe. Let's for a moment leave aside the radicals 
themselves, which are to a large extent artificial constructs. Let's 
look at the actual characters with these radicals (e.g. U+6709,... for 
MOON and U+808A,... for MEAT), in the multi-column code charts of ISO 
10646. There are some exceptions, but in most cases, the G/J/K columns 
show no difference (i.e. always the ⺝ shape, with two horizontal bars), 
whereas the H/T/V columns show the ⺼ shape (two downwards slanted bars) 
for the "MEAT" radical and the ⺝ shape for the moon radical. So whether 
these radicals have identical glyphs depends on typographic 
tradition/font/... In Japan, many people may be rather unaware of the 
difference, whereas in Taiwan, it may be that school children get 
drilled on the difference.
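The two radicals can be inspected directly; a minimal sketch showing 
that they are distinct code points even where fonts render them with 
identical glyphs:

```python
import unicodedata

# U+2E9D and U+2EBC: separately encoded radicals whose glyphs
# coincide in some typographic traditions, as discussed above.
for cp in (0x2E9D, 0x2EBC):
    print(f"U+{cp:04X} {chr(cp)} {unicodedata.name(chr(cp))}")
# U+2E9D ⺝ CJK RADICAL MOON
# U+2EBC ⺼ CJK RADICAL MEAT
```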




One practical consequence of changing the chart glyphs now, for instance, would 
be that it would invalidate every existing Deseret font. Adding new characters 
would not.


Independent of whether the chart glyphs get changed, couldn't we just 
add a note "also # in some fonts" (where # is the other variant). That 
would make sure that nobody could claim "this font is wrong" based on 
the charts. (Even if a general claim that the chart glyphs aren't 
normative applies to all charts anyway.)




In fact, it would seem that if a Deseret text was encoded in one of the two 
systems, changing to a different font would have the attractive property of 
preserving the content of the text (while not preserving the appearance).


Changing to a different font in order to change one or two glyphs is a 
mechanism that we have actually rejected many times in the past. We have 
encoded variant and alternate characters for many scripts.


Well, yes, rejected many times in cases where that was appropriate. But 
also accepted many times, in cases that we may not even remember, 
because they may not even have been made explicitly. Because in such 
cases, the focus may not be on a change to one or a few letter shapes, 
but the focus may be on a change of the overall style, which induces a 
change of letter shape in some letters. The roman/italic a/ɑ and g/ɡ 
distinctions (the latter code points only used to show the distinction in 
plain text, which could as well be done descriptively), as well as a 
large number of distinctions in Han fonts, come to my mind. I'm quite 
sure other scripts have similar phenomena.




This, in a nutshell, is the criterion for making something a font difference 
vs. an encoding distinction.


Character identity is not defined by any single criterion. Moreover, in 
Deseret, it is not the case that all texts which contain the diphthong /juː/ or 
/ɔɪ/ write it using EW Ч or OI Ц. Many write them as Y + U ЏЋ and O + I ЄІ. So 
the choice is one of *spelling*, and spelling has always been a primary 
criterion for such decisions.


This is interesting information. You are saying that in actual practice, 
there is a choice between writing ЄІ (two letters for a diphthong) and 
writing Ч. In the same location, is ІЋ (the base for the historically 
later shape variant of Ч; please note that this may actually be written 
ЋІ; there's some inconsistency in order between the above cited 
sentence and the text below copied from an earlier mail) also used as a 
spelling variant? Overall, we may have up to four variants, of which 
three are currently explicitly supported in Unicode. Are all of these 
used as spelling variants? Is the choice of variant up to the author 
(for which variants), or is it the editor or printer who makes the 
choice (for which variants)? And what informs this choice? If we have 
any historic metal types, are there examples where a font contains both 
ligature variants?


(Please note that because Є, І, and Ћ are available as individual 
letters, it's very difficult to think about the two-letter sequences as 
anything else than spellings, but that doesn't necessarily carry over to 
the ligatures.)


And then the same questions, with parallel (or not parallel) answers, 
for ɒɪ/ɔɪ/Ц.


Regards,   Martin.


Text copied from earlier mail by Michael:


1. The 1855 glyph for Ч EW is evidently a ligature of the glyph for the 
diagonal stroke of the glyph for І SHORT I [ɪ] and Ѕ LONG OO [uː], 
that is, [ɪ] + [oː] = [ɪuː], that is, [ju].


2. The 1855 glyph for Ц OI is evidently a ligature of the glyph for Љ 
SHORT AH [ɒ] and the diagonal stroke of the glyph for І SHORT I [ɪ], 
that is, [ɒ] + [ɪ] = [ɒɪ], that is, [ɔɪ].


That’s encoded. Now 

Re: Standaridized variation sequences for the Desert alphabet?

2017-03-26 Thread Martin J. Dürst

On 2017/03/26 22:15, Michael Everson wrote:



On 26 Mar 2017, at 09:12, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:


That's a good point: any disunification requires showing examples of
contrasting uses.


Fully agreed.


The default position is NOT “everything is encoded unified until disunified”.


Nor is it "everything is encoded separately unless it's unified".



The characters in question have different and undisputed origins, undisputed.


If you change that to the somewhat more neutral "the shapes in question 
have different and undisputed origins", then I'm with you. I actually 
have said as much (in different words) in an earlier post.




We’ve encoded one pair; evidently this pair was deprecated and another pair was 
devised. The letters wynn and w are also used for the same thing. They too have 
different origins and are encoded separately. The letters yogh and ezh have 
different origins and are encoded separately. (These are not perfect analogies, 
but they are pertinent.)


Fine. I (and others) have also given quite a few analogies, none of them 
perfect, but most if not all of them pertinent.




We haven't yet heard of any contrasting uses for the letter shapes we are 
discussing.


Contrasting use is NOT the only criterion we apply when establishing the 
characterhood of characters.


Sorry, but where did I say that it's the only criterion? I don't think 
it's the only criterion. On the other hand, I also don't think that 
historical origin is or should be the only criterion.


Unfortunately, much of what you wrote gave me the impression that you 
may think that historical origin is the only criterion, or a criterion 
that trumps all others. If you don't think so, it would be good if you 
could confirm this. If you think so, it would be good to know why.




Please try to remember that. (It’s a bit shocking to have to remind people of 
this.


You don't have to remind me, at least. I have mentioned "usability for 
average users in average contexts" and "contrasting use" as criteria, 
and I have also in earlier mail acknowledged history as a (not the) 
criterion, and have mentioned legacy/roundtrip issues. I'm sure there 
are others.



Regards,   Martin.


Re: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?)

2017-03-26 Thread Martin J. Dürst

On 2017/03/25 03:33, Doug Ewell wrote:

Philippe Verdy wrote:


But Unicode just prefered to keep the roundtrip compatiblity with
earlier 8-bit encodings (including existing ISO 8859 and DIN
standards) so that "ü" in German and French also have the same
canonical decomposition even if the diacritic is a diaeresis in French
and an umlaut in German, with different semantics and origins.


Was this only about compatibility, or perhaps also that the two signs
look identical and that disunifying them would have caused endless
confusion and misuse among users?


I'm not sure to what extent this was explicitly discussed when Unicode 
was created. The fact that the first 256 code points are identical to 
those in ISO-8859-1 was used as a big selling point when Unicode was 
first introduced. It may well have been that for Unicode, there was no 
discussion at all in this area, because ISO-8859-1 was already so well 
established.


And for ISO-8859-1, space was an important concern. Ideally, both 
Islandic and Turkish (and the letters missed for French) would have been 
covered, but that wasn't possible. Disunifying diaeresis and umlaut 
would have been an unaffordable luxury.


The above reasons mask any inherent reasons for why diaeresis and umlaut 
would have been unified or not if the decision had been argued purely 
"on the merit". But having used both German and French, and e.g. looking 
at the situation in Switzerland, where it was important to be able to 
write both French and German on the same typewriter, I would definitely 
argue that disunifying them would have caused endless

confusion and errors among users.

Also, it was argued a few mails ago that diaeresis and umlaut don't look 
exactly the same. I remember well that when Apple introduced its first 
laser printers, there were widespread complaints that the fonts (was it 
Helvetica, Times Roman, and Palatino?) unified away the traditional 
differences in the cuts of these typefaces for different languages.


So to quite some extent, in the relevant period (i.e. 1970ies/80ies), 
the differences between diaeresis and umlaut may be due to design 
differences in the cuts for different languages (e.g. French and 
German). Nobody would have disunified some basic letters because they 
may have looked slightly different in cuts for different languages, and 
so people may also have been just fine with unifying diaeresis and 
umlaut. (German fonts e.g. may have contained a 'ë' for use e.g. with 
"Citroën", but the dots on that 'ë' will have been the same shape as 
'ä', 'ö', and 'ü' umlauts for design consistency, and the other way 
round for French).


Regards,   Martin.


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-26 Thread Martin J. Dürst

On 2017/03/26 11:24, Philippe Verdy wrote:


That's a good point: any disunification requires showing examples of
contrasting uses.


Fully agreed. We haven't yet heard of any contrasting uses for the 
letter shapes we are discussing.



Now depending on individual publications, authors would
use one character or the other according to their choice, and the encoding
will respect it. If we need further unification for matching texts in the
samer language across periods of time or authors, collation (UCA) can
provide help: this is already what it does in modern German with the digram
"ae" and the letter "ä" which are orthographic variants not distinguished
by the language but by authors' preference.


Well, in most cases, but not e.g. for names. Goethe is not spelled Göthe.
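A crude sketch of the kind of matching discussed here — note this is a naive fold for illustration only, not UCA; a real implementation would use a tailored collator. It also shows exactly the limitation with names:

```python
def fold_de(s: str) -> str:
    """Naively fold German umlauts (and sharp s) to their digraph
    spellings, so that spelling variants compare equal."""
    table = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
             "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
    return "".join(table.get(c, c) for c in s)

print(fold_de("Bär") == fold_de("Baer"))      # True: spelling variants match
print(fold_de("Göthe") == fold_de("Goethe"))  # also True -- the fold cannot
# know that the name Goethe must never be written Göthe, which is exactly
# the caveat about names noted above.
```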

Regards,   Martin.


Re: Standaridized variation sequences for the Deseret alphabet?

2017-03-24 Thread Martin J. Dürst

On 2017/03/23 22:32, Michael Everson wrote:


What is right for Deseret has to be decided by and for Deseret users, rather 
than by script historians.


Odd. That view doesn’t seem to be applicable to CJK unification.


Well, it may not seem to you, but actually it is. I have had a lot of 
discussions with Japanese and others about Han unification (mostly in 
the '90ies), and have studied the history and principles of Han 
unification in quite some detail.


To summarize it, Han unification unifies very much exactly those cases 
where an average user, in average texts, would consider two forms "the 
same" (i.e. exchangeable). Exceptions are due to the round trip rule. It 
also separates very much exactly those cases where an average user, for 
average texts, may not consider two forms equivalent.


If necessary, I can go into further details, but I would have to dig 
quite deeply for some of the sources.


Regards,   Martin.


Re: Standaridized variation sequences for the Desert alphabet?

2017-03-24 Thread Martin J. Dürst

On 2017/03/23 22:48, Michael Everson wrote:


Indeed I would say to John Jenkins and Ken Beesley that the richness of the 
history of the Deseret alphabet would be impoverished by treating the 1859 
letters as identical to the 1855 letters.


Well, I might be completely wrong, but John Jenkins may be the person on 
this list closest to an actual user of Deseret (John, please correct me 
if I'm wrong one way or another).


It may be that actual users of Deseret read these character variants the 
same way most of us would read serif vs. sans-serif variants: I.e. 
unless we are designers or typographers, we don't actually consciously 
notice the difference. If that's the case, it would be utterly annoying 
to these actual users to have to make a distinction between two 
characters where there actually is none.


The richness of the history of the Deseret alphabet can still be 
preserved e.g. with different fonts the same way we have thousands of 
different fonts for Latin and many other scripts that show a lot of rich 
history.


Regards,   Martin.


Re: Standaridized variation sequences for the Deseret alphabet?

2017-03-23 Thread Martin J. Dürst

Hello Michael, others,

[Fixed script name in subject.]

On 2017/03/23 09:03, Michael Everson wrote:

On 22 Mar 2017, at 21:39, David Starner  wrote:



There's the same characters here, written in different ways.


No, it’s not. It's the same diphthong (a sound) written with different letters.


I think this may well be the *historically* correct analysis. And that 
may have some influence on how to encode this, but it shouldn't be dominant.


What's most important is (past and) *current use*. If the distinction is 
an orthographic one (e.g. different words being written with different 
shapes), then that's definitely a good indication for splitting.


On the other hand, if fonts (before/outside Unicode) only include one 
variant at the time, if people read over the variant without much ado, 
if people would be surprised to find both corresponding variants in one 
and the same text (absent font variations), if there are examples where 
e.g. the variant is adjusted in quotes from texts that used the 'old' 
variant inside a text with the 'new' variants, and so on, then all these 
would be good indications that this is, for actual usage purposes, just 
a font difference, and should therefore best be handled as such.


The closest case to the current one that I was able to find was the German ß. 
It has roots in both an ss and an sz (to be precise, an ſs and an ſz) 
ligature (see https://en.wikipedia.org/wiki/ß). And indeed in some 
fonts, its right part looks more like an s, and in other fonts more like 
a z (and in lower case, more often like an s, but in upper case, much 
more like a (cursive) Z). Nevertheless, there is only one character (or 
two if you count upper case) encoded, because anything else would be 
highly confusing to virtually all users.
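The single-character treatment of ß has an easily verified consequence in case mapping (Python shown; U+1E9E LATIN CAPITAL LETTER SHARP S was added separately, in Unicode 5.1):

```python
# The default uppercase mapping of U+00DF is the two-letter sequence "SS".
print("ß".upper())        # SS
# The capital sharp s, U+1E9E, lowercases back to the single character.
print("ẞ".lower())        # ß
# Lowercasing does not recover ß from "SS" -- the mapping is one-way.
print("STRASSE".lower())  # strasse
```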


What is right for Deseret has to be decided by and for Deseret users, 
rather than by script historians.


Regards,   Martin.


The glyphs may come from a different origin, but it's encoding the same idea.


We don’t encode diphthongs. We encode the elements of writing systems. The 
“idea” here is represented by one ligature of І + Ѕ (1855 EW), one ligature of 
І + Ћ (1859 EW), one ligature of Љ + І (1855 OI), and one ligature of Ѓ + І 
(1859 OI).

Those ligatures are not glyph variants of one another. You might as well say 
that Æ and Œ are glyph variants of one another.


If a user community considers them separate, then they should be separated, but 
I don't see that happening, and from an idealistic perspective, I think they're 
platonically the same.


I do not agree with that analysis. The ligatures and their constituent parts 
are distinct and distinctive. In fact, it might have been that the choice for 
revision was to improve the underlying phonology. In any case, there’s no way 
that the bottom pair in 
https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg
 can be considered to be “glyph variants” of the top pair. Usage is one thing. 
Character identity is another. Æ is not Œ. A ligature of І + Ѕ is not a 
ligature of І + Ћ.

Michael Everson


Re: Superscript and Subscript Characters in General Use

2017-01-11 Thread Martin J. Dürst

On 2017/01/11 17:32, Richard Wordingham wrote:


The truly straight Unicode approach in HTML is to use 19⁄45.
Just entering those 5 characters into a text entry box in Firefox gave
me a properly formatted vulgar fraction.  That is how vulgar fractions
are supposed to work.  Unfortunately, one may need to avoid 'exciting
new fonts' in favour of those with a large, working repertoire.


Just for the record: The vulgar fraction display also happened in 
Thunderbird (on Windows). Firefox and Thunderbird use the same display 
engine. I have switched HTML display off, because I prefer to read all 
my mail in plain text, but it still worked.
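The five characters in question are the four digits plus U+2044 FRACTION SLASH; a renderer that supports fraction formation displays them as a vulgar fraction. Building the sequence:

```python
import unicodedata

# 1 9 FRACTION-SLASH 4 5 -- five characters in all
frac = "19" + "\u2044" + "45"
print(frac)                        # 19⁄45
print(len(frac))                   # 5
print(unicodedata.name("\u2044"))  # FRACTION SLASH
```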


Regards,   Martin.


Re: WAP Pictogram Specification as Emoji Source

2017-01-06 Thread Martin J. Dürst

On 2017/01/07 08:21, Christoph Päper wrote:

I just discovered the WAP Pictogram specification (WAP-213-WAPInterPic), last 
published in April 2001 and updated in November 2001.



I haven’t found any reference or vendor-specific images, by the way, and if it 
wasn’t just used as an example domain anyway, pict.com seems now defunct.


Isn't WAP overall pretty much defunct these days?

(Well, many including me predicted as much pretty much when it first 
showed up.)


Regards,   Martin.


Re: IdnaTest.txt and RFC 5893

2017-01-04 Thread Martin J. Dürst

Hello Alastair,

On 2016/12/06 20:51, Alastair Houghton wrote:

Hi all,

I must be missing something; in IdnaTest.txt, in the BIDI TESTS section, there 
are examples like (line 74)


Can you tell us where you got IdnaTest.txt from?


  B;0à.\u05D0;  ;   xn--0-sfa.xn--4db   #   0à.א

which the file alleges are valid, but I cannot for the life of me see why.  
First, “0à.א” is clearly a “Bidi domain name” since it has at least one RTL 
label, “א”.  As such, the Bidi Rule (RFC 5893 section 2) should be applied to 
its labels, and the label “0à” fails [B1], since the first character has Bidi 
property EN, not L, R or AL.


On first sight, it looks to me as if you're correct.

For the exact interpretation of RFC 5893, you'd better write to the 
mailing list of the former IDNA(bis) WG at idna-upd...@alvestrand.no.


Regards,   Martin.


Similarly (line 93)

  B;àˇ.\u05D0;  ;   xn--0ca88g.xn--4db  #   àˇ.א

Again, “àˇ.א” is clearly a “Bidi domain name”, but “àˇ” fails [B6], because “ˇ” 
has Bidi property ON, not L, EN or NSM.

Have I misunderstood something fundamental here?  Could someone explain why 
those examples are valid, in spite of RFC 5893?

Kind regards,

Alastair.

--
http://alastairs-place.net


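For reference, the check that Alastair applies — rule [B1] of RFC 5893, which requires the first character of a label to have Bidi property L (LTR label) or R/AL (RTL label) — can be reproduced with Python's unicodedata. This is a sketch of rule B1 only, not the full Bidi Rule:

```python
import unicodedata

def first_char_bidi_ok(label: str) -> bool:
    """Rule B1 of RFC 5893: the Bidi property of the first character
    of a label must be L, R, or AL."""
    return unicodedata.bidirectional(label[0]) in ("L", "R", "AL")

print(unicodedata.bidirectional("0"))    # EN
print(first_char_bidi_ok("0à"))          # False -- fails B1, as argued above
print(first_char_bidi_ok("\u05D0"))      # True  (א has Bidi property R)
```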



--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan


Re: Best practices for replacing UTF-8 overlongs

2016-12-19 Thread Martin J. Dürst

On 2016/12/20 11:35, Tex Texin wrote:

Shawn,

Ok, but that begs the questions of what to do...
"All bets are off" is not instructive.


Well, it may be instructive in that it's difficult to get software to 
decide what happened. A human may be in a better position to analyze the 
error and the cause(s) of the error, and to fix these.



How software behaves in the face of invalid bytes, what it does with them, what 
it does about them, and how it continues (or not) still needs to be determined.


Yes, but that will depend on circumstances. In a safety-critical 
application, you'll want to do something different than if you are 
sending the text to a printer just to have a look at it.


Regards,   Martin.
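As a concrete illustration of the options, here is how Python's decoder treats the classic overlong encoding of "/" (0xC0 0xAF): strict decoding refuses it outright, while errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER — neither ever silently yields the "/" an attacker intended:

```python
overlong = b"\xc0\xaf"  # overlong (invalid) 2-byte encoding of U+002F "/"

try:
    overlong.decode("utf-8")            # strict (the default): rejected
except UnicodeDecodeError as e:
    print("strict decoding failed:", e.reason)

# Replacement policy: each invalid byte becomes U+FFFD, never "/".
print(overlong.decode("utf-8", errors="replace"))
```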


Re: Mixed-Script confusables in prog.languages

2016-12-05 Thread Martin J. Dürst

On 2016/12/05 04:07, Philippe Verdy wrote:


In more technical programming languages however, you can usually be much
more restrictive as the identifiers used are generally abbreviated and
simplified: you can kill lettercase differences for example,


In some languages maybe. But languages such as perl, C, Java, Ruby, 
Python, and so on distinguish case. Ruby starts constants (incl. class 
names) with Upper case, and variable names with lower case, so it needs 
this distinction more than e.g. C, where such distinctions may be used 
as conventions, but are not enforced by the language.


Anyway, my guess is that non-latin variable names will mostly be used in 
education and otherwise locally restricted circumstances (e.g. 
government projects), so I think that makes the chances of spoofing 
(other than self-spoofing) pretty low.


Regards,   Martin.


Re: Mixed-Script confusables in prog.languages

2016-12-05 Thread Martin J. Dürst



On 2016/12/05 17:31, Reini Urban wrote:


ψ_S contains Greek U+03C8, Common and Latin. Since Latin and Common are always 
allowed, the only
new script is Greek. The first non-default script is automatically and silently 
allowed, only a mix with another
non-default script, such as Cyrillic would error or need an explicit 
declaration.

So ψ_S alone is fine, if everything else is Greek.
But mixing with the Cyrillic version would lead to an error.


Allowing mixing of Greek and Latin (or Cyrillic and Latin) would be a 
big problem. As an example, it would allow A_Α (the second letter is a 
Greek one).
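The spoofing problem is easy to demonstrate with Python's unicodedata: the two identifiers below look identical but differ at the code-point level.

```python
import unicodedata

a, b = "A_A", "A_Α"  # second string ends in U+0391 GREEK CAPITAL LETTER ALPHA
print(a == b)        # False, although the two render identically
for ch in (a[-1], b[-1]):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0041 LATIN CAPITAL LETTER A
# U+0391 GREEK CAPITAL LETTER ALPHA
```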



Amharic is not defined as a UCD script property. Its alphabet is called Ge’ez, which we call
which we call
Ethiopic in the UCD. But that’s all I know. I’m not a domain expert. Does 
Ethiopic uses
other Semitic scripts in its alphabet or is it complete?


It's complete. I have never heard that it would need Arabic or Hebrew or 
some such.





How about the many Indian scripts? Do they mix?
Being an indian movie expert tells me that indian languages usually don’t mix.
They make Tamil and Bengali versions of Hindi movies, and usually fall back to 
english to
get common points across the barrier. But their scripts? No idea.


I don't think they mix two different scripts in the same word. Would be 
very confusing.




In the examples in perl which partially came from parrot there’s a wild 
eclectic mix of various scripts
which do make no sense at all. So I don’t know if I can trust those tests, that 
they make sense and
are readable at all. My guess is that the authors just liked code golfing and 
picked random unicode
characters. It’s from perl after all.

Such as this perl test t/mro/isa_c3_utf8.t

use utf8 qw( Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam Hiragana );

...
package 캎oẃ;
package urḲḵk;
@urḲḵk::ISA = 'kഌoんḰ';
package к;
@urḲḵk::ISA = ('kഌoんḰ', '캎oẃ');
package ṭ화ckэ;
...

These identifiers are unreadable, because I don’t assume that anybody will be 
able to understand
Hangul Cyrillic Ethiopic Canadian_Aboriginal Malayalam and Hiragana at once.
I understand a bit Hangul, Cyrillic and Hiragana, but the mix sounds highly 
illegal to me.


The mixes aren't illegal, in that they are not against any law. But they 
are completely unintelligible garbage anyway.


Regards,   Martin.


Re: "Oh that's what you meant!: reducing emoji misunderstanding"

2016-11-18 Thread Martin J. Dürst
In many cases, emoji communication is a lot more complicated than just 
copying word order from the host language. See e.g.

https://www.wired.com/2016/08/how-teens-use-social-media/ for some examples.

Regards,   Martin.

On 2016/11/18 18:26, Andre Schappo wrote:


 As Richard Ishida insightfully points out — should Emoji sequences/phrases/sentences 
adhere to the human language context eg a Japanese Emoji sequence could/should be in 
Japanese "Subject - Object - Verb" order 
https://twitter.com/r12a/status/798151134963757056

André Schappo

On 18 Nov 2016, at 07:40, Philippe Verdy 
<verd...@wanadoo.fr<mailto:verd...@wanadoo.fr>> wrote:

I would even add the Emojis are in fact a new separate language, written with 
its own script, its own grammar/syntax, and its specific layout and 
combinations (ligatured clusters, partly documented in Unicode) and sometimes 
specificities about colors of rendering (e.g. the human skin colors, or 
national flags if they are colorized).

I think it would merit a language code for itself. But you could use some special 
language codes for notations, if "zxx" (no lingusitic content) is not 
appropriate. (same remark about musical notations)

2016-11-18 7:06 GMT+01:00 James Kass 
<jameskass...@gmail.com<mailto:jameskass...@gmail.com>>:

Philippe Verdy wrote,


There's no evident and universal way to convert
emojis to natural language ...


Indeed.  Emoji characters apparently mean whatever their users want them to 
mean.  Such meanings may be perceived differently by various users or  
communities, as the subject line indicates, and these meanings are subject to 
change without notice.  Any effort to standardize such a conversion seems 
doomed, but someone with funding would probably try it anyway.

Best regards,

James Kass







Re: Possible to add new precomposed characters for local language in Togo?

2016-11-15 Thread Martin J. Dürst

Hello Marcel,

On 2016/11/12 07:35, Marcel Schneider wrote:


For lack of anything better, and faced with Microsoftʼs one weekʼs silence, I
now suggest to make a wider use of the Vietnamese text representation scheme
that Microsoft implemented for Vietnamese, that is documented in TUS [1], and
that might be of wider interest for all tone mark using languages, including
but not limited to Ga and other languages of Togo and other countries of Africa,
and Lithuanian:

— Vowels with diacritics that are not tone marks, e. g. 6 out of the 12 
Vietnamese
vowels as shown in Figure 7-3. of TUS 9.0 [2] are represented in NFC and entered
either with live keys or with a dead key - live key combination;



[1] The Unicode Standard 9.0, ch. 7 Europe-I, §7.1 Latin, sh. Vietnamese:
http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G19663

[2] http://www.unicode.org/versions/Unicode9.0.0/ch07.pdf#G17544


I'm sorry, but I didn't get the fragment identifiers (#G19663, #G17544) 
to work. Can you tell me which pages/paragraphs you refer to here?


Thanks and regards,   Martin.
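The NFC text-representation scheme referred to above can be illustrated with Python's unicodedata: a base vowel plus combining mark composes into a single precomposed character whenever one is encoded. (The Vietnamese example below is mine, not taken from the cited chapter.)

```python
import unicodedata

decomposed = "e\u0301"  # e + U+0301 COMBINING ACUTE ACCENT
nfc = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(nfc))  # 2 1
print(nfc)                        # é (U+00E9)

# Vietnamese ế = e + circumflex + acute also has a precomposed form:
print(unicodedata.normalize("NFC", "e\u0302\u0301") == "\u1ebf")  # True
```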

