Re: Why do binary files contain text but text files don't contain binary?

2020-02-21 Thread Ken Whistler via Unicode


On 2/21/2020 7:53 AM, Costello, Roger L. via Unicode wrote:


Text files may indeed contain binary (i.e., bytes that are not 
interpretable as characters). Namely, text files may contain newlines, 
tabs, and some other invisible things.


Question: "characters" are defined as only the visible things, right?

No. You've gone astray right there. Please read Chapter 2 of the Unicode 
Standard, and in particular, Section 2.4, Code Points and Characters:


https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564

All of those types of characters can occur in Unicode plain text (with 
the exception of surrogate code points).
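The point that invisible controls are still characters can be checked directly with Python's stdlib; a small illustrative sketch (not from the thread):

```python
# Newlines and tabs are not "binary": Unicode assigns them code points
# and properties like any other character. Their General Category is
# Cc (Control); other invisible characters, such as ZERO WIDTH JOINER,
# are Cf (Format). All of them are characters.
import unicodedata

for ch in ('\n', '\t', '\u200d'):   # newline, tab, ZERO WIDTH JOINER
    print(f'U+{ord(ch):04X}  General_Category={unicodedata.category(ch)}')
```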



I conclude:

Binary files may contain arbitrary text.


Binary files can contain *whatever*, including text.


Text files may contain binary, but only a restricted set of binary.

The distinction is definitional. A text file contains *only* characters, 
interpretable by a specific character encoding (usually Unicode, these 
days).


But a text file need not be "plain text". An HTML file is an example of 
a text file (it contains only a sequence of characters, whose identity 
and interpretation is all clearly specified by looking them up in the 
Unicode Standard), but it is not *plain* text. It is *rich* text, 
consisting of markup tags interspersed with runs of plain text.


Another distinction that may be leading you astray is the distinction 
between binary file transfer and text file transfer. If you are using 
ftp, for example, you can specify use of binary file transfer, *even if* 
the file you are transferring is actually a text file. That simply means 
that the file transfer will agree to treat the entire file as a binary 
blob and transfer it byte-for-byte intact. A text file transfer, on the 
other hand, may look for "lines" in a text file and may adjust line 
endings to suit the receiving platform conventions.
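A minimal sketch (not real ftp code) of the difference: a text-mode transfer may rewrite line endings for the receiving platform, while a binary-mode transfer moves the bytes through untouched.

```python
# Text mode: normalize any of CRLF / CR / LF to the target convention.
# Binary mode: byte-for-byte intact.
def text_mode_transfer(data: bytes, target_eol: bytes = b'\r\n') -> bytes:
    return (data.replace(b'\r\n', b'\n')
                .replace(b'\r', b'\n')
                .replace(b'\n', target_eol))

def binary_mode_transfer(data: bytes) -> bytes:
    return data

unix_file = b'line one\nline two\n'
print(text_mode_transfer(unix_file))    # b'line one\r\nline two\r\n'
print(binary_mode_transfer(unix_file))  # unchanged
```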



Do you agree?


No.

--Ken



Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Ken Whistler via Unicode
Well, no, in this case "strange" means strange, as Ken Lunde notes. I'm 
just pointing to his list, because it pulls together quite a few Han 
characters that *also* have dubious cases for encoding.


Or you could turn the argument around, I suppose, and note that just 
because the hieroglyph for "Egyptologist" is strange, that doesn't 
necessarily mean that the case for encoding it is dubious. ;-)


--Ken

On 2/13/2020 3:47 PM, j...@koremail.com wrote:
An interesting comparison, if strange means dubious, then the name 
kstrange should be changed or some of the content removed because many 
of the characters in the set are not dubious in the least.




Re: Egyptian Hieroglyph Man with a Laptop

2020-02-13 Thread Ken Whistler via Unicode

You want "dubious"?!

You should see the hundreds of strange characters already encoded in the 
CJK *Unified* Ideographs blocks, as recently documented in great detail 
by Ken Lunde:


https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

Compared to many of those, a hieroglyph of a man (or woman) holding a 
laptop is positively orthodox!


--Ken

On 2/13/2020 11:47 AM, Phake Nick via Unicode wrote:
Those characters could also be put into another block for the same 
script similar to how dubious characters in CJK are included by 
placing them into "CJK Compatibility Ideographs" for round trip 
compatibility with source encoding.


Re: Combining Marks and Variation Selectors

2020-02-02 Thread Ken Whistler via Unicode

Richard,

What it comes down to is avoidance of conundrums involving canonical 
reordering for normalization. The effect of variation selectors is 
defined in terms of an immediate adjacency. If you allowed variation 
selectors to be defined for combining marks of ccc!=0, then 
normalization of sequences could, in principle, move the two apart. That 
would make implementation of the intended rendering much more difficult.


That is basically why the UTC, from the start, ruled out using variation 
selectors to try to make graphic distinctions between different styles 
of acute accent marks explicit, for example.
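The reordering conundrum can be made concrete with Python's stdlib. The sequence below is purely hypothetical (no variation sequence is defined for a combining mark; that is exactly what the UTC ruled out), but it shows what canonical reordering would do if one were:

```python
# Hypothetical: VS1 placed after COMBINING DOT BELOW (ccc=220), which
# itself follows COMBINING ACUTE ACCENT (ccc=230). Canonical reordering
# sorts the marks by combining class, separating VS1 from the mark it
# was "supposed" to modify.
import unicodedata

s = 'a\u0301\u0323\ufe00'            # a, acute (230), dot below (220), VS1
d = unicodedata.normalize('NFD', s)
print([f'U+{ord(c):04X}' for c in d])
# -> ['U+0061', 'U+0323', 'U+0301', 'U+FE00']: VS1 now follows the acute.
```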


--Ken

On 2/1/2020 7:30 PM, Richard Wordingham via Unicode wrote:

Ah, I missed that change from Version 5.0, where the restriction was,
'The base character in a variation sequence is never a combining
character or a decomposable character'.  I now need to rephrase the
question.  Why are marks other than spacing marks prohibited?



Re: Adding Experimental Control Characters for Tai Tham

2020-01-29 Thread Ken Whistler via Unicode

Richard,

Given that those particular two variation selectors have already been 
given very specific semantics for emoji sequences, and would now be 
expected to occur *only* in emoji sequences:


https://www.unicode.org/reports/tr51/#def_text_presentation_selector

usurping them to do something unrelated would probably not be a good idea.

For experimentation purposes, VS13 and VS14 would be safer.

--Ken

On 1/25/2020 10:41 AM, Richard Wordingham via Unicode wrote:

How inappropriate would it be to usurp a pair of variation selectors
for this purpose?  For mnemonic purposes, I would suggest usurping

FE0E VARIATION SELECTOR-15 for *1A8E TAI THAM SIGN INITIAL
FE0F VARIATION SELECTOR-16 for *1A8F TAI THAM SIGN FINAL


Re: Not accepted by UTC but in ISO ballot?

2019-12-27 Thread Ken Whistler via Unicode

Shriramana,

That category is used to track character(s) in process that may have 
been approved by WG2 but are not yet in ballot, or are in contention, 
and may have just been dropped from ballot, but which still have 
sufficient visibility to be tracked.


The process is a bit rough around the edges when dealing with two 
separate committees with asynchronous processes and not all of whose 
members have unanimous agreement about what they are moving forward on. 
The pipeline is a means of tracking the various statuses as the committees 
work to synchronize their eventual publications of new repertoire.


--Ken

On 12/27/2019 7:06 AM, Shriramana Sharma via Unicode wrote:
Now I'm wondering about the similar category "not accepted by UTC, and 
not in ISO ballot" – why such a character would be mentioned on the 
pipeline at all…


Re: Not accepted by UTC but in ISO ballot?

2019-12-26 Thread Ken Whistler via Unicode

Shriramana,

On 12/20/2019 6:29 PM, Shriramana Sharma via Unicode wrote:

I was looking at the pipeline for something else, and for the first
time I see a character category: “not accepted by the UTC but in ISO
ballot” and two characters in it.
Those two characters changed status as of December 4, when the 
disposition of comments for CD3 was posted. They will not be part of the 
DIS ballot. The pipeline has now been updated to reflect that change of 
status.


So IIUC while technically people are free to submit a document to the
ISO separately without submitting to UTC, it has always been the
practice to my knowledge to get a character approved by the UTC first.


That is a preferred process, but doesn't always occur. The most obvious 
exception is that large new CJK repertoire additions are developed by 
the IRG and often go into ballot in ISO before the UTC takes a formal 
decision to approve them. CJK Extension G has now been approved for 13.0 
by the UTC, but the entire block was listed in the pipeline for some 
time as "not accepted by UTC, but in active ISO technical ballot" once 
Extension G went into CD balloting.


--Ken



Re: HEAVY EQUALS SIGN

2019-12-20 Thread Ken Whistler via Unicode


On 12/20/2019 7:17 AM, wjgo_10...@btinternet.com via Unicode wrote:
It is indeed interesting that the Notice of Non-Approval itself uses 
italics for emphasis in two places.


That text, at the present time, cannot be expressed in Unicode plain 
text with the emphasis that the Notice of Non-Approval includes.


... which was /precisely/ the point. I'm glad you noticed.

--Ken



Re: New Public Review on QID emoji

2019-10-30 Thread Ken Whistler via Unicode



On 10/30/2019 10:41 AM, wjgo_10...@btinternet.com via Unicode wrote:


At present I have a question to which I cannot find the answer.

Is the QID emoji format, if approved by the Unicode Technical 
Committee going to be sent to the ISO/IEC 10646 committee for 
consideration by that committee?

No.


As the QID emoji format is in a Unicode Technical Standard and does 
not include the encoding of any new _atomic_ characters, I am 
concerned that the answer to the above question may well be along the 
lines of "No" maybe with some reasoning as to why not.

As you surmised.


Yet will a QID emoji essentially be _de facto_ a character even if not 
_de jure_ a character?
That distinction is effectively meaningless. There are any number of 
entities that end users perceive as "characters", which are not 
represented by a single code point in the Unicode Standard (or 10646) -- 
and this has been the case now for decades.



Yet if QID emoji are implemented by Unicode Inc. without also being 
implemented by ISO/IEC 10646 then that could lead to future problems, 
notwithstanding any _de jure_ situation that QID emoji are not 
characters, because they will be much more than Private Use characters 
yet less than characters that are in ISO/IEC 10646.


What you are missing is that *many* emoji are already represented by 
sequences of characters. See emoji modifier sequences, emoji flag 
sequences, and emoji ZWJ sequences. *None* of those is specified in 10646; 
none ever has been, and none ever will be. And yet, there is no de 
jure standardization crisis here, or any interoperability issue for 
emoji arising from that situation.
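The point about sequences is easy to see in any modern programming language; for example (Python, stdlib only):

```python
# One user-perceived emoji, five code points: an emoji ZWJ sequence
# (man, ZWJ, woman, ZWJ, girl), defined by UTS #51, not by 10646.
family = '\U0001F468\u200D\U0001F469\u200D\U0001F467'
print(len(family))                          # 5 code points
print([f'U+{ord(c):04X}' for c in family])
```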




I am in favour of the encoding of the QID emoji mechanism and its 
practical application. However I wonder about what are the 
consequences for interoperability and communication if QID emoji 
become used - maybe quite widely - and yet the tag sequences are not 
discernable in meaning from ISO/IEC 10646 or any related ISO/IEC 
documents.


There may well be interoperability concerns specifically for the QID 
emoji mechanism, but that would be an issue pertaining to the 
architecture of that mechanism specifically. It isn't anything to do 
with the relationship between the Unicode Standard (and UTS #51) and 
ISO/IEC 10646.


--Ken




Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-12 Thread Ken Whistler via Unicode



On 10/12/2019 3:15 AM, Fred Brennan via Unicode wrote:

There seems to be no conscionable reason for such a long delay after the
approval.

If that's just how things are done, fine, I certainly can't change the whole
system. But imagine if you had to wait two years to even have a chance of
using a letter you desperately need to write your language? Imagine if the
letter "Q" was unencoded and Noto refused to add it for two more years?


Well, as long as we are imagining things, then consider a scenario where 
the UTC is presented a proposal for encoding a writing system which is 
reported as an historic artifact of the 18th century, "fallen out of 
normal use", yet encodes it anyway based on the proposal provided in 1999:


https://www.unicode.org/L2/L1999/n1933.pdf

and publishes it in Unicode 3.2 in 2002:

https://www.unicode.org/standard/supported.html

Then imagine that a community works to revive use of that script (now 
known as Baybayin) and extends character use in it based on similar 
characters in related, more contemporaneous scripts, but that the first 
time the UTC actually formally hears about that extension is on July 18, 
2019:


https://www.unicode.org/L2/L2019/19258r-baybayin-ra.pdf

And then imagine that, despite a 17-year gap before this supposedly 
urgent defect in an encoding was reported to the UTC, the UTC in 
fact approves encoding of U+170D TAGALOG LETTER RA at its very *first* 
opportunity, eight days later, on July 26, 2019. Further imagine that 
the UTC immediately publishes what amounts to a "letter of intent" to 
publish this character when it can:


https://www.unicode.org/alloc/Pipeline.html#future

It may then be understandable that some UTC participants might be 
puzzled to be accused of unconscionable delays in this case.


I understand the frustration that you are expressing, but it simply 
isn't feasible for every proposal's advocates to get their particular 
candidates pushed to the front of the line for publication. Unicode 13.0 
is creaking down the track towards its March 10, 2020 publication, but 
it already is contending with 5930 new characters (as well as additional 
emoji sequences beyond that), every one of which was approved by the UTC 
*prior* to July 26, 2019 and all of which are already in some advanced 
stage of ISO ballot consideration.


In the meantime, Baybayin users are inconvenienced, sure, but it is 
unlikely that the interim solutions will just break, because nobody is 
opposed to U+170D TAGALOG LETTER RA, and it is exceedingly unlikely that 
that code point would be moved before its eventual publication in the 
standard in March, 2021.


--Ken




Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-11 Thread Ken Whistler via Unicode

Sorry about the typo there. I meant "the published Version 13.0 next March"

--Ken

On 10/11/2019 10:17 AM, Ken Whistler wrote:
then eventually in the published Version 13.0 next month: 


Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?

2019-10-11 Thread Ken Whistler via Unicode

Short answer is no.

The characters in the pipeline section labeled "Characters Accepted for 
Version 13.0" are what will be in the beta review for 13.0 (look for 
that sometime next month), and then eventually in the published Version 
13.0 next month:


https://www.unicode.org/alloc/Pipeline.html#planned_next_version

Characters listed in the "Characters for Future Versions" table:

https://www.unicode.org/alloc/Pipeline.html#future

are not yet targeted for any particular version. Many of them, including 
the Tagalog letter RA, will end up published in Unicode 14.0, but the 
detailed decisions on what makes it into Unicode 14.0 won't happen until 
sometime next summer.


Production of new versions of the Unicode Standard is a ponderous and 
lengthy operation, involving 4 UTC meetings, uncounted subcommittee 
meetings, dozens of specifications, hundreds of character properties, 
thousands of characters, hundreds of fonts, and intricate charts and QA 
process. It doesn't happen at the drop of a hat, which is why we 
schedule a full year for each new major release.


So, in general, no, you can *never* assume that a character the UTC has 
just approved will be in the next version of Unicode.


--Ken

On 10/11/2019 4:35 AM, Fred Brennan via Unicode wrote:

Many users are asking me and I'm not sure of the answer (nor how to find it
out).

The UTC approved it, so it will be in the next version of Unicode, right?

We sure hope so...it is a character needed to write a script in current use.
Although only a minority of people care about it, that minority is dedicated!

Best,
Fred Brennan


Re: On the lack of a SQUARE TB glyph

2019-09-27 Thread Ken Whistler via Unicode

Fred,

2 hours and 33 minutes from now (today). But you don't need to try to 
synch a proposal like this to a particular script ad hoc meeting. That 
group meets roughly once a month, and any new proposal coming in right 
now wouldn't be on the Unicode 13.0 train, even if the UTC immediately 
agreed to it. So there isn't an immediately urgent deadline for new 
proposals.


--Ken

On 9/26/2019 10:15 PM, Fred Brennan via Unicode wrote:

When does the Script Ad Hoc meet next?


Re: On the lack of a SQUARE TB glyph

2019-09-26 Thread Ken Whistler via Unicode



On 9/26/2019 4:21 AM, Fred Brennan via Unicode wrote:

There is a clear demand for a SQUARE TB. In the font SMotoya Sinkai W55 W3,
which is ©2008 株式会社 モトヤ, the glyph is unencoded and accessed via the
Discretionary Ligatures (`dlig`) OpenType feature. It has name `T_B.dlig`.


Aye, there's the rub. Despite the subject of this thread, the problem is 
not the lack of a "glyph". This and many other particular squared forms 
may exist in Japanese fonts. The question then devolves to whether there 
is a *character* encoding issue here. What data representation and 
interchange issue is being raised here that requires an atomic character 
encoding, when the *presentation* issue can just be handled with 
OpenType features and already existing characters?


If the concern is about future-proofing the standard, then clearly, 
instead of indefinitely extending various groups of squared combinations 
for SI values, other technical values, etc., etc., the generative and 
scalable way forward is simply to let Japanese squared sequence 
coinages be handled with OpenType features, rather than insisting that 
each one come back to the UTC for one-by-one character encoding.


Note that there is a certain, systemic similarity here to the problem of 
extensibility of emoji, where encoding of multiple flags, of multiple 
skin tones, or of multiple gender representations, etc., is handled more 
generally by specifying how fonts need to map specified sequences into 
single glyphs, rather than by insisting that every meaningful 
combination end up encoded as an atomic character.


--Ken



Re: PUA (BMP) planned characters HTML tables

2019-08-14 Thread Ken Whistler via Unicode



On 8/14/2019 4:32 PM, James Kass via Unicode wrote:
If a character gets deprecated, can its decomposition type be changed 
from canonical to compatibility?


Simple answer: No.

--Ken



Re: New website

2019-07-22 Thread Ken Whistler via Unicode
Your helpful suggestions will be passed along to the people working on 
the new site.


In the meantime, please note that the link to the "Unicode Technical 
Site" has been added to the left column of quick links in the page 
bottom banner, so it is easily available now from any page on the new site.


--Ken

On 7/22/2019 9:54 AM, Zachary Carpenter wrote:
It seems that many of the concerns expressed here could be resolved 
with a menu link to the “Unicode Technical Site” on the left-hand menu bar


Re: Akkha script (used by Eastern Magar language) in ISO 15924?

2019-07-22 Thread Ken Whistler via Unicode

See the entry for "Magar Akkha" on:

http://linguistics.berkeley.edu/sei/scripts-not-encoded.html

Anshuman Pandey did preliminary research on this in 2011.

http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf

It would be premature to assign an ISO 15924 script code, pending the 
research to determine whether this script should be separately encoded.


--Ken

On 7/22/2019 9:16 AM, Philippe Verdy via Unicode wrote:
According to Ethnologue, the Eastern Magar language (mgp) is written in 
two scripts: Devanagari and "Akkha".


But the "Akkha" script does not seem to have any ISO 15924 code.

The Ethnologue currently assigns a private use code (Qabl) for this 
script.


Was the addition delayed due to lack of evidence (even if this 
language is official in Nepal and India) ?


Did the editors of Ethnologue submit an addition request for that 
script (e.g. for the code "Akkh" or "Akha" ?)


Or is it considered unified with another script that could explain why 
it is not coded ? If this is a variant it could have its own code 
(like Nastaliq in Arabic). Or may be this is just a subset of another 
(Sino-Tibetan) script ?






Access to the Unicode technical site (was: Re: Unicode's got a new logo?)

2019-07-18 Thread Ken Whistler via Unicode



On 7/18/2019 11:50 AM, Steffen Nurpmeso via Unicode wrote:

I also decided to enter /L2 directly from now on.


For folks wishing to access the UTC document register, Unicode 
Consortium standards, and so forth, all of those links will be 
permanently stable. They are not impacted by the rollout of the new home 
page and its related content.


If you need access to the more technical information from the UTC, 
CLDR-TC, ICU-TC, etc., feel free to bookmark such pages as:


https://www.unicode.org/L2/

for the UTC document register.

https://www.unicode.org/charts/

for the Unicode code charts index,

https://www.unicode.org/versions/latest/

for the latest version of the Unicode Standard, and so forth. All such 
technical links are stable on the site, and will continue to be stable.


For general access to the technical content on the Unicode website, see:

https://www.unicode.org/main.html

which provides easy link access to all the technical content areas and 
to the ongoing technical committee work.


--Ken





Re: ISO 15924 : missing indication of support for Syriac variants

2019-07-18 Thread Ken Whistler via Unicode


On 7/17/2019 4:54 PM, Philippe Verdy via Unicode wrote:
then the Unicode version (age) used for Hieroglyphs should also be 
assigned to Hieratic.

It is already.


In fact the ligatures system for the "cursive" Egyptian Hieratic is so 
complex (and may also have its own variants showing its progression 
from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic 
should no longer be considered "unified" with Hieroglyphs, and its 
existing ISO 15924 code is then not represented at all in Unicode.


It *is* considered unified with Egyptian hieroglyphs, until such time as 
anyone would make a serious case that the Unicode Standard (and students 
of the Egyptian hieroglyphs, in both their classic, monumental forms and 
in hieratic) would be better served by a disunification.


Note that *many* cursive forms of scripts are not easily "supported" by 
out-of-the-box plain text implementations, for obvious reasons. And in 
the case of Egyptian hieroglyphs, it would probably be a good strategy 
to first get some experience in implementations/fonts supporting the 
Unicode 12.0 controls for hieroglyphs, before worrying too much about 
what does or doesn't work to represent hieratic texts adequately. 
(Demotic is clearly a different case.)




For now ISO 15924 still does not consider Egyptian Hieratic to be 
"unified" with Egyptian Hieroglyphs; this is not indicated in its 
descriptive names given in English or French with a suffix like 
"(cursive variant of Egyptian Hieroglyphs)", *and it has no "Unicode 
Age" version given, as if it was still not encoded at all by Unicode*,


That latter part of that statement (highlighted) is false, as is easily 
determined by simple inspection of the Egyh entry on:


https://www.unicode.org/iso15924/iso15924-codes.html

--Ken




Re: Unicode "no-op" Character?

2019-07-03 Thread Ken Whistler via Unicode


On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote:


Is my idea impossible, useless, or contradictory? Not at all.


What you are proposing is in the realm of higher-level protocols.

You could develop such a protocol, and then write processes that honored 
it, or try to convince others to write processes to honor it. You could 
use PUA characters, or non-characters, or existing control codes -- the 
implications for use of any of those would be slightly different, in 
practice, but in any case would be an HLP.


But your idea is not a feasible part of the Unicode Standard. There are 
no "discardable" characters in Unicode -- *by definition*. The 
discussion of "ignorable" characters in the standard is nuanced and 
complicated, because there are some characters which are carefully 
designed to be transparent to some, well-specified processes, but not to 
others. But no characters in the standard are (or can be) ignorable by 
*all* processes, nor can a "discardable" character ever be defined as 
part of the standard.


The fact that there are a myriad of processes implemented (and 
distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) 
conversion to/from UTF-16 by integral type conversion is a simple 
existence proof that U+000F is never, ever, ever, ever going to be 
defined to be "discardable" in the Unicode Standard.
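A sketch of the kind of "integral type conversion" Ken means, with Python ints standing in for 16-bit code units (illustrative only):

```python
# Latin-1 bytes widen to UTF-16 code units one-for-one, and narrow back
# the same way. U+000F rides through both directions untouched; a
# "discardable" U+000F would silently break this round trip.
latin1 = b'abc\x0fdef'                # contains 0x0F (U+000F SHIFT IN)
units = [b for b in latin1]           # widening conversion: byte -> unit
assert all(u < 0x100 for u in units)  # all within the Latin-1 range
roundtrip = bytes(units)              # narrowing conversion back
assert roundtrip == latin1            # byte-for-byte identical
print(units)
```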


--Ken




Re: acute-macron hybrid?

2019-04-30 Thread Ken Whistler via Unicode



On 4/30/2019 12:45 AM, Julian Bradfield via Unicode wrote:

What is its appropriate Unicode representation?


A macron.

--Ken



Re: Variation Sequences (and L2-11/059)

2019-03-13 Thread Ken Whistler via Unicode



On 3/13/2019 2:42 AM, Janusz S. Bień via Unicode wrote:

Hi!

On Mon, Jul 16 2018 at  7:07 +02, Janusz S. Bień via Unicode wrote:

FAQ (http://unicode.org/faq/vs.html) states:

 For historic scripts, the variation sequence provides a useful tool,
 because it can show mistaken or nonce glyphs and relate them to the
 base character. It can also be used to reflect the views of
 scholars, who may see the relation between the glyphs and base
 characters differently. Also, new variation sequences can be added
 for new variant appearances (and their relation to the base
 characters) as more evidence is discovered.

I'm proof-reading a paper where I quote the above fragment and to my
surprise I noticed it's no longer present in the FAQ.


That text is, in fact, still present on the FAQ page in question:

https://www.unicode.org/faq/vs.html#18



So my question are:

1. Does the change mean the change of the official policy of the
Consortium?


Your premise here, however, is mistaken. The FAQ pages do *not* 
represent, and never have represented, official policy of the Unicode 
Consortium. The 
individual FAQ entries are contributed by many people -- some 
attributed, and some not. They are updated or added to periodically by 
various editors, in response to feedback, or as old entries grow 
out-dated, or new issues arise. Those updates are editorial, and do not 
reflect any official decision process by Unicode technical committees or 
officers. The FAQ main page itself points out that "The FAQs are 
contributed by many people," and invites the public to submit possible 
new entries for editing and addition to the list of FAQs.


For official technical content, refer to the published technical 
specifications themselves, which are carefully controlled, versioned, 
and archived.


For official policies of the Unicode Consortium, refer to the Unicode 
Consortium policies page, which is also carefully controlled:


https://www.unicode.org/policies/policies.html



2. Are the archival versions of the FAQ available somewhere?


https://web.archive.org/web/*/https://www.unicode.org/faq/




3. Are the changes to the FAQ documented somehow (a version control
system?)?


No.

--Ken



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Ken Whistler via Unicode

Egmont,

On 2/9/2019 11:48 AM, Egmont Koblinger via Unicode wrote:

Are there any (non-CJK) scripts for which crossword puzzles don't exist?


There are crossword puzzles for Hindi (in the Devanagari script). Just 
do an image search for "Hindi crossword puzzle".


But the conventions for these break up words into syllables fitting into 
the boxes, and the rules for that are complex. You have to allow for the 
placement of dependent vowels, which may take up extra space left or 
right, as well as consonant clusters, which would be expressed often as 
conjuncts in Sanskrit, but which in Hindi are more commonly rendered as 
dead consonant sequences. So the "stuff in a box" is:


1. Inherently proportional width.

2. Inherently multi-character in content. (underlying 1 to 3 or more 
characters per cell)
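The "multi-character in content" point is easy to verify; for example, the Devanagari conjunct क्त ("kta"), a plausible single crossword cell, is three code points (stdlib sketch):

```python
# One visual unit, three code points: KA + VIRAMA + TA renders as the
# conjunct क्त in any Devanagari-capable font.
cell = '\u0915\u094D\u0924'
print(len(cell))                            # 3
print([f'U+{ord(c):04X}' for c in cell])    # ['U+0915', 'U+094D', 'U+0924']
```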


This is the kind of compromise you would have to make for almost 
any Indic script, to enable a rational approach to building crossword 
puzzles that make sense.


And in a terminal context, you probably would not get acceptable 
behavior for Hindi if you tried to just take all the "stuff in a box" 
chunks and tried to lay them out directly in a line, as if the script 
behaved more like CJK.


The existence proof of techniques to cut up text into syllables that 
enable crossword puzzle building is not the same as a determination 
that the script, ipso facto, would work in a terminal context without 
dealing with additional complex script issues.


At any rate, this is once again straying over into the issue of whether 
terminals can be adapted for the requirements of shaping rules for 
complex scripts -- rather than the nominal subject of the thread, which 
has to do with bidi text layout in terminals.


--Ken




Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Ken Whistler via Unicode

Richard,

On 2/1/2019 1:30 PM, Richard Wordingham via Unicode wrote:


Language tagging is already available in Unicode, via the tag characters
in the deprecated plane.


Recte:

1. Plane 14 is not a "deprecated plane".

2. The tag characters in the Tags block (U+E0000..U+E007F) are not 
deprecated. (They are used, for example, by UTS #51 to specify emoji tag 
sequences.)


3. However, the use of U+E0001 LANGUAGE TAG and the mechanism of using 
tag characters for spelling out language tags are explicitly deprecated 
by the standard. See: "Deprecated Use for Language Tagging" in Section 
23.9 Tag Characters.


https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf#G30427

and PropList.txt:

E0001 ; Deprecated # Cf   LANGUAGE TAG

As I stated earlier: language tags should use BCP 47, and belong in the 
markup level, not in the plain text stream.
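The non-deprecated use of the tag characters (point 2 above) can be shown with a real RGI emoji tag sequence, the flag of Wales:

```python
# BLACK FLAG + tag letters spelling "gbwls" + CANCEL TAG: the emoji tag
# sequence for the flag of Wales, per UTS #51.
wales = ('\U0001F3F4'
         + ''.join(chr(0xE0000 + ord(c)) for c in 'gbwls')
         + '\U000E007F')
print([f'U+{ord(c):04X}' for c in wales])
# -> ['U+1F3F4', 'U+E0067', 'U+E0062', 'U+E0077', 'U+E006C', 'U+E0073', 'U+E007F']
```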


--Ken



Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Ken Whistler via Unicode



On 1/31/2019 1:41 AM, Egmont Koblinger via Unicode wrote:

I mean, for
example we can introduce control characters that specify the language.


That is a complete non-starter for the Unicode Standard. And if the 
terminal implementation introduces such things as one-off hacks, they will fail 
completely for interoperability.


https://en.wikipedia.org/wiki/IETF_language_tag

That belongs to the markup level, not to the plain text stream.

--Ken




Re: A last missing link for interoperable representation

2019-01-08 Thread Ken Whistler via Unicode

James,

On 1/8/2019 1:11 PM, James Kass via Unicode wrote:
But we're still using typewriter kludges to represent stress in Latin 
script because there is no Unicode plain text solution.


O.k., that one needs a response.

We are still using kludges to represent stress in the Latin script 
because *orthographies* for most languages customarily written with the 
Latin script don't have clear conventions for indicating stress as a 
part of the orthography.


When an orthography has a well-developed convention for indicating 
stress, then we can look at how that convention is represented in the 
plain text representation of that orthography. An obvious case is 
notational systems for the representation of pronunciation of English 
words in dictionaries. Those conventions *do* then have plain text 
representations in Unicode, because, well, they just have various 
additional characters and/or combining marks to clearly indicate lexical 
stress. But standard written English orthography does *not*. (BTW, that 
is in part because marking stress in written English would usually 
*decrease* legibility and the usefulness of the writing, rather than 
improving it.)
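Where a notation does mark stress, it is ordinary Unicode plain text: IPA-style dictionary transcriptions use U+02C8 and U+02CC. A rough illustration (the transcriptions are approximate, not from the thread):

```python
# Primary and secondary stress marks are plain, encoded characters.
import unicodedata

PRIMARY, SECONDARY = '\u02C8', '\u02CC'
noun = PRIMARY + 'rek\u0254\u02D0d'            # ˈrekɔːd ("record", noun)
verb = 'r\u026A' + PRIMARY + 'k\u0254\u02D0d'  # rɪˈkɔːd ("record", verb)
print(noun, verb)
print(unicodedata.name(PRIMARY))    # MODIFIER LETTER VERTICAL LINE
```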


Furthermore, there is nothing inherent about *stress* per se in the 
Latin script (or any other script, for that matter). Lexical stress is a 
phonological system, not shared or structured the same way in all 
languages. And there are *thousands* of languages written with the Latin 
script -- with all kinds of phonological systems associated with them. 
Some have lexical tones, some do not. Some have other kinds of 
phonological accentuation systems that don't count as lexical stress, 
per se.


And there are differences between lexical stress (and its indication), 
and other kinds of "stress". Contrastive stress, which is way more 
interesting to consider as a part of writing, IMO, than lexical stress, 
is a *prosodic* phenomenon, not a lexical one. (And I have been using 
the email convention of asterisks here to indicate contrastive stress in 
multiple instances.) And contrastive stress is far from the only kind of 
communicatively significant pitch phenomenon in speech that typically 
isn't formally represented in standard orthographies. There are numerous 
complex scoring systems for linguistic prosody that have been developed 
by linguists interested in those phenomena -- which include issues of 
pace and rhythm, and not merely pitch contours and loudness.


It isn't the job of the Unicode Consortium or the Unicode Standard to 
sort that stuff out or to standardize characters to represent it. When 
somebody brings to the UTC written examples of established orthographies 
using character conventions that cannot be clearly conveyed in plain 
text with the Unicode characters we already have, *then* perhaps we will 
have something to talk about.


--Ken




Re: The encoding of the Welsh flag

2018-11-21 Thread Ken Whistler via Unicode

Michael,

On 11/21/2018 9:38 AM, Michael Everson via Unicode wrote:

What really annoys me about this is that there is no flag for Northern Ireland. 
The folks at CLDR did not think to ask either the UK or the Irish 
representatives to SC2 about this.


Neither CLDR-TC nor SC2 has any jurisdiction here, so this is rather a 
non sequitur.


If you or Andrew West or anyone else is interested in pursuing an emoji 
tag sequence for an emoji flag for Northern Ireland, then that should be 
done by submitting a proposal, with justification, to the Emoji 
Subcommittee, which *does* have jurisdiction.


https://unicode.org/emoji/proposals.html

See in particular, Section M of the selection criteria.

--Ken




Re: The encoding of the Welsh flag

2018-11-21 Thread Ken Whistler via Unicode



On 11/21/2018 8:00 AM, William_J_G Overington via Unicode wrote:

Yet the interoperability does not derive from an International Standard.


The interoperability that enabled your mail to be delivered to me derives in 
part from the MIME standard (RFC 2045 et seq.) which is not an International 
Standard, but is instead maintained by the Networking Working Group of IETF.

The interoperability that enabled me to read the content of your mail derives 
from the HTML standard, which is not an International Standard, but is instead 
maintained by the W3C (a consortium).

The interoperability of any flag emoji embedded in that content derives from 
Unicode Technical Standard #51, which is not an International Standard, but is 
instead maintained by the Unicode Consortium.

These standards are all widely used *internationally*, but none of them is an 
International Standard, which is effectively a moniker claimed by ISO for 
itself and its standards.

But in this day and age, expecting all technology, including technology related 
to computational processing, distribution, interchange, and rendering of text, 
to wait around for any related standard to be canonized as an International 
Standard is just silly. The world of technology does not work that way, and 
frankly, folks should be damn glad that it doesn't.

--Ken



Re: The encoding of the Welsh flag

2018-11-20 Thread Ken Whistler via Unicode



On 11/20/2018 12:57 PM, William_J_G Overington via Unicode wrote:

quote

A Unicode Technical Standard (UTS) is an independent specification. Conformance 
to the Unicode Standard does not imply conformance to any UTS.

end quote

My questions are as follows please.

Is that encoding for the Welsh flag included

in both The Unicode Standard and ISO/IEC 10646

or is it only encoded in The Unicode Standard

or is it in neither The Unicode Standard nor ISO/IEC 10646?


Neither.

A flag emoji is represented via a character sequence -- in this 
particular case by an emoji tag sequence, as specified in UTS #51.


The representation of flag emoji via emoji tag sequences is *OUT OF 
SCOPE* for both the Unicode Standard and for ISO/IEC 10646.


If you find that hard to understand, consider another example. The 
spelling of the word "emoji" as the sequence of Unicode characters 
<0065, 006D, 006F, 006A, 0069> is also *OUT OF SCOPE* for both the 
Unicode Standard and for ISO/IEC 10646. Neither standard specifies 
English spelling rules; nor does either standard specify flag emoji 
"spelling rules".




Unless the answer is the first listed possibility, how does that work as 
regards interoperability of sending and receiving a Welsh flag on an electronic 
communication system?


One declares conformance to UTS #51 and declares the version of emoji 
that one's application supports -- including the RGI (recommended for 
general interchange) list of emoji one has input and display support 
for. If the declaration states support for the flags of England, 
Scotland, and Wales, then one must do so via the specified emoji tag 
sequences. Your interoperability derives from that.
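To make the mechanism concrete, here is a minimal Python sketch of how such an emoji tag sequence is assembled per UTS #51: the WAVING BLACK FLAG base, the subdivision code spelled in TAG characters, then CANCEL TAG. (The helper name is mine; the code point arithmetic simply maps each ASCII letter to its TAG counterpart at U+E0000 + that letter.)

```python
def flag_tag_sequence(subdivision_code):
    # Emoji tag sequence per UTS #51: U+1F3F4 WAVING BLACK FLAG as the base,
    # then the subdivision code spelled in TAG characters (each one is the
    # corresponding ASCII character plus 0xE0000), terminated by
    # U+E007F CANCEL TAG.
    tags = ''.join(chr(0xE0000 + ord(c)) for c in subdivision_code)
    return '\U0001F3F4' + tags + '\U000E007F'

wales_flag = flag_tag_sequence('gbwls')      # RGI sequence for the flag of Wales
scotland_flag = flag_tag_sequence('gbsct')   # and for the flag of Scotland
```

Note that the sequence is seven code points long; a renderer that does not support it falls back to displaying the black flag alone.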


--Ken



Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Ken Whistler via Unicode


On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
I was replying not about the notational representation of the DUCET 
data table (using [....] unnecessarily) but about the text of 
UTR#10 itself. Which remains highly confusing, contains completely 
unnecessary steps, and just complicates things with absolutely no 
benefit at all by introducing confusion about these "0000". 


Sorry, Philippe, but the confusion that I am seeing introduced is what 
you are introducing to the unicode list in the course of this discussion.



UTR#10 still does not explicitly state that its use of "0000" does not 
mean it is a valid "weight"; it's a notation only


No, it is explicitly a valid weight. And it is explicitly and 
normatively referred to in the specification of the algorithm. See 
UTS10-D8 (and subsequent definitions), which explicitly depend on a 
definition of "A collation weight whose value is zero." The entire 
statement of what are primary, secondary, tertiary, etc. collation 
elements depends on that definition. And see the tables in Section 3.2, 
which also depend on those definitions.



(but the notation is used for TWO distinct purposes: one is for 
presenting the notation format used in the DUCET


It is *not* just a notation format used in the DUCET -- it is part of 
the normative definitional structure of the algorithm, which then 
percolates down into further definitions and rules and the steps of the 
algorithm.


itself to present how collation elements are structured, the other one 
is for marking the presence of a possible, but not always required, 
encoding of an explicit level separator for encoding sort keys).
That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It 
is not part of the *notation* for collation elements, but instead is a 
magic value chosen for the level separator precisely because zero values 
from the collation elements are removed during sort key construction, so 
that zero is then guaranteed to be a lower value than any remaining 
weight added to the sort key under construction. This part of the 
algorithm is not rocket science, by the way!
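For readers following along, that part of the algorithm (forming a sort key from collation elements, with a zero level separator between levels) can be sketched in a few lines of Python. The weights below are invented for illustration; the point is that zeros are stripped within each level, so a 0 between levels can never collide with a surviving weight.

```python
def form_sort_key(collation_elements):
    # collation_elements: list of (primary, secondary, tertiary) weight tuples.
    # For each level in turn, append every nonzero weight of that level to the
    # key; put a single 0 between levels as the level separator. Since zeros
    # never survive inside a level, 0 sorts below any real weight.
    key = []
    for level, weights in enumerate(zip(*collation_elements)):
        if level > 0:
            key.append(0)                       # level separator
        key.extend(w for w in weights if w)     # drop zero weights
    return tuple(key)

# Two collation elements with made-up weights: a primary element and a
# purely secondary one (primary weight 0).
elements = [(0x28DB, 0x20, 0x2), (0, 0x35, 0x2)]
```

With these inputs the key comes out as (0x28DB, 0, 0x20, 0x35, 0, 0x2, 0x2): all primaries, a separator, all secondaries, a separator, all tertiaries.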


UTR#10 is still needlessly confusing.


O.k., if you think so, you then know what to do:

https://www.unicode.org/review/pri385/

and

https://www.unicode.org/reporting.html

Even the example tables can be made without using these "0000" (for 
example, in tables showing how to build sort keys, one can present the 
list of weights split into separate columns, one column per level, 
without any "0000"). The implementation does not necessarily have to 
create a buffer containing all weight values in a row, when separate 
buffers for each level are far superior (and even more efficient, as 
they can save space in memory).


The UCA doesn't *require* you to do anything particular in your own 
implementation, other than come up with the same results for string 
comparisons. That is clearly stated in the conformance clause of UTS #10.


https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance

The step "S3.2" in the UCA algorithm should not even be there (it is 
made in favor of a specific implementation which is not even efficient 
or optimal),


That is a false statement. Step S3.2 is there to provide a clear 
statement of the algorithm, to guarantee correct results for string 
comparison. Section 9 of UTS #10 provides a whole lunch buffet of 
techniques that implementations can choose from to increase the 
efficiency of their implementations, as they deem appropriate. You are 
free to implement as you choose -- including techniques that do not 
require any level separators. You are, however, duly warned in:


https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators

that "While this technique is relatively easy to implement, it can 
interfere with other compression methods."


it complicates the algorithm with absolutely no benefit at all); you 
can ALWAYS remove it completely and this still generates equivalent 
results.


No you cannot ALWAYS remove it completely. Whether or not your 
implementation can do so, depends on what other techniques you may be 
using to increase performance, store shorter keys, or whatever else may 
be at stake in your optimization.
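A toy illustration of why naive removal can go wrong, using invented weights: take two strings whose primary weight sequences are <1> and <1, 2>, with secondary sequences <3> and <1>. With level separators, the first key correctly sorts lower (its primary sequence is a prefix of the other, so it wins at the primary level); drop the separators and the comparison silently flips, because a secondary weight bleeds into the primary comparison.

```python
def key_with_separators(levels):
    # levels: one list of weights per level; a 0 separates the levels.
    out = []
    for i, level in enumerate(levels):
        if i > 0:
            out.append(0)
        out.extend(level)
    return tuple(out)

def key_without_separators(levels):
    # Naive variant: just concatenate all levels with no separator.
    return tuple(w for level in levels for w in level)

a = [[1], [3]]       # primaries <1>,    secondaries <3>
b = [[1, 2], [1]]    # primaries <1, 2>, secondaries <1>

# (1, 0, 3) < (1, 2, 0, 1): correct -- a wins at the primary level.
correct_order = key_with_separators(a) < key_with_separators(b)
# (1, 3) vs (1, 2, 1): a's secondary weight 3 is compared against
# b's primary weight 2, and the order flips.
naive_order = key_without_separators(a) < key_without_separators(b)
```

This is exactly the kind of interference the specification warns about; an implementation may drop separators only when its other weight-assignment choices make such collisions impossible.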


If you don't like zeroes in collation, be my guest, and ignore them 
completely. Take them out of your tables, and don't use level 
separators. Just make sure you end up with conformant result for 
comparison of strings when you are done. And in the meantime, if you 
want to complain about the text of the specification of UTS #10, then 
provide carefully worded alternatives as suggestions for improvement to 
the text, rather than just endlessly ranting about how the standard is 
confusing because the collation weight "0000" is "unnecessary".


--Ken




Re: A sign/abbreviation for "magister"

2018-10-30 Thread Ken Whistler via Unicode



On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
but we can't seem to agree on how to encode its abbreviation. 


For what it's worth, "mgr" seems to be the usual abbreviation in Polish 
for it.


--Ken



Re: A sign/abbreviation for "magister"

2018-10-29 Thread Ken Whistler via Unicode



On 10/29/2018 8:06 PM, James Kass via Unicode wrote:
could be typed on old-style mechanical typewriters.  Quintessential 
plain-text, that.


Nope. Typewriters were regularly used for underscoring and for 
strikethrough, both of which are *styling* of text, and not plain text. 
The mere fact that some visual aspect of graphic representation on a 
page of paper can be implemented via a mechanical typewriter does not, 
ipso facto, mean that particular feature is plain text. The fact that I 
could also implement superscripting and subscripting on a mechanical 
typewriter via turning the platen up and down half a line, also does not 
make *those* aspects of text styling plain text, either.


The same reasoning applies to handwriting, only more so.

--Ken



Re: Dealing with Georgian capitalization in programming languages

2018-10-09 Thread Ken Whistler via Unicode

Martin,

On 10/9/2018 12:47 AM, Martin J. Dürst via Unicode wrote:

- Using the 'capitalize' method to (try to) get the titlecase
  property of a MTAVRULI character. (There's no other way
  currently in Ruby to get the titlecase property.)

There may be others. If you have some ideas, I'd appreciate to know 
about them.


This lets me wonder why the UTC didn't simply declare the titlecase 
property of MTAVRULI to be mkhedruli. Was this considered or not? The 
way things are currently set up, there seems to be no benefit of 
MTAVRULI being its own titlecase, because in actual use, that requires 
additional processing.


Titlecasing for Georgian was not completely thought through before 
Mtavruli was added. As I noted in my earlier comment on this thread, the 
titlecase mapping values for Mkhedruli were added late in the process, 
when it became clear that not doing so would result in inappropriate 
outcomes for existing Mkhedruli text.


I don't think there is a fully-worked out position on this, but adding a 
Simple_Titlecase mapping for Mtavruli to Mkhedruli would, I suspect, 
just further muddy waters for implementers, because it would be in 
effect saying that an uppercase letter titlecases by shifting to its 
lowercase mapping. A headscratcher, at the very least.


Note that with the current mappings as they are, Changes_When_Titlecased 
is False for all Mkhedruli and for all Mtavruli characters, which I 
think is the desired state of affairs. A titlecasing string operation of 
Mtavruli that does something other than just leave the string alone 
should, IMO, be documented as doing something extra and *should* have to 
do additional processing.


--Ken



Re: Dealing with Georgian capitalization in programming languages

2018-10-02 Thread Ken Whistler via Unicode



On 10/2/2018 12:45 AM, Martin J. Dürst via Unicode wrote:
capitalize: uppercase (or title-case) the first character of the 
string, lowercase the rest



When I say "cause problems", I mean producing mixed-case output. I 
originally thought that 'capitalize' would be fine. It is fine for 
lowercase input: it stays lowercase because the Unicode data indicates that 
titlecase for lowercase Georgian letters is the letter itself. But it 
will produce the apparently undesirable Mixed Case for ALL UPPERCASE 
input.


My questions here are:
- Has this been considered when Georgian Mtavruli was discussed in the
  UTC?

Not explicitly, that I recall. The whole issue of titlecasing came up 
very late in the preparation of case mapping tables for Mtavruli and 
Mkhedruli for 11.0.


But it seems to me that the problem you are citing can be avoided if you 
simply rethink what your "capitalize" means. It really should be 
conceived of as first lowercasing the *entire* string, and then 
titlecasing the *eligible* letters -- i.e., usually the first letter. 
(Note that this allows for the concept that titlecasing might then be 
localized on a per-writing-system basis -- the issue would devolve to 
determining what the rules are for "eligible" letters.) But the simple 
default would just be to titlecase the initial letter of each "word" 
segment of a string.


Note that conceived this way, for the Georgian mappings, where the 
titlecase mapping for Mkhedruli is simply the letter itself, this 
approach ends up with:


capitalize(mkhedrulistring) --> mkhedrulistring

capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> 
mkhedrulistring


Thus avoiding any mixed case.
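That reconception can be sketched directly in Python, whose str methods apply the Unicode case mappings (the Georgian Mtavruli/Mkhedruli mappings require a Python built against Unicode 11.0 data or later, i.e. Python 3.7+). The function name mirrors the Ruby method under discussion; this is an illustration of the titlecase(lowercase(...)) approach, not Ruby's actual implementation.

```python
def capitalize(s):
    # First lowercase the *entire* string, then titlecase the eligible
    # (word-initial) letters. str.title() uses the Unicode titlecase
    # mappings, and since tc(mkhedruli) is the letter itself, Georgian
    # input comes out as all-Mkhedruli -- never mixed case.
    return s.lower().title()
```

So `capitalize("hello WORLD")` gives "Hello World", while an all-Mtavruli string lowercases to Mkhedruli and the titlecase step then leaves it alone.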

--Ken



Re: UCD in XML or in CSV?

2018-08-31 Thread Ken Whistler via Unicode




On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote:

For codepoints.net I use that data to stuff everything in a MySQL
database.


Well, for some sense of "everything", anyway. ;-)

People having this discussion should keep in mind a few significant points.

First, the UCD proper isn't "everything", extensive as it is. There are 
also other significant sets of data that the UTC maintains about 
characters in other formats, as well, including the data files 
associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping, 
etc.), UTS #10 (collation), UTR #25 (a set of math-related property 
values), and UTS #51 (emoji-related). The emoji-related data has now 
strayed into the CLDR space, so a significant amount of the information 
about emoji characters is now carried as CLDR tags. And then there is 
various other information about individual characters (or small sets of 
characters) scattered in the core spec -- some in tables, some not, as 
well as mappings to dozens of external standards. There is no actual 
definition anywhere of what "everything" actually is. Further, it is a 
mistake to assume that every character property just associates a simple 
attribute with a code point. There are multiple types of mappings, 
complex relational and set properties, and so forth.


The UTC attempts to keep a fairly clear line around what constitutes the 
"UCD proper" (including Unihan.zip), in part so that it is actually 
possible to run the tools that create the XML version of the UCD, for 
folks who want to consume a more consistent, single-file format version 
of the data. But be aware that that isn't everything -- nor would there 
be much sense in trying to keep expanding the UCD proper to actually 
represent "everything" in one giant DTD.


Second, one of the main obligations of a standards organization is 
*stability*. People may well object to the ad hoc nature of the UCD data 
files that have been added over the years -- but it is a *stable* 
ad-hockery. The worst thing the UTC could do, IMO, would be to keep 
tweaking formats of data files to meet complaints about one particular 
parsing inconvenience or another. That would create multiple points of 
discontinuity between versions -- worse than just having to deal with 
the ongoing growth in the number of assigned characters and the 
occasional addition of new data files and properties to the UCD.


Keep in mind that there is more to processing the UCD than just 
"latest". People who just focus on grabbing the very latest version of 
the UCD and updating whatever application they have are missing half the 
problem. There are multiple tools out there that parse and use multiple 
*versions* of the UCD. That includes the tooling that is used to 
maintain the UCD (which parses *all* versions), and the tooling that 
creates UCD in XML, which also parses all versions. Then there is 
tooling like unibook, to produce code charts, which also has to adapt to 
multiple versions, and bidi reference code, which also reads multiple 
versions of UCD data files. Those are just examples I know off the top 
of my head. I am sure there are many other instances out there that fit 
this profile. And none of the applications already built to handle 
multiple versions would welcome having to permanently build in tracking 
particular format anomalies between specific versions of the UCD.


Third, please remember that folks who come here complaining about the 
complications of parsing the UCD are a very small percentage of a very 
small percentage of a very small percentage of interested parties. 
Nearly everybody who needs UCD data should be consuming it as a 
secondary source (e.g. for reference via codepoints.net), or as a 
tertiary source (behind specialized API's, regex, etc.), or as an end 
user (just getting behavior they expect for characters in applications). 
Programmers who actually *need* to consume the raw UCD data files and 
write parsers for them directly should actually be able to deal with the 
format complexity -- and, if anything, slowing them down to make them 
think about the reasons for the format complexity might be a good thing, 
as it tends to put the lie to the easy initial assumption that the UCD 
is nothing more than a bunch of simple attributes for all the code points.
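As one small illustration of what "raw UCD data" looks like, UnicodeData.txt is a semicolon-delimited file with fifteen fields per record (the field layout is specified in UAX #44). A minimal sketch of parsing one real line -- and even this simplest of the UCD files hides complexity the sketch ignores, such as the First/Last range records:

```python
# Field names per UAX #44; the sample record is the real entry for U+0041.
FIELDS = ("code", "name", "general_category", "ccc", "bidi_class",
          "decomposition", "decimal", "digit", "numeric", "bidi_mirrored",
          "unicode1_name", "iso_comment", "simple_uppercase",
          "simple_lowercase", "simple_titlecase")

def parse_line(line):
    # Split the semicolon-delimited record and label the fields;
    # the code point field is hexadecimal.
    record = dict(zip(FIELDS, line.rstrip("\n").split(";")))
    record["code"] = int(record["code"], 16)
    return record

sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
rec = parse_line(sample)
```

Note how many fields are empty with defaulted values, and that the titlecase field being empty means "same as the uppercase field" -- exactly the kind of convention a naive parser gets wrong.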


--Ken



Re: Private Use areas

2018-08-21 Thread Ken Whistler via Unicode


On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote:

On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote:

On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:

Is there a block of RTL PUA also?

No.

Perhaps there should be?


This is a periodic suggestion that never goes anywhere--for good reason. 
(You can search the email archives and see that it keeps coming up.)


Presuming that this question was asked in good faith...



What about designating a part of the PUA to have a specific property?


The problem with that is that assigning *any* non-default property to 
any PUA code point would break existing implementations' assumptions 
about PUA character properties and potentially create havoc with 
existing use.



Only certain properties matter enough:


That is an un-demonstrated assertion that I don't think you have thought 
through sufficiently.



* wide
* RTL


RTL is not some binary counterpart of LTR. There are 23 values of 
Bidi_Class, and anyone who wanted to implement a right-to-left script in 
PUA might well have to make use of multiple values of Bidi_Class. Also, 
there are two major types of strong right-to-leftness: Bidi_Class=R and 
Bidi_Class=AL. Should a "RTL PUA" zone favor Arabic type behavior or 
non-Arabic type behavior?



* combining


Also not a binary switch. Canonical_Combining_Class is a numeric value, 
and any value but ccc=0 for a PUA character would break normalization. 
Then for the General_Category, there are three types of "marks" that 
count as combining: gc=Mn, gc=Mc, gc=Me. Which of those would be favored 
in any PUA assignment?



as most others are better represented in the font itself.


Really? Suppose someone wants to implement a bicameral script in PUA. 
They would need case mappings for that, and how would those be "better 
represented in the font itself"? Or how about digits? Would numeric 
values for digits be "better represented in the font itself"? How about 
implementation of punctuation? Would segmentation properties and 
behavior be "better represented in the font itself"?




This could be done either by parceling one of existing PUA ranges: planes 15
and 16 are virtually unused thus any damage would be negligible;


That is simply an assertion -- and not the kind of assertion that the 
UTC tends to accept on spec. I rather suspect that there are multiple 
participants on this email list, for example, who *do* have 
implementations making extensive use of Planes 15/16 PUA code points for 
one thing or another.



  or perhaps
by allocating a new range elsewhere.

See:

https://www.unicode.org/policies/stability_policy.html

The General_Category property value Private_Use (Co) is immutable: the 
set of code points with that value will never change.


That guarantee has been in place since 1996, and is a rule that binds 
the UTC. So nope, sorry, no more PUA ranges.

Meow!


Grrr! ;-)

As I see it, the only feasible way for people to get specialized 
behavior for PUA ranges involves first ceasing to assume that somehow 
they can jawbone the UTC into *standardizing* some ranges for some 
particular use or another. That simply isn't going to happen. People who 
assume this is somehow easy, and that the UTC are a bunch of boneheads 
who stand in the way of obvious solutions, do not -- I contend -- 
understand the complicated interplay of character properties, stability 
guarantees, and implementation behavior baked into system support 
libraries for the Unicode Standard.


The way forward for folks who want to do this kind of thing is:

1. Define a *protocol* for reliable interchange of custom character 
property information about PUA code points.


2. Convince more than one party to actually *use* that protocol to 
define sets of interchangeable character property definitions.


3. Convince at least one implementer to support that protocol to create 
some relevant interchangeable *behavior* for those PUA characters.


And if the goal for #3 is to get some *system* implementer to support 
the protocol in widespread software, then before starting any of #1, #2, 
or #3, you had better start instead with:


0. Create a consortium (or other ongoing organization) with a 10-year 
time horizon and participation by at least one major software 
implementer, to define, publicize, and advocate for support of the 
protocol. (And if you expect a major software implementer to 
participate, you might need to make sure you have a business case 
defined that would warrant such a 10-year effort!)
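Purely as a hypothetical sketch of what step 1 might look like -- this declaration format and these property choices are invented for illustration, and no such protocol exists -- the core idea is a shared table mapping PUA code points to overriding property values, with everything else falling back to the standard PUA defaults:

```python
# Invented example declarations -- not any real standard or protocol.
PUA_DECLARATIONS = {
    0xF0000: {"Bidi_Class": "R", "General_Category": "Lo"},
    0xF0001: {"General_Category": "Mn", "Canonical_Combining_Class": 230},
}

# The standard defaults that real implementations assume for PUA code points.
PUA_DEFAULTS = {
    "Bidi_Class": "L",
    "General_Category": "Co",
    "Canonical_Combining_Class": 0,
}

def pua_property(cp, prop):
    # An agreeing implementation consults the declarations first and
    # falls back to the standard PUA defaults otherwise.
    return PUA_DECLARATIONS.get(cp, {}).get(prop, PUA_DEFAULTS[prop])
```

The hard parts, of course, are not the lookup table but steps 2 and 3: getting multiple parties to exchange such declarations and at least one renderer to honor them.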


--Ken



Re: Private Use areas

2018-08-20 Thread Ken Whistler via Unicode




On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote:
Is there a block of RTL PUA also? 


No.

--Ken


Re: Tales from the Archives

2018-08-20 Thread Ken Whistler via Unicode

Steffen noted:


On 8/20/2018 3:22 PM, Steffen Nurpmeso via Unicode wrote:

It was just that i have read on one of the mailing-lists i am
subscribed to a cite of a Unicode statement that i have never read
of anything on the Unicode mailing-list.  It is very awkward, but
i _again_ cannot find what attracted my attention, even with the
help of a search machine.  I think "faith alone will reveal the
true name of shuruq" (1997-07-18).

--steffen


Fortunately, since I collect everything, this one has not been lost to 
the mists of history yet. So here you go, another "tale from the 
archives", aka "every
character has a story".

--Ken

===

From kenw Thu Sep 18 14:23 PDT 1997
Date: Thu, 18 Sep 1997 14:20:29 -0700
From: kenw (Kenneth Whistler)
Message-Id: <9709182120.aa16...@birdie.sybase.com>
To: unicode@unicode.org
Subject: War over 'shuruq' narrowly averted
Cc: kenw


Dateline: Geneva, Thursday, September 18, 1997

The ISOnominalists and the SInominalists met today at
the bargaining table in their long-running dispute over
whether the correct name of U+05BC should be:

HEBREW POINT DAGESH OR MAPIQ (shuruq)

or

HEBREW POINT DAGESH OR MAPIQ OR SHURUQ

After considerable posturing and threats by both sides,
opposing camps reluctantly agreed that a compromise
solution was preferable to open flamewar. Unnamed sources
state that the new name to be revealed in a press
conference this evening is:

HEBREW POINT DAGESH OR MAPIQ (or shuruq)

Both sides have also now agreed to focus their attention
jointly at countering the antinomianist camp, which claims
that no names can be imposed by human moral strictures,
and that faith alone will reveal the true name of
shuruq.

=



Re: Tales from the Archives

2018-08-20 Thread Ken Whistler via Unicode

Steffen,

Are you looking for the Unicode list email archives?

https://www.unicode.org/mail-arch/

Those contain list content going back all the way to 1994.

--Ken


On 8/20/2018 6:08 AM, Steffen Nurpmeso via Unicode wrote:

I have the impression that many things which have been posted here
some years ago are now only available via some Forums or other
browser based services.  What is posted here seems to be mostly
a duplicate of the blog only.




Re: UAX #9: applicability of higher-level protocols to bidi plaintext

2018-07-18 Thread Ken Whistler via Unicode



On 7/18/2018 6:43 AM, philip chastney via Unicode wrote:

there are also contexts where "Hello World!" can be read as
the function "Hello", applied to the factorial value of "World"

even though such a move wouldn't necessarily remove all ambiguity,
the easiest solution is to declare that formal notations cannot be "plain" text



Of course they can -- and (usually) should be, as they are designed that 
way. To state otherwise would just create headaches for designing 
parsers for formal notations.


I think you are confusing ambiguity of *interpretation* of bits of 
formal notation, taken out of context, with ambiguity of *display* of 
formal notations in contexts where one does not know and control the 
paragraph directionality.


The easiest (and correct) solution, when displaying formal notation for 
visual interpretation by human readers, is to use tools where one knows 
and can rely on the paragraph directionality explicitly, so that Unicode 
bidi doesn't add an out-of-left-field set of display conundrums, as it 
were, for bidi edge cases that can result in *mis*interpretation by the 
reader.


In other words, if I am trying to read C program text or regex 
expressions, I expect that my tooling is not going to silently assume a 
RTL paragraph directional context and present me with visual garbage to 
interpret, forcing me to reverse engineer the bidi algorithm in my head, 
just to read the text. Why would I put up with that?


--Ken



Re: UAX #9: applicability of higher-level protocols to bidi plaintext

2018-07-16 Thread Ken Whistler via Unicode



On 7/16/2018 3:51 PM, Shai Berger via Unicode wrote:

And I should add, in response to the other points raised in this
thread, from the same page in the core standard: "If the same plain text
sequence is given to disparate rendering processes, there is no
expectation that rendered text in each instance should have the same
appearance. Instead, the disparate rendering processes are simply
required to make the text legible according to the intended reading."
That paragraph ends with the following summary, emphasized in the
source:

Plain text must contain enough information to permit the text
to be rendered legibly, and nothing more.

The last answer in http://www.unicode.org/faq/bidi.html violates this
dictum, as I have shown here with different examples. As long as it
stands, the Unicode standard fails its own criteria.


I've been trying to follow your reasoning in this long thread, but am 
still not finding much to convince me that there is anything wrong in 
the #bidi8 FAQ entry that you keep claiming is wrong.

First, for your "Hello, world!" example, in a rendering that imposes a 
RTL directional context, the correct, conformant display of that string is:

!Hello, world

as you cited in your earlier example. To do otherwise would represent a 
*non*-conformant implementation of the UBA.

So your complaint seems to boil down to the claim that if you transmit 
"Hello, world!" to a process which then renders it conformantly 
according to the Unicode Standard (including UBA), then that process 
must somehow know *and honor* your intent that it display in a LTR 
directional context. That information, however, is explicitly *not* 
contained in the plain text string there, and has to be conveyed by 
means of a higher-level protocol. (E.g. HTML markup as dir="ltr", etc.)

If the receiving process, by whatever means, has raised its hand and 
says, effectively, "I assume a RTL context for all text display", that 
is its right. You can't complain if it displays your "Hello, world!" as 
shown above. Well, you *can* complain, but you wouldn't be correct. 
Basically, you and the receiving process do not share the same 
assumptions about the higher-level protocol involved, which specifies 
paragraph direction.

So as I see it, you are either wanting the plain text to somehow contain 
and enforce upon the renderer your assumption about the directional 
context that it should be displayed in, OR you are just unhappy about 
the bidirectional rendering conundrums of some edge cases for the UBA. 
In either case, the remedy is the application of LTR characters to 
provide context (or directional isolate controls, or explicit 
higher-level markup).
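For example, wrapping a string in the Unicode directional isolate controls (one of the remedies mentioned above) pins its base direction regardless of the receiving paragraph context. A small sketch, with a helper name of my own choosing:

```python
LRI = '\u2066'   # LEFT-TO-RIGHT ISOLATE
RLI = '\u2067'   # RIGHT-TO-LEFT ISOLATE
PDI = '\u2069'   # POP DIRECTIONAL ISOLATE

def isolate(s, rtl=False):
    # Embed s as a directional isolate, so a conformant UBA implementation
    # lays it out with the stated base direction even inside an
    # opposite-direction paragraph.
    return (RLI if rtl else LRI) + s + PDI

greeting = isolate("Hello, world!")   # renders LTR in any paragraph context
```

The isolate characters are invisible, so the plain text content is unchanged for readers; but unlike markup, they do travel with the string, which is precisely the trade-off being argued about in this thread.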

--Ken



Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-29 Thread Ken Whistler via Unicode




On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote:

How would one know that they are misapplied?  And what if the author of
the text has broken your rules? Are such texts never to be transcribed
to pukka Unicode?


Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, 
Script=Latin) doesn't automatically make the Tamil vowel "inherit" the 
Latin script property value, nor should it.


That said, if someone decides they want that sequence, and their text has 
"broken my rules", so be it. I'm just not going to assume anything 
particular about that text. Note that in terms of trying to determine 
whether such a string is (naively) alphabetic, such a sequence doesn't 
interfere with the determination. On the other hand, a process concerned 
about text runs, script assignment, validity for domains, or other such 
issues *will* be sensitive to such a boundary -- and should not be 
overruled by some generic determination that combining marks inherit all 
the properties of their base.





Even without knowing exactly what is wanted, it looks to me as though
it isn't.  If he wants to allow  as a substring, which
he should, then that fails because there is no overlap between
p{extender} and p{gc=Cf} or between p{diacritic} and p{gc=Cf}.


Yes, so if you are working with strings for Indic scripts (or for that 
matter, Arabic), you add Join_Control to the mix:


Alphabetic  ∪ Diacritic ∪ Extender ∪ Join_Control

gets you a decent approximation of what is (naively) expected to fall 
within an "alphabetic" string for most scripts.


For those following along, Alphabetic is roughly meant to cover the ABC, 
かきくけこ,... plus ideographic elements of most scripts. Diacritic picks up 
most of the applied combining marks, including nuktas, viramas, and tone 
marks. Extender picks up spacing elements that indicate length, 
reduplication, iteration, etc. And joiners are, well, joiners.
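As a rough illustration of that union, here is a minimal Python sketch. It comes with a big caveat: the standard library exposes only General_Category, so the Letter categories and the combining-mark categories stand in (imprecisely) for the real Alphabetic, Diacritic, and Extender properties; a faithful implementation would derive its sets from the UCD data files.

```python
import unicodedata

# Stand-ins for the UCD properties, since Python's stdlib exposes only
# General_Category. A real implementation would build these sets from
# PropList.txt and DerivedCoreProperties.txt.
LETTERISH = {"Lu", "Ll", "Lt", "Lu", "Lm", "Lo", "Nl"}  # rough Alphabetic core
MARKS = {"Mn", "Mc"}       # over-broad stand-in for Other_Alphabetic/Diacritic
                           # (enclosing marks, gc=Me, deliberately excluded)
JOIN_CONTROL = {"\u200C", "\u200D"}                     # ZWNJ, ZWJ

def is_wordlike(s: str) -> bool:
    """Naive 'alphabetic string' test in the spirit of
    Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control."""
    return bool(s) and all(
        unicodedata.category(ch) in LETTERISH
        or unicodedata.category(ch) in MARKS
        or ch in JOIN_CONTROL
        for ch in s
    )

# Tamil "tamizh" ends in the pulli (U+0BCD, gc=Mn), which a letters-only
# check rejects but this union accepts.
print(is_wordlike("தமிழ்"))   # True
print(is_wordlike("a b"))     # False: the space is gc=Zs
```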


If one wants finer categorization specifically for Indic scripts, then I 
would suggest turning to the Indic_Syllabic_Category property instead of 
a union of PropList.txt properties and/or some twiddling with 
General_Category values.


--Ken





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Ken Whistler via Unicode




On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote:
One of the general principles is that combining marks inherit the 
property of their base character.


Normally, "inherited" should be the only property value for combining 
marks.


There have been some deviations from this over the years, for various 
reasons, and there are some properties (such as general category) 
where it is necessary to recognize the character as combining, but the 
general principle still holds.


Therefore, if you are trying to see whether a string is alphabetic, 
combining marks should be "transparent" to such an algorithm.


Generally, good advice. But there are clear exceptions. For example, the 
enclosing combining marks for symbols are intended (basically) to make 
symbols of a sort. And many combining marks have explicit script 
assigments, so they cannot simply willy-nilly inherit the script of a 
base letter if they are misapplied, for example.


This is why I recommend simply adding the Diacritic property into the 
mix for testing a string. That is a closer approximation to the kind of 
naive "Is this string alphabetic?" question that SundaraRaman was asking 
about -- it picks up the correct subset of combining marks to union with 
the set of actual isAlphabetic characters, to produce more expected 
results. (Including, of course, the correct classification of all the 
viramas, stackers, and killers, as well as picking up all the nuktas.)


Folks, please examine the sets of characters for Diacritic and for
Extender in:


http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

to see what I'm talking about. The stuff you are looking for is already 
there.
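For anyone who wants to do that programmatically: the PropList.txt format (documented in UAX #44) is simple enough to parse in a few lines. The sketch below hardcodes a few sample lines copied from the file rather than fetching it; a real run would read the downloaded file instead.

```python
# Sample lines in PropList.txt format (field layout per UAX #44).
SAMPLE = """
00B7          ; Diacritic # Po       MIDDLE DOT
0953..0954    ; Diacritic # Mn   [2] DEVANAGARI STRESS SIGN UDATTA..DEVANAGARI STRESS SIGN ANUDATTA
0BCD          ; Diacritic # Mn       TAMIL SIGN VIRAMA
3005          ; Extender  # Lm       IDEOGRAPHIC ITERATION MARK
"""

def parse_proplist(text):
    """Return {property_name: [(start, end), ...]} from PropList-format text."""
    props = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        cps, prop = (field.strip() for field in line.split(";"))
        start, _, end = cps.partition("..")    # single code point or range
        props.setdefault(prop, []).append((int(start, 16), int(end or start, 16)))
    return props

def has_prop(props, prop, ch):
    return any(lo <= ord(ch) <= hi for lo, hi in props.get(prop, ()))

props = parse_proplist(SAMPLE)
print(has_prop(props, "Diacritic", "\u0BCD"))   # True: TAMIL SIGN VIRAMA
```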


--Ken

P.S. And please don't start an argument about the fact that a "virama" 
isn't really a "diacritic". We know that, too. ;-)





Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?

2018-05-28 Thread Ken Whistler via Unicode



On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote:

Hello Sundar,

On 2018/05/28 04:27, SundaraRaman R via Unicode wrote:

Hi,

In languages like Ruby or Java
(https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), 


functions to check if a character is alphabetic do that by looking for
the 'Alphabetic'  property (defined true if it's in one of the L
categories, or Nl, or has 'Other_Alphabetic' property). When parsing
Tamil text, this works out well for independent vowels and consonants
(which are in Lo), and for most dependent signs (which are in Mc or Mn
but have the 'Other_Alphabetic' property), but the very common pulli 
(VIRAMA)

is neither in Lo nor has 'Other_Alphabetic', and so leads to
concluding any string containing it to be non-alphabetic.

This doesn't make sense to me since the Virama “◌்” is as much of an
alphabetic character as any of the "Dependent Vowel" characters which
have been given the 'Other_Alphabetic' property. Is there a rationale
behind this difference, or is it an oversight to be corrected?


I suggest submitting an error report via 
https://www.unicode.org/reporting.html. I haven't studied the issue in 
detail (sorry, just no time this week), but it sounds reasonable to 
give the VIRAMA the 'Other_Alphabetic' property.


Please don't. This is not an error in the Unicode property assignments, 
which have been stable in scope for Alphabetic for some time now.


The problem is in assuming that the Java or Ruby isAlphabetic() API, 
which simply reports the Unicode property value Alphabetic for a 
character, suffices for identifying a string as somehow "wordlike". It 
doesn't.


The approximation you are looking for is to add Diacritic to Alphabetic. 
That will automatically pull in all the nuktas and viramas/killers for 
Brahmi-derived scripts. It also will pull in the harakat for Arabic and 
similar abjads, which are also not Alphabetic in the property values. 
And it will pull in tone marks for various writing systems.


For good measure, also add Extender, which will pick up length marks and 
iteration marks.
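The failure mode under discussion is easy to reproduce. Python's str.isalpha(), for instance, tests only the Letter general categories (narrower even than the Alphabetic property Java's isAlphabetic() reports), so the pulli sinks the whole string:

```python
import unicodedata

pulli = "\u0BCD"   # TAMIL SIGN VIRAMA

print(unicodedata.name(pulli))       # TAMIL SIGN VIRAMA
print(unicodedata.category(pulli))   # Mn -- a nonspacing mark, not a letter
print(pulli.isalpha())               # False: isalpha() covers only gc=L*
print("தமிழ்".isalpha())             # False: the marks fail the letters-only test
```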


Please do not assume that the Alphabetic property just automatically 
equates to "what I would write in a word". Or that it should be adjusted 
to somehow make that happen. It would be highly advisable to study *all* 
the UCD properties in more depth, before starting to report bugs in one 
or another simply because using a single property doesn't produce the 
string classification one assumes should be correct in a particular case.


Of course, to get a better approximation of what actually constitutes a 
"word" in a particular writing system, instead of using raw property 
API's, one should be using a WordBreak iterator, preferably one tailored 
for the language in question.


--Ken




I'd recommend to mention examples other than Tamil in your report 
(assuming they exist).


BTW, what's the method you are using in Ruby? If there's a problem in 
Ruby (which I don't think; it's just using Unicode data), then please 
make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I 
should be able to follow up on that.


Regards,   Martin.





Re: Major vendors changing U+1F52B PISTOL  depiction from firearm to squirt gun

2018-05-23 Thread Ken Whistler via Unicode


On 5/23/2018 8:53 AM, Abe Voelker via Unicode wrote:
As a user I find it troublesome because previous messages I've sent 
using this character on these platforms may now be interpreted 
differently due to the changed representation. That aspect has me 
wondering if this change is in line with Unicode standard conformance 
requirements.




The Unicode Standard publishes only *text presentation* (black and 
white) representative glyphs for emoji characters. And those text 
presentation glyphs have been quite stable in the standard. For U+1F52B 
PISTOL, the glyph currently published in Unicode 10.0 (and the one which 
will be published imminently in Unicode 11.0) is precisely the same as 
the glyph that was initially published nearly 8 years ago in Unicode 
6.0. Care to check up on that?


Unicode 6.0: https://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F300.pdf

Unicode 11.0: https://www.unicode.org/charts/PDF/Unicode-11.0/U110-1F300.pdf

What vendors do for their colorful *emoji presentation* glyphs is 
basically outside the scope of the Unicode Standard. Technically, it is 
outside the scope even of the separate Unicode Technical Standard #51, 
Unicode Emoji, which specifies data, behavior, and other mechanisms for 
promoting interoperability and valid interchange of emoji characters and 
emoji sequences, but which does *not* try to constrain vendors in their 
emoji glyph designs.


Now, sure, nobody wants their emoji for an avocado to willy-nilly turn 
into a completely unrelated emoji for a crying face. But many emoji are 
deliberately vague in their scope of denotation and connotation, and the 
vendors have a lot of leeway to design little images that they like and 
their customers like. And the Unicode Standard does not now and probably 
never will try to define and enforce precise semantics and usage rules 
for every single emoji character.


Basically, it is a fool's game to be using emoji as if they were a 
well-defined and standardized pictographic orthography with unchanging 
semantics. If you want stable presentation of content, use a pdf 
document or an image. If you want stable and accurate conveyance of 
particular meaning -- well, write it out in the standard orthography of 
a particular language. If you want playful and emotional little 
pictographs accompanying text, well, then don't expect either stability 
of the images or the meaning, because that isn't how emoji work. Case in 
point: if you are using U+1F351 PEACH for its well-known resemblance to 
a bum, well, don't complain to the Unicode Consortium if a phone vendor 
changes the meaning of your message by redesigning its emoji glyph for 
U+1F351 to a cut peach slice that more resembles a smile.


--Ken




Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols

2018-05-15 Thread Ken Whistler via Unicode



On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote:


I am proposing the addition of 2 new characters to the Musical
Symbols table:

- the half-flat sign (lowers a note by a quarter tone)
- the half-sharp sign (raises a note by a quarter tone)


In an actual proposal, I would expect a discussion of whether you are 
proposing to encode established symbols, or whether you are proposing 
new symbols to be adopted by the community (in which case Unicode 
would probably wait & see if they get established).


A proposal should also show evidence of usage and glyph variations.



And should probably refer to the relationship between these signs and 
the existing:


U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP
U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT

which are also half-sharp or half-flat accidentals.

The wiki on flat signs shows this flat with a crossbar, as well as a 
reversed flat symbol, to represent the half-flat.


And the wiki on sharp signs shows this sharp minus one vertical bar to 
represent the half-sharp.


So there may be some use of these signs in microtonal notation, outside 
of an Arabic context, as well. See:


https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation

--Ken



Re: Is the Editor's Draft public?

2018-04-20 Thread Ken Whistler via Unicode

Henri,

There is no formal concept of a public "Editor's Draft" for the Unicode 
core specification. This is mostly the result of the tools used for 
editing the core specification, which is still structured more like a 
book than the usual online internet specification.


Currently the Unicode editors are finishing up the 11.0 core 
specification editing -- and the chapters for that will be available in 
June, 2018, as noted on the current draft of the Unicode 11.0 page. 
There is no Version 12.0 "Editor's Draft" right now; instead, work on 
the 12.0 core specification will start once the 11.0 chapters have been 
frozen and published.


If you have feedback on the core specification, the best thing to do is 
simply to submit it now as part of the current 11.0 beta review, 
referring to the published 10.0 core specification text. If it is a 
small item, such as a typo, there is always the possibility that it has 
already been reported and fixed, of course -- but it won't hurt to 
report and check. Suggestions for larger changes in the text will be 
added to the pile for future consideration by the UTC and the editors, 
and likely would be taken up for the 12.0 core specification.


--Ken


On 4/20/2018 3:14 AM, Henri Sivonen via Unicode wrote:

Thank you. I checked this review announcement (I should have said so
in my email; sorry), but it leads me to
https://unicode.org/versions/Unicode11.0.0/  which says the chapters
will be "Available June 2018". But even if the 11.0 chapters were
available, I'd expect there to exist an Editor's Draft that's now in a
post-11.0 but pre-12.0 state.

I guess I should just send my comments and take the risk of my
concerns already having been addressed.




Re: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode

2018-04-02 Thread Ken Whistler via Unicode



On 4/2/2018 7:02 PM, Philippe Verdy via Unicode wrote:
We're missing the definition of "ymojis", a safer alternatives of 
"umojis" (unknown), but that "you" can create yourself for use by 
yourself 


Not to mention "əmojis", as in "Uh, Moe! Jeez, why are we still talking 
about this?!"


--Ken



Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-09 Thread Ken Whistler via Unicode



On 3/9/2018 9:29 AM, via Unicode wrote:
Documented increase such as scientific terms for new elements, flora 
and fauna, would seem to be not more than one or two dozen a year. 


Indeed. Of the "urgently needed characters" added to the unified CJK 
ideographs for Unicode 11.0, two were obscure place name characters 
needed to complete mapping for the Japanese IT mandatory use of the Moji 
Joho collection.


The other three were newly standardized Chinese characters for 
superheavy elements that now have official designations by the IUPAC (as 
of December 2015): Nihonium (113), Tennessine (117) and Oganesson (118). 
The Chinese characters coined for those 3 were encoded at U+9FED, 
U+9FEC, and U+9FEB, respectively.


Oganesson, in particular, is of interest, as the heaviest known element 
produced to date. It is the subject of 1000's of hours of intense 
experimentation and of hundreds of scientific papers, but:


   ... since 2005, only five (possibly six) atoms of the nuclide ²⁹⁴Og
   have been detected.


But we already have a Chinese character (pronounced ào) for Og, and a 
standardized Unicode code point for it: U+9FEB.


Next up: unobtanium and hardtofindium

--Ken



Re: Translating the standard

2018-03-09 Thread Ken Whistler via Unicode



On 3/9/2018 6:58 AM, Marcel Schneider via Unicode wrote:

As of translating the Core spec as a whole, why did two recent attempts crash 
even
before the maintenance stage, while the 3.1 project succeeded?


Essentially because both the Japanese and the Chinese attempts were 
conceived of as commercial projects, which ultimately did not cost out 
for the publishers, I think. Both projects attempted limiting the scope 
of their translation to a subset of the core spec that would focus on 
East Asian topics, but the core spec is complex enough that it does not 
abridge well. And I think both projects ran into difficulties in trying 
to figure out how to deal with fonts and figures.


The Unicode 3.0 translation (and the 3.1 update) by Patrick Andries was 
a labor of love. In this arena, a labor of love is far more likely to 
succeed than a commercial translation project, because it doesn't have 
to make financial sense.


By the way, as a kind of annotation to an annotated translation, people 
should know that the 3.1 translation on Patrick's site is not a straight 
translation of 3.1, but a kind of interpreted adaptation. In particular, 
it incorporated a translation of UAX #15, Unicode Normalization Forms, 
Version 3.1.0, as a Chapter 6 of the translation, which is not the 
actual structure of Unicode 3.1. And there are other abridgements and 
alterations, where they make sense -- compare the resources section of 
the Preface, for example. This is not a knock on Patrick's excellent 
translation work, but it does illustrate the inherent difficulties of 
trying to approach a complete translation project for *any* version of 
the Unicode Standard.


--Ken



Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-07 Thread Ken Whistler via Unicode



On 3/7/2018 1:12 PM, Philippe Verdy via Unicode wrote:
Shouldn't we create a variant of IDS, using combining joiners between 
Han base glyphs (then possibly augmented by variant selectors if there 
are significant differences on the simplification of rendered strokes 
for each component) ? What is really limiting us to do that ?




Ummm ambiguity, lack of precision, complexity of model, pushback by 
stakeholders, likely failure of uptake by most implementers, duplication 
of representation, ...


Do you think combining models of Han weren't already thought of years 
ago? They predated the original encoding of unified CJK in Unicode in 
1992. They weren't viable then, and they aren't viable now, either, 
after 26 years of Unicode implementation of unified CJK as atomic 
ideographs.


--Ken



Translating the standard (was: Re: Fonts and font sizes used in the Unicode)

2018-03-05 Thread Ken Whistler via Unicode


On 3/5/2018 9:03 AM, suzuki toshiya via Unicode wrote:

I have a question; if some people try to make a
translated version of Unicode


And to add to Asmus' response, folks on the list should understand that 
even with the best of effort, the concept of a "translated version of 
Unicode" is a near impossibility. In fairly recent times, two serious 
efforts to translate *just *the core specification -- one in Japanese, 
and a somewhat later attempt for Chinese -- crashed and burned, for a 
variety of reasons. The core specification is huge, contains a lot of 
very specific technical terminology that is difficult to translate, 
along with a large collection of script- and language-specific detail, 
also hard to translate. Worse, it keeps changing, with updates now 
coming out once every year. Some large parts are stable, but it is 
impossible to predict what sections might be impacted by the next year's 
encoding decisions.


That is not including the fact that "the Unicode Standard" now also 
includes 14 separate HTML (or XHTML) annexes, all of which are also 
moving targets, along with the UCD data files, which often contain 
important information in their headers that would also require 
translation. And then, of course, there are the 2000+ pages of the 
formatted code charts, which require highly specific and very 
complicated custom tooling and font usage to produce.


It would require a dedicated (and expensive) small army of translators, 
terminologists, editors, programmers, font designers, and project 
managers to replicate all of this into another language publication -- 
and then they would have to do it again the next year, and again the 
next year, in perpetuity. Basically, given the current situation, it 
would be a fool's errand, more likely to introduce errors and 
inconsistencies than to help anybody with actual implementation.


People who want accessibility to the Unicode Standard in other languages 
need to scale down their expectations considerably, and focus on 
preparing reasonably short and succinct introductions to the terminology 
and complexity involved in the full standard. Such projects are 
feasible. But a full translation of "the Unicode Standard" simply is not.


--Ken


CJK Ideograph Encoding Velocity (was: Re: Unicode Emoji 11.0 characters now ready for adoption!)

2018-03-05 Thread Ken Whistler via Unicode

John,

I think this may be giving the list a somewhat misleading picture of the 
actual statistics for encoding of CJK unified ideographs. The "500 
characters a year" or "1000 characters a year" limits are administrative 
limits set by the IRG for national bodies (and others) submitting 
repertoire to the "working set" that the IRG then segments into chunks 
for processing to prepare new increments for actual encoding.


In point of fact, if we take 1991 as the base year, the *average* rate 
of encoding new CJK unified ideographs now stands at 3379 per annum 
(87,860 as of Unicode 10.0). By "encoding" here, I mean, final, finished 
publication of the encoded characters -- not the larger number of 
potentially unifiable submissions that eventually go into a publication 
increment. There is a gradual downward drift in that number over time, 
because of the impact on the stats of the "big bang" encoding of 42,711 
ideographs for Extension B back in 2001, but recently, the numbers have 
been quite consistent with an average incremental rate of about 3000 new 
ideographs per year:


5762 added for Extension E in 2015

7463 added for Extension F in 2017

~ 4934 to be added for Extension G, probably to be published in 2020

If you run the average calculation including Extension G, assuming 2020, 
you end up with a cumulative per annum rate of 3200, not much different 
than the calculation done as of today.
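A quick sanity check of the arithmetic behind those per-annum figures, using the counts cited above:

```python
# Back-of-the-envelope check of the encoding-rate figures in this post.
total_10_0 = 87860   # CJK unified ideographs as of Unicode 10.0 (2017)
ext_g = 4934         # approximate size of Extension G (expected 2020)

rate_2017 = total_10_0 / (2017 - 1991)            # base year 1991
rate_2020 = (total_10_0 + ext_g) / (2020 - 1991)

print(round(rate_2017))   # 3379
print(round(rate_2020))   # 3200
```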


And as for the implication that China, in particular, is somehow limited 
by these numbers, one should note that the vast majority of Extension G 
is associated with Chinese sources. Although a substantial chunk is 
formally labeled with a "UK" source this time around, almost all of 
those characters represent a roll-in of systematic simplifications, of 
various sorts, associated with PRC usage. (People who want to check can 
take a look at L2/17-366R in the UTC document registry.)


--Ken


On 3/5/2018 7:13 AM, via Unicode wrote:

Dear All,

to simplify discussion I have split the points. 

Re: Bidi edge cases in Hangul and Indic

2018-02-22 Thread Ken Whistler via Unicode

David,


On 2/22/2018 7:21 PM, David Corbett via Unicode wrote:

My confusion stems from Unicode’s online bidi utility.


That bidi utility has known defects in it. It is not yet conformant with 
changes to UBA 6.3, let alone later changes to UBA. And the mapping of 
memory position to display position in that utility does not take into 
account complex mapping that has to occur in the layout engines and 
fonts in real applications.


--Ken


Re: IDC's versus Egyptian format controls

2018-02-16 Thread Ken Whistler via Unicode



On 2/16/2018 11:00 AM, Asmus Freytag via Unicode wrote:

On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote:

That doesn't square well with, "An implementation *may* render a valid
Ideographic Description Sequence either by rendering the individual
characters separately or by parsing the Ideographic Description
Sequence and drawing the ideograph so described." (TUS 10.0 p704, in
Section 18.2)


Emphasis on the "may". In point of fact, no widespread layout engine or 
set of fonts does parse IDS'es to turn them into single ideographs for 
display. That would be a highly specialized display.




Should we ask to make the default behavior (visible IDS characters) 
more explicit?


Ask away.

--Ken



I don't mind allowing the other as an option (it's kind of the reverse 
of the "show invisible"

mode, which we also allow, but for which we do have a clear default).




Re: IDC's versus Egyptian format controls

2018-02-16 Thread Ken Whistler via Unicode


On 2/16/2018 8:22 AM, Ken Whistler wrote:
The Egyptian quadrat controls, on the other hand, are full-fledged 
Unicode format controls.


One more point of distinction: The (gc=So) IDC's follow a syntax that 
uses Polish notation order for the descriptive operators (inherited from 
the intended use in GB 18030, where these came from in the first place). 
That order minimizes ambiguity of representation without requiring 
bracketing, but it has the disadvantage of being hard for humans to 
interpret easily in complicated cases.


The Egyptian format controls use an infix notation, instead. That 
follows current Egyptologists' practice of representing quadrats with 
MdC conventions. It is also a better order for the layout engine 
processing. The disadvantage is that it requires a bracketing notation 
to deal with ambiguities of operator precedence in complicated cases.
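The practical difference is easy to see in code: because every IDC has a fixed arity, a prefix-notation IDS parses in a single left-to-right pass with no brackets. A minimal sketch, using the IDC repertoire of the era (U+2FF0..U+2FFB, with ⿲ and ⿳ taking three components and the rest two):

```python
# Arity of each ideographic description character.
ARITY = {
    "\u2FF2": 3, "\u2FF3": 3,   # ⿲ and ⿳ take three components
    # the remaining IDCs (U+2FF0..U+2FF1, U+2FF4..U+2FFB) take two
    **{chr(cp): 2 for cp in [*range(0x2FF0, 0x2FF2), *range(0x2FF4, 0x2FFC)]},
}

def parse_ids(chars):
    """Parse one Ideographic Description Sequence into a nested tuple."""
    ch = next(chars)
    if ch in ARITY:
        return (ch, *[parse_ids(chars) for _ in range(ARITY[ch])])
    return ch   # a unified ideograph or radical: a leaf

# ⿰氵每 describes 海 ("sea"): water radical to the left of 每.
tree = parse_ids(iter("\u2FF0\u6C35\u6BCF"))
print(tree)   # ('⿰', '氵', '每')
```

No generic layout process is expected to assemble the described character, as the post above notes; the parse tree just makes the description's structure explicit.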


--Ken



IDC's versus Egyptian format controls (was: Re: Why so much emoji nonsense?)

2018-02-16 Thread Ken Whistler via Unicode

On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote:


A more portable solution for ideographs is to render an Ideographic
Description Sequences (IDS) as approximations to the characters they
describe.  The Unicode Standard carefully does not prohibit so doing,
and a similar scheme is being developed for blocks of Egyptian
Hieroglyphs, and has been proposed for Mayan as well.


A point of clarification: The IDC's (ideographic description characters) 
are explicitly *not* format controls. They are visible graphic symbols 
that sit visibly in text. There is a specified syntax for stringing them 
together into sequences with ideographic characters and radicals to 
*suggest* a specific form of CJK (or other ideographic) character 
assembled from the pieces in a certain order -- but there is no 
implication that a generic text layout process *should* attempt to 
assemble that described character as a single glyph. IDC's are a 
*description* methodology. IDC's are General_Category=So.


The Egyptian quadrat controls, on the other hand, are full-fledged 
Unicode format controls. They do not just describe hieroglyphic quadrats 
-- they are intended to be implemented in text format software and 
OpenType fonts to actually construct and display fully-formed quadrats 
on the fly. They will be General_Category=Cf. Mayan will work in a 
similar manner, although the specification of the sign list and exact 
required set of format controls is not yet as mature as that for Egyptian.


--Ken



Re: Why so much emoji nonsense?

2018-02-15 Thread Ken Whistler via Unicode



On 2/15/2018 2:24 PM, Philippe Verdy via Unicode wrote:
And it's in the mission of Unicode, IMHO, to promote litteracy 


Um, no. And not even literacy, either. ;-)

https://en.wikipedia.org/wiki/Category:Organizations_promoting_literacy

--Ken




Re: Why so much emoji nonsense?

2018-02-14 Thread Ken Whistler via Unicode



On 2/14/2018 12:49 PM, Philippe Verdy via Unicode wrote:



RCLLTHTWHNLPHBTSWRFRSTNVNTDPPLWRTTXTLKTHS !




[ ... lots to say about the history of writing ... ]

And the use (or abuse) of emojis is returning us to the prehistory 
when people draw animals on walls of caverns: this was a very slow 
communication, not giving a rich semantic, full of ambiguities about 
what is really meant, and in fact a severe loss of knowledge where 
people will not communicate easily and rapidly.


=-O Perhaps Philippe was missing my point about how and why emoji are 
actually used.


--Ken



Re: Why so much emoji nonsense?

2018-02-14 Thread Ken Whistler via Unicode


On 2/14/2018 12:53 AM, Erik Pedersen via Unicode wrote:

Unlike text composed of the world’s traditional alphabetic, syllabic, abugida 
or CJK characters, emoji convey no utilitarian and unambiguous information 
content.


I think this represents a misunderstanding of the function of emoji in 
written communication, as well as a rather narrow concept of how writing 
systems work and why they have evolved.


RECALLTHATWHENALPHABETSWEREFIRSTINVENTEDPEOPLEWROTETEXTLIKETHIS

The invention and development of word spacing, punctuation, and casing, 
among other elements of typography, represent the addition of meta-level 
information to written communication that assists in legibility, helps 
identify lexical and syntactic units, conveys prosody, and other 
information that is not well conveyed by simply setting down letters of 
an alphabet one right after the other.


Emoticons were invented, in large part, to fill another major hole in 
written communication -- the need to convey emotional state and 
affective attitudes towards the text. This is the kind of information 
that face-to-face communication has a huge and evolutionarily deep 
bandwidth for, but which written communication typically fails miserably 
at. Just adding a little happy face :-) or sad face :-( to a short email 
manages to convey some affect much more easily and effectively than 
adding on entire paragraphs trying to explain how one feels about what 
was just said. Novelists have the skill to do that in text without using 
little pictographic icons, but most of us are not professional writers! 
Note that emoticons were invented almost as soon as people started 
communicating in digital mediums like email -- so long predate anything 
Unicode came up with.


Other kinds of emoji that we've been adding recently may have a somewhat 
more uncertain trajectory, but the ones that seem to be most successful 
are precisely those which manage to connect emotionally with people, and 
which assist them in conveying how they *feel* about what they are writing.


So I would suggest that people not just dismiss (or diss) this ongoing 
phenomenon. Emoji are widely used for many good reasons. And of course, 
like any other aspect of writing, get mis-used in various ways, as well. 
But you can be sure that their impact on the evolution of world writing 
is here to stay and will be the topic of serious scholastic papers by 
scholars of writing for decades to come. ;-)


--Ken




Re: Word_Break for Hieroglyphs

2017-12-14 Thread Ken Whistler via Unicode

Gentlemen,


On 12/14/2017 6:53 AM, Mark Davis ☕️ via Unicode wrote:
Thus I would like people who are both knowledgeable about hieroglyphs 
/and/ Unicode properties to weigh in. I know that people like Andrew 
Glass are on this list, who satisfy both criteria.

​
And what constitutes a cluster?


This entire discussion is premature. The model for Egyptian is in flux 
right now. What constitutes a "quadrat", which is significantly relevant 
to any determination of how other segmentation properties should work 
for Egyptian hieroglyphics, will depend on the details of the model and 
how quadrat formation interacts with the exact set of format controls 
eventually agreed upon. See:


http://www.unicode.org/L2/L2017/17112r-quadrat-encoding.pdf

(And please note that that has a reference list of 13 *other* documents. 
This is not simple stuff.)


When we get closure on the Egyptian model, *then* will be the time to 
make suggestions for how Egyptian values for GCB, WB, and LB might be 
adjusted for possible better default behavior.


--Ken



Re: Armenian Mijaket (Armenian colon)

2017-12-05 Thread Ken Whistler via Unicode

Asmus,


On 12/5/2017 12:35 PM, Asmus Freytag via Unicode wrote:

I don't know the history of this particular "unification"


Here are some clues to guide further research on the history.

The annotation in question was added to a draft of the NamesList.txt 
file for Unicode 4.1 on October 7, 2003.


The annotation was not yet in the Unicode 4.0 charts, published in 
April, 2003.


That should narrow down the search for everybody. I can't find specific 
mention of this in the UTC minutes from the relevant 2003 window.


But I strongly suspect that the catalyst for the change was the 
discussion that took place regarding PRI #12 re terminal punctuation:


http://www.unicode.org/review/pr-12.html

That document, at least, does mention "Armenian" and U+2024, although 
not in the same breath. That PRI was discussed and closed at UTC #96, on 
August 25, 2003:


http://www.unicode.org/L2/L2003/03240.htm

I don't find any particular mention of U+2024 in my own notes from that 
meeting, so I suspect the proximal cause for the change to the 
annotation for U+2024 on October 7 will have to be dug out of an email 
archive at some point.


--Ken





Re: implicit weight base for U+2CEA2

2017-09-27 Thread Ken Whistler via Unicode



On 9/27/2017 2:19 PM, Markus Scherer via Unicode wrote:
On Wed, Sep 27, 2017 at 1:49 PM, James Tauber via Unicode wrote:


I recently updated pyuca[1], my pure Python implementation of the
Unicode Collation Algorithm to work with 8.0.0, 9.0.0, and 10.0.0
but to get all the tests to work, I had to special case the
implicit weight base for U+2CEA2. The spec seems to suggest the
base should be FB80 but I had to override just that code point to
have a base of FBC0 for the tests to pass.

Is this a known issue with the spec or something I've missed?


2CEA2..2CEAF are unassigned code points for which the UCA+DUCET uses a 
base of FBC0.


markus


And you may have a range error in Extension E to account for the test 
problem.


The relevant section of CollationTest_SHIFTED_SHORT.txt has tests that 
will pass only if:


2B735 < 2B81E < 2CEA2 < 2EBE1 < 2FFFE
Ext C < Ext D < Ext E < Ext F < non-character

Those are *unassigned* characters just past the assigned ranges but 
still in the blocks in each of those CJK extensions. So if you have a 
range error for assigned characters in Extension E, you'd get a failure 
at that point in the test cases.
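The implicit-weight computation involved can be sketched as follows. The bases (FB40 for core Han, FB80 for the CJK extensions, FBC0 for everything else, including unassigned code points) are from UTS #10's Implicit Weights rules as I read them; whether a code point is a Unified_Ideograph, and in which block, really comes from UCD data rather than from the caller as here.

```python
# Sketch of UCA implicit primary weights (UTS #10, "Implicit Weights").
# The caller supplies the Unified_Ideograph / block facts, which a real
# implementation would look up in the UCD.
def implicit_primary(cp, is_unified_ideograph, in_core_block=False):
    if is_unified_ideograph:
        base = 0xFB40 if in_core_block else 0xFB80
    else:
        base = 0xFBC0   # unassigned code points land here
    return (base + (cp >> 15), (cp & 0x7FFF) | 0x8000)

# U+2CEA1 is the last assigned Extension E ideograph; U+2CEA2 sits in the
# same block but is unassigned, so it takes the FBC0 base -- the special
# case pyuca needed.
print(hex(implicit_primary(0x2CEA1, True)[0]))    # 0xfb85
print(hex(implicit_primary(0x2CEA2, False)[0]))   # 0xfbc5
```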


--Ken



Re: IBM 1620 invalid character symbol

2017-09-27 Thread Ken Whistler via Unicode

Ken,


On 9/27/2017 11:10 AM, Ken Shirriff via Unicode wrote:
The IBM type catalog might be of interest. It describes in great 
detail the character sets of the IBM typewriters and line printers and 
the custom characters that can be ordered for printer chains and 
Selectric type balls. Link: 
http://bitsavers.org/pdf/ibm/serviceForConsultants/Service_For_Consultants_198312_Complete/15_Type_Catalog.pdf 





That is a very interesting source, though from a much later era (1983). 
In particular, the "Special Character Nomenclature" (p. 11 of the pdf) 
provides a good list of what the IBM typographers at the time thought 
was the range of special symbols they were working with in this overall 
collection.


Note the presence of the group mark, the record mark, and the segment 
mark. And in the realm of potential "tofu" indicators, there is the open 
box and the OCR blob, but nothing like the 1620 symbol(s) we've been 
talking about.


On another point, the "pillow" noted for the invalid character in the 
IBM 1620-2 (using the Selectric instead of the older IBM typewriter 
model) was almost certainly also not an actual punch on the Selectric 
type ball, but instead implemented by an overstrike of "[" and "]". See, 
e.g., the Pica 72 type style in the catalog noted above, which looks 
like some of the very earliest Selectric type. Its use could well have 
been occasioned by the fact that the slab serif typewriter font would 
have created a muddy blob if you tried to overstrike an "X" and an "I" 
for this output symbol.


--Ken



Re: IBM 1620 invalid character symbol

2017-09-27 Thread Ken Whistler via Unicode

Asmus,

On 9/27/2017 10:02 AM, Asmus Freytag via Unicode wrote:


In that context it's worth remembering that while you could say 
for most typewriters that "the typewriter is the font", there were 
noted exceptions. The IBM Selectric, for example, had exchangeable 
type balls which allowed both a font and / or encoding change. 
(Encoding understood here as association of character to key).


That technology was then only two years in the future.



And in some sense, not even... ;-)

By the 1950's (and probably earlier), enterprising linguists and other 
special users were conspiring with skilled typewriter repair experts to 
customize their manual typewriter keyboards and key strikers with custom 
fonts. I have an example sitting in my office -- an old Olympia manual 
typewriter with custom-cast type replacing the standard punches on some 
of the key strikers, and with custom engraved key caps added to the 
keyboard, to add schwa, eng, open-o, etc. to the typewriter. It also has 
the bottom dot of the colon *filed off* to create a middle dot key. 
Typing an actual colon on that machine requires an "input method" 
consisting of 3 key presses: {period, backspace, middledot}. A couple of 
the keys that have raised accents on them were modified so as to disable 
the platen advance, thereby becoming permanent "dead keys" -- 
effectively emulating the encoding of combining marks. There are 
probably thousands of such customized manual typewriters still sitting 
around, over and beyond the various standard manufactured models.


--Ken


Re: IBM 1620 invalid character symbol

2017-09-27 Thread Ken Whistler via Unicode

Leo,


On 9/26/2017 9:00 PM, Leo Broukhis via Unicode wrote:
The next time I'm at the Mountain View CHM, I'll try to ask. However, 
assuming it was an overstrike of an X and an I, then where does the 
"Eris"-like glyph come from? Was there ever an IBM font with a 
double-semicircular X like )( ?




The reason for focusing on the hardware is that during operation of an 
IBM 1620, that is what would have been printed on paper by the actual 
machines, and what people would have seen in core dumps, or whatever.


The question of what was printed in the *documentation* is a different 
issue, really. That involves figuring out what the editors/typesetters 
of the manuals were doing to represent a symbol generated by 
overstriking by the hardware, for which they had no convenient type to 
use, by whatever word processing and printing technology they were using 
circa 1959. I suspect that both the "Zhe"-like glyph and the "Eris"-like 
glyph we have seen in the printed copies of the manual are themselves 
typesetter substituted glyphs for whatever the 1620 tofu glyph was that 
they were trying to represent. Where they got those glyphs, I dunno -- 
and it might be pretty difficult to track down, because almost all the 
folks who would have known what IBM manual typesetting practices were 
circa 1959 will have passed on by now.


I don't know of any *standard* IBM glyph for this "Eris"-like thingie 
seen in the scanned bit of manual that started this thread -- but my 
documentation is from the 1980's era listings of standardized glyph 
identifiers. Who knows what was going on circa 1959, which predated most 
of the IBM efforts to standardize large glyph sets and large numbers of 
character sets? Back then, "fonts" consisted of what were cast on the 
typebars of typewriters, or on the strikers of line printers, or the 
physical type that typesetters used.


Look at the archival pictures of the IBM 1620. Do you see any display 
font anywhere? That console is a Star-Trek style computer console -- all 
register lights and bit switches and rows of power station style 
light-up buttons. Not a font anywhere. The only font on that machine can 
be found by feeling the key strikers in the typewriter.


--Ken




Re: IBM 1620 invalid character symbol

2017-09-26 Thread Ken Whistler via Unicode

Philippe,

Those aren't negative digits, per se. The usage in the manual is with an 
overline (or macron) to indicate the flag bit. It does occur over a 
zero, and in explanation in the text of floating point operations, it is 
also shown over letters (X, M, E) representing digits of the exponent 
and mantissa. See p. 27 (31 of the pdf) in that same manual, for an 
extensive discussion with lots of examples in the text:


http://www.bitsavers.org/pdf/ibm/1620/A26-5706-3_IBM_1620_CPU_Model_1_Jul65.pdf

The Unicode representation of the text material printed on that page 
would best be done with a combining macron, I think.
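[A trivial sketch of that representation: a 1620 "flagged" digit becomes the base digit followed by U+0304 COMBINING MACRON.]

```python
import unicodedata

# A 1620 "flagged" digit, e.g. 5 with an overline, as Unicode plain text:
flagged_five = '5\u0304'              # DIGIT FIVE + U+0304 COMBINING MACRON
assert unicodedata.name(flagged_five[1]) == 'COMBINING MACRON'
```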


--Ken


On 9/26/2017 6:34 AM, Philippe Verdy via Unicode wrote:
But what is interesting is the use of negative digits (-1 to -9, with 
the minus sign above the digit; I've not seen a case of minus 0, not 
needed apparently by the described operations)
How do you encode these negative decimal digits in Unicode ? with a 
macron diacritic ?






Re: IBM 1620 invalid character symbol

2017-09-26 Thread Ken Whistler via Unicode

Leo,

Yeah, I know. My point was that by examining the physical typewriter 
keys (the striking head on the typebar, not the images on the keypads), 
one could see what could be generated *by* overstriking. I think 
Philippe's suggestion that it was simply an overstrike of "X" with an 
"I" is probably the simplest explanation for the actual operation. And 
the typeset manuals just grabbed some type that looked similar. Note 
that the typewriters in question didn't have a vertical bar or 
backslash, apparently.


But adding an annotation for similar-looking symbols that could be used 
for this is, I agree, probably better than looking for a proposal to 
encode some new symbol for this oddball construction.


If it really is an overstrike, then technically, it could probably also 
be represented as the sequence <0058, 20D2>, just to represent the data.
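[For concreteness, that sequence as plain text -- U+0058 followed by U+20D2 COMBINING LONG VERTICAL LINE OVERLAY:]

```python
import unicodedata

# The overstruck 1620 symbol represented as base letter plus combining overlay:
overstruck = 'X\u20D2'   # <U+0058, U+20D2>
assert unicodedata.name(overstruck[1]) == 'COMBINING LONG VERTICAL LINE OVERLAY'
```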


--Ken


On 9/25/2017 11:34 PM, Leo Broukhis wrote:
If it was implemented as an overprint, either )^H|^H( or \^H|^H/ and 
was intended to signify an invalid character
(for example, in the text part of core dumps, where a period is used 
by hexdump -C), then there would not be a physical key to generate it.




Re: IBM 1620 invalid character symbol

2017-09-25 Thread Ken Whistler via Unicode
The 1620 manual accessed from the Wiki page shows the same information 
but with a different glyph (which looks more like the capital zhe, and 
is presumably the source of the glyph cited in the Wiki page itself). See:


http://www.bitsavers.org/pdf/ibm/1620/A26-5706-3_IBM_1620_CPU_Model_1_Jul65.pdf

p. 52 of the document (56/99 of the pdf).

So there was some significant glyph variation in the 1620 documentation. 
My guess is that the invalid character tofu was implemented as an 
overprint symbol on the 1620 console typewriter (since the overlines and 
the strikethroughs clearly were). The whole system was basically using 
only a 50-character character set. But to verify exactly what was going 
on, somebody would presumably have to examine the physical keys of a 
1620 console typewriter to see what they could generate on paper.


I'm guessing the Computer History Museum ( 
http://www.computerhistory.org/ ) would have one sitting around.


--Ken


On 9/25/2017 9:48 PM, Leo Broukhis via Unicode wrote:
Wikipedia (https://en.wikipedia.org/wiki/IBM_1620#Invalid_character) 
describes the "invalid character" symbol (see attachment) as a 
Cyrillic Ж which it obviously is not.


But what is it? Does it deserve encoding, or is it a glyph variation 
of an existing codepoint?






Re: Rendering variants of U+3127 Bopomofo Letter I

2017-08-24 Thread Ken Whistler via Unicode

Albrecht,

See TUS, Section 18.3, Bopomofo, p. 707:

http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf#G22553

--Ken


On 8/24/2017 12:19 AM, Dreiheller, Albrecht via Unicode wrote:


Hello Chinese experts,

The letter I in the Bopomofo alphabet (U+3127) has two rendering 
variants, a vertical bar and a horizontal bar.


Can anyone please tell me the context criteria, when should which 
variant be used?


Is it PR China using the vertical form (like in the font SimSun) and 
Taiwan using the horizontal form (like in the font PMingLiU)?


Thanks

Albrecht





Re: emoji props in the ucdxml ?

2017-07-05 Thread Ken Whistler via Unicode

Manuel,

I suspect that such a link may already be in the works for the 
/Public/emoji/ data directory. But if you want to make sure your 
suggestion is reviewed by the UTC, you should submit it via the contact 
form:


http://www.unicode.org/reporting.html

--Ken

On 7/5/2017 12:37 PM, Manuel Strehl via Unicode wrote:

but are there any plans to integrate the data in the ucdxml [2]
(possibly as separate files) ?

No. Not unless and until they become formally part of the UCD.

In this context: Would it be possible for the maintainers of the TR #51
data files to add a symlink "latest" under
unicode.org/Public/emoji/latest like there is for the UCD? That would be
a tremendous time saver, at least for me, having a constant URL to fetch
the latest Emoji data from.

Who should I ask for such a link?

Cheers,
Manuel





Re: emoji props in the ucdxml ?

2017-07-05 Thread Ken Whistler via Unicode


On 7/5/2017 10:01 AM, Daniel Bünzli via Unicode wrote:

I know the emoji properties [1] are not formally part of the UCD (not sure 
exactly why though),


Because they are maintained as part of an independent standard now (UTS 
#51), which is still on track to have a faster turnaround -- and hence 
faster data updates -- not synched with the annual versions of the 
Unicode Standard. Hence they cannot be formally a part of the UCD -- 
unless the entire Unicode Standard were going to be churned on a faster 
cycle as well.



but are there any plans to integrate the data in the ucdxml [2] (possibly as 
separate files) ?


No. Not unless and until they become formally part of the UCD.

--Ken



Re: Announcing The Unicode® Standard, Version 10.0

2017-06-21 Thread Ken Whistler via Unicode

I wonder IF 9 times suffice,
But IF more are required,
I'll tweet ILY, tweet it twice --
Since spelling's been retired.


On 6/21/2017 8:37 AM, William_J_G Overington via Unicode wrote:

Here is a mnemonic poem, that I wrote on Monday 20 February 2017, now published 
as U+1F91F is now officially in The Unicode Standard.

One eff nine one eff
Is the code number to say
In one symbol
A very special message
To a loved one far away

In an email
Or a message of text






Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote:

TUS Section 3 is like the Augean Stables.  It is a complete mess as a
standards document,


That is a matter of editorial taste, I suppose.


imputing mental states to computing processes.


That, however, is false. The rhetorical turn in the Unicode Standard's 
conformance clauses, "A process shall interpret..." and "A process shall 
not interpret..." has been in the standard for 21 years, and seems to 
have done its general job in guiding interoperable, conformant 
implementations fairly well. And everyone -- well, perhaps almost 
everyone -- has been able to figure out that such wording is a shorthand 
for something along the lines of "Any person implementing software 
conforming to the Unicode Standard in which a process does X shall 
implement it in such a way that that process when doing X shall follow 
the specification part Y, relevant to doing X, exactly according to that 
specification of Y...", rather than a misguided assumption that software 
processes are cognitive agents equipped with mental states that the 
standard can "tell what to think".


And I contend that the shorthand works just fine.



Table 3-7 for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
forms'.


Well, Definition D92 does already explicitly limit UTF-8 to Unicode 
scalar values, and explicitly limits the form to sequences of one to 
four bytes. The reason why it doesn't explicitly include the exclusion 
of "non-shortest form" in the definition, but instead refers to Table 
3-7 for the well-formed sequences (which, btw explicitly rule out all 
the non-shortest forms), is because that would create another 
terminological conundrum -- trying to specify an air-tight definition of 
"non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is 
terminologically cleaner to let people *derive* non-shortest form from 
the explicit exclusions of Table 3-7.



Instead, the exclusion of the sequence  is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because it
would define either a 'non-shortest form' or a value that is not a
Unicode scalar value.


Actually 0xFC fails quite simply and unambiguously, because it is not in 
Table 3-7. End of story.


Same for 0xFF. There is nothing architecturally special about 
0xF5..0xFF. All are simply and unambiguously excluded from any 
well-formed UTF-8 byte sequence.
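[The point that well-formedness falls directly out of the table, with no special cases for 0xFC or 0xFF, can be illustrated by transcribing Table 3-7 and checking membership row by row. A minimal sketch, not a production validator:]

```python
# The rows of Table 3-7: (lead-byte range, trailing-byte ranges).
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),   # excludes surrogates
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),  # caps at U+10FFFF
]

def is_well_formed_utf8(data: bytes) -> bool:
    """True iff every byte sequence in `data` matches a Table 3-7 row."""
    i = 0
    while i < len(data):
        for (lo, hi), trail in TABLE_3_7:
            if lo <= data[i] <= hi:
                tail = data[i + 1 : i + 1 + len(trail)]
                if len(tail) == len(trail) and all(
                    t_lo <= b <= t_hi for b, (t_lo, t_hi) in zip(tail, trail)
                ):
                    i += 1 + len(trail)
                    break
        else:
            # Lead byte in no row (0x80..0xC1, 0xF5..0xFF, ...) or bad trail.
            return False
    return True

assert not is_well_formed_utf8(b'\xfc\x80\x80\x80\x80\x80')  # 0xFC: not in the table
assert not is_well_formed_utf8(b'\xed\xa0\x80')              # surrogate D800
assert not is_well_formed_utf8(b'\xc0\xaf')                  # non-shortest form
```

Note that non-shortest forms, surrogates, and the 0xF5..0xFF lead bytes are all rejected by the same mechanism: their sequences simply match no row.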




The differences are a matter of presentation; the outcome as to what is
permitted is the same.  The difference lies rather in whether the rules
are comprehensible.  A comprehensible definition is more likely to be
implemented correctly.  Where the presentation makes a difference is in
how malformed sequences are naturally handled.


Well, I don't think implementers have all that much trouble figuring out 
what *well-formed* UTF-8 is these days.


As for "how malformed sequences are naturally handled", I can't really 
say. Nor do I think the standard actually requires any particular 
handling to be conformant. It says thou shalt not emit them, and if you 
encounter them, thou shalt not interpret them as Unicode characters. 
Beyond that, it would be nice, of course, if people converged their 
error handling for malformed sequences in cooperative ways, but there is 
no conformance statement to that effect in the standard.


I have no trouble with the contention that the wording about "best 
practice" and "recommendations" regarding the handling of U+FFFD has 
caused some confusion and differences of interpretation among 
implementers. I'm sure the language in that area could use cleanup, 
precisely because it has led to contending, incompatible interpretations 
of the text. As to what actually *is* best practice in use of U+FFFD 
when attempting to convert ill-formed sequences handed off to UTF-8 
conversion processes, or whether the Unicode Standard should attempt to 
narrow down or change practice in that area, I am completely agnostic. 
Back to the U+FFFD thread for that discussion.


--Ken



Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote:

By definition D39b, either sequence of bytes, if encountered by a
conformant UTF-8 conversion process, would be interpreted as a
sequence of 6 maximal subparts of an ill-formed subsequence.

("D39b" is a typo for "D93b".)


Sorry about that. :)



Conformant with what?  There is no mandatory *requirement* for a UTF-8
conversion process conformant with Unicode to have any concept of
'maximal subpart'.


Conformant with the definition of UTF-8. I agree that nothing forces a 
conversion *process* to care anything about maximal subparts, but if 
*any* process using a conformant definition of UTF-8 then goes on to 
have any concept of "maximal subpart of an ill-formed subsequence" that 
departs from definition D93b in the Unicode Standard, then it is just 
making s**t up.





I don't see a good reason to build in special logic to treat FC 80 80
80 80 80 as somehow privileged as a unit for conversion fallback,
simply because *if* UTF-8 were defined as the Unix gods intended
(which it ain't no longer) then that sequence *could* be interpreted
as an out-of-bounds scalar value (which it ain't) on spec that the
codespace *might* be extended past 10FFFF at some indefinite time in
the future (which it won't).

Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
invalid sequence.


That would be equally true of FF FF FF FF FF FF. Which was my point, 
actually.



   FC is not ASCII,


True, of course. But irrelevant. Because we are talking about UTF-8 
here. And just because some non-UTF-8 character encoding happened to 
include 0xFC as a valid (or invalid) value, might not require any 
special case processing. A simple 8-bit to 8-bit conversion table could 
be completely regular in its processing of 0xFC for a conversion.



  and has more than one leading bit
set.  It has the six leading bits set,


True, of course.


  and therefore should start a
sequence of 6 characters.


That is completely false, and has nothing to do with the current 
definition of UTF-8.


The current, normative definition of UTF-8, in the Unicode Standard, and 
in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and 
replaces RFC 2279") states clearly that 0xFC cannot start a sequence of 
anything identifiable as UTF-8.


--Ken



Richard.





Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:

You were implicitly invited to argue that there was no need to handle
5 and 6 byte invalid sequences.



Well, working from the *current* specification:

FC 80 80 80 80 80
and
FF FF FF FF FF FF

are equal trash, uninterpretable as *anything* in UTF-8.

By definition D39b, either sequence of bytes, if encountered by a 
conformant UTF-8 conversion process, would be interpreted as a sequence 
of 6 maximal subparts of an ill-formed subsequence. Whatever your 
particular strategy for conversion fallbacks for uninterpretable 
sequences, it ought to treat either one of those trash sequences the 
same, in my book.


I don't see a good reason to build in special logic to treat FC 80 80 80 
80 80 as somehow privileged as a unit for conversion fallback, simply 
because *if* UTF-8 were defined as the Unix gods intended (which it 
ain't no longer) then that sequence *could* be interpreted as an 
out-of-bounds scalar value (which it ain't) on spec that the codespace 
*might* be extended past 10FFFF at some indefinite time in the future 
(which it won't).
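[For concreteness (using Python, whose decoder happens to follow the one-U+FFFD-per-maximal-subpart practice): since neither FC nor FF can begin any well-formed sequence, every byte in both sequences is its own maximal subpart, and both decode identically.]

```python
bad_fc = b'\xfc\x80\x80\x80\x80\x80'   # the hypothetical "6-byte" sequence
bad_ff = b'\xff\xff\xff\xff\xff\xff'   # pure garbage

# Each byte is its own maximal subpart (FC and FF start nothing well-formed,
# and a lone 80 is invalid), so both come out as six replacement characters:
assert bad_fc.decode('utf-8', errors='replace') == '\ufffd' * 6
assert bad_ff.decode('utf-8', errors='replace') == '\ufffd' * 6
```

No special status for the FC-led sequence: equal trash, treated equally.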


--Ken


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-26 Thread Ken Whistler via Unicode


On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:

The link provided about the PRI doesn't lead to the comments.



PRI #121 (August, 2008) pre-dated the practice of keeping all the 
feedback comments together with the PRI itself in a numbered directory 
with the name "feedback.html". But the comments were collected together 
at the time and are accessible here:


http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121

Also there was a separately submitted comment document:

http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt

And the minutes of the pertinent UTC meeting (UTC #116):

http://www.unicode.org/L2/L2008/08253.htm

The minutes simply capture the consensus to adopt Option #2 from PRI 
#121, and the relevant action items.


I now return the floor to the distinguished disputants to continue 
litigating history. ;-)


--Ken







Re: Comparing Raw Values of the Age Property

2017-05-23 Thread Ken Whistler via Unicode

Richard


On 5/23/2017 1:48 PM, Richard Wordingham via Unicode wrote:

The object is to generate code *now* that, up to say Unicode Version 23.0,
can work out, from the UCD files DerivedAge.txt and
PropertyValueAliases.txt, whether an arbitrary code point was included
by some Unicode version identified by a value of the property Age.


Ah, but keep in mind, if projecting out to Version 23.0 (in the year 
2030, by our current schedule), there is a significant chance that 
particular UCD data files may have morphed into something entirely 
different. Recall how at one point Unihan.txt morphed into Unihan.zip 
with multiple subpart files. Even though the maintainers of the UCD data 
files do our best to maintain them to be as stable as possible, their 
content and sometimes their formats do morph gradually from release to 
release. Just don't expect *any* parser to be completely forward proofed 
against what *might* happen in the UCD in some future version.


On the other hand, for the property Age, even in the absence of 
normative definitions of invariants for the property values, given 
recent practice, it is pretty damn safe to assume:


A. Major versions will continue to have two digits, incremented by one 
for each subsequent version: 10, 11, 12, ... 99.
B. Minor versions will mostly (if not entirely) consist of the value 
"0", and will never require two digits.


Assumption A will get you through this century, which by my estimation 
should well exceed the lifetime of any code you might be writing now 
that depends on it.
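[Under assumptions A and B, comparing Age short values safely reduces to a numeric (major, minor) sort. A sketch; `age_key` is an illustrative name, not anything from the UCD tooling:]

```python
def age_key(short_value: str):
    """Sort key for Age short values such as '1.1' or '10.0'.

    'NA' (Unassigned) sorts after every real version.  Comparison is
    numeric on (major, minor), never lexicographic on the string.
    """
    if short_value == 'NA':
        return (float('inf'), 0)
    major, minor = short_value.split('.')
    return (int(major), int(minor))

# Lexicographic string comparison would wrongly put '10.0' before '9.0':
assert age_key('9.0') < age_key('10.0')
assert age_key('23.0') < age_key('NA')
```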


BTW, unlike many actual products, the version numbering of the Unicode 
Standard is not really driven by marketing concerns. So there is very 
little chance of some version sequence for Unicode that ends up fitting 
a pattern like: 3.0, 3.1, 95 or NT, 98, 2000, XP, Vista, 7, 8, 8.1, 10 
... ;-)




What TUS 9.0, its appendices and annexes is lacking is a clear
statement such as, "The short values for the Age property are of the
form "m.n", with the first field corresponding to the major version,
and the second field corresponding to the minor version. There is no
need for a third version field, because new characters are never
assigned in update versions of the standard."


I think the UTC and the editors had just been assuming that the pattern 
was so obvious that it needed no explaining. But the lack of a clear 
description of Age had become apparent, which is why I wrote that text 
to add to UAX #44 for the upcoming version.



  Conveniently, this
almost true statement is included in Section 5.14 of the proposed
update to UAX#44 (in Draft 12 to be precise.  It's not quite true, for
there is also the short value NA for Unassigned.  Is there any way of
formally recording this oversight?


Yes. You could always file another piece of feedback using the contact 
form. However, in this case, you already have the attention of the 
editors of UAX #44. So my advice would be to simply wait now for the 
publication of Version 10.0 of UAX #44 around the 3rd week of June.


--Ken




Re: English flag (from Re: How to Add Beams to Notes)

2017-05-03 Thread Ken Whistler via Unicode


On 5/3/2017 3:20 AM, William_J_G Overington via Unicode wrote:

Surely a single code point could be found. Single code points are being found 
for various emoji items on a continuing basis. Why pull up the ladder on 
encoding some flags each with a single code point?

Yes, a single code point for an English flag please. And one for a Welsh flag 
too please. And one for a Scottish flag too please. And some others please, if 
that is what end users want.


I suggest the following:

10BEDE for an English flag (reminding one of Bede the Venerable)
10CADF for a Welsh flag (harking to Cadfan ap Iago, King of Gwynedd)
10A1BA for a Scottish flag (for Alba, of course)

Surely those would work for you!

--Ken