Re: Why do binary files contain text but text files don't contain binary?
On 2/21/2020 7:53 AM, Costello, Roger L. via Unicode wrote: Text files may indeed contain binary (i.e., bytes that are not interpretable as characters). Namely, text files may contain newlines, tabs, and some other invisible things. Question: "characters" are defined as only the visible things, right? No. You've gone astray right there. Please read Chapter 2 of the Unicode Standard, and in particular, Section 2.4, Code Points and Characters: https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564 All of those types of characters can occur in Unicode plain text. (With the exception of surrogate code points.) I conclude: Binary files may contain arbitrary text. Binary files can contain *whatever*, including text. Text files may contain binary, but only a restricted set of binary. The distinction is definitional. A text file contains *only* characters, interpretable by a specific character encoding (usually Unicode, these days). But a text file need not be "plain text". An HTML file is an example of a text file (it contains only a sequence of characters, whose identity and interpretation is all clearly specified by looking them up in the Unicode Standard), but it is not *plain* text. It is *rich* text, consisting of markup tags interspersed with runs of plain text. Another distinction that may be leading you astray is the distinction between binary file transfer and text file transfer. If you are using ftp, for example, you can specify use of binary file transfer, *even if* the file you are transferring is actually a text file. That simply means that the file transfer will agree to treat the entire file as a binary blob and transfer it byte-for-byte intact. A text file transfer, on the other hand, may look for "lines" in a text file and may adjust line endings to suit the receiving platform conventions. Do you agree? No. --Ken
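The definitional point above (a text file contains only characters interpretable under a specific character encoding) can be sketched as a strict decode test. This is an illustrative check, not something from the original message; the function name `is_utf8_text` is hypothetical, and UTF-8 is assumed as the encoding. A file that fails here might still be text in some other encoding.

```python
def is_utf8_text(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict mode: raises on ill-formed input
    except UnicodeDecodeError:
        return False
    return True


# Tabs and newlines are characters too, so this is text.
print(is_utf8_text(b"tab\there\nnewline"))
# 0xFF can never appear in UTF-8, so this is not UTF-8 text.
print(is_utf8_text(b"\xff\xfe\x00\x01"))
# An encoded surrogate code point (U+D800) is rejected as well.
print(is_utf8_text(b"\xed\xa0\x80"))
```

The last case mirrors the exception noted above: surrogate code points cannot occur in Unicode plain text.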
Re: Egyptian Hieroglyph Man with a Laptop
Well, no, in this case "strange" means strange, as Ken Lunde notes. I'm just pointing to his list, because it pulls together quite a few Han characters that *also* have dubious cases for encoding. Or you could turn the argument around, I suppose, and note that just because the hieroglyph for "Egyptologist" is strange, that doesn't necessarily mean that the case for encoding it is dubious. ;-) --Ken On 2/13/2020 3:47 PM, j...@koremail.com wrote: An interesting comparison, if strange means dubious, then the name kstrange should be changed or some of the content removed because many of the characters in the set are not dubious in the least.
Re: Egyptian Hieroglyph Man with a Laptop
You want "dubious"?! You should see the hundreds of strange characters already encoded in the CJK *Unified* Ideographs blocks, as recently documented in great detail by Ken Lunde: https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf Compared to many of those, a hieroglyph of a man (or woman) holding a laptop is positively orthodox! --Ken On 2/13/2020 11:47 AM, Phake Nick via Unicode wrote: Those characters could also be put into another block for the same script similar to how dubious characters in CJK are included by placing them into "CJK Compatibility Ideographs" for round trip compatibility with source encoding.
Re: Combining Marks and Variation Selectors
Richard, What it comes down to is avoidance of conundrums involving canonical reordering for normalization. The effect of variation selectors is defined in terms of an immediate adjacency. If you allowed variation selectors to be defined for combining marks of ccc!=0, then normalization of sequences could, in principle, move the two apart. That would make implementation of the intended rendering much more difficult. That is basically why the UTC, from the start, ruled out using variation selectors to try to make graphic distinctions between different styles of acute accent marks explicit, for example. --Ken On 2/1/2020 7:30 PM, Richard Wordingham via Unicode wrote: Ah, I missed that change from Version 5.0, where the restriction was, 'The base character in a variation sequence is never a combining character or a decomposable character'. I now need to rephrase the question. Why are marks other than spacing marks prohibited?
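A minimal sketch of why adjacency is fragile under normalization, using only Python's standard `unicodedata` module. It is not from the original message. The sequence of a combining acute followed by VS1 is hypothetical (no such variation sequence is defined), and the second case shows composition rather than reordering as such, as one way normalization can disturb what a variation selector sits next to.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, ccc=230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, ccc=220
VS1 = "\ufe00"        # VARIATION SELECTOR-1, ccc=0


def code_points(s: str) -> list[str]:
    """Format a string as a list of U+XXXX code points."""
    return [f"U+{ord(c):04X}" for c in s]


# Canonical reordering: marks with ccc != 0 are sorted by combining class,
# so the acute and the dot below swap places in NFD.
reordered = unicodedata.normalize("NFD", "a" + ACUTE + DOT_BELOW)
print(code_points(reordered))

# Hypothetical variation sequence on a combining mark: after NFC the acute
# is absorbed into U+00E1, and VS1 now follows the precomposed letter.
composed = unicodedata.normalize("NFC", "a" + ACUTE + VS1)
print(code_points(composed))
```

In the second case the selector is no longer adjacent to a separate acute accent at all, which is the kind of conundrum an immediate-adjacency rule cannot tolerate.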
Re: Adding Experimental Control Characters for Tai Tham
Richard, Given that those particular two variation selectors have already given very specific semantics for emoji sequences, and would now be expected to occur *only* in emoji sequences: https://www.unicode.org/reports/tr51/#def_text_presentation_selector usurping them to do something unrelated would probably not be a good idea. For experimentation purposes, VS13 and VS14 would be safer. --Ken On 1/25/2020 10:41 AM, Richard Wordingham via Unicode wrote: How inappropriate would it be to usurp a pair of variation selectors for this purpose? For mnemonic purposes, I would suggest usurping FE0E VARIATION SELECTOR-15 for *1A8E TAI THAM SIGN INITIAL FE0F VARIATION SELECTOR-16 for *1A8F TAI THAM SIGN FINAL
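For reference, the four selectors discussed here can be listed with Python's standard `unicodedata` module. This is an illustrative sketch only; the heart example is one of the defined emoji variation sequences, and nothing below implements the Tai Tham experiment itself.

```python
import unicodedata

# VS13 and VS14 are suggested above for experimentation;
# VS15 and VS16 carry the text/emoji presentation semantics of UTS #51.
for cp in (0xFE0C, 0xFE0D, 0xFE0E, 0xFE0F):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)))

HEART = "\u2764"               # HEAVY BLACK HEART
text_style = HEART + "\ufe0e"   # text presentation sequence
emoji_style = HEART + "\ufe0f"  # emoji presentation sequence
print(len(text_style), len(emoji_style))
```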
Re: Not accepted by UTC but in ISO ballot?
Shriramana, That category is used to track character(s) in process that may have been approved by WG2 but are not yet in ballot, or are in contention, and may have just been dropped from ballot, but which still have sufficient visibility to be tracked. The process is a bit rough around the edges when dealing with two separate committees with asynchronous processes and not all of whose members have unanimous agreement about what they are moving forward on. The pipeline is a means of tracking various status as the committees work to synchronize their eventual publications of new repertoire. --Ken On 12/27/2019 7:06 AM, Shriramana Sharma via Unicode wrote: Now I'm wondering about the similar category "not accepted by UTC, and not in ISO ballot" – why such a character would be mentioned on the pipeline at all…
Re: Not accepted by UTC but in ISO ballot?
Shriramana, On 12/20/2019 6:29 PM, Shriramana Sharma via Unicode wrote: I was looking at the pipeline for something else, and for the first time I see a character category: “not accepted by the UTC but in ISO ballot” and two characters in it. Those two characters changed status as of December 4, when the disposition of comments for CD3 was posted. They will not be part of the DIS ballot. The pipeline has now been updated to reflect that change of status. So IIUC while technically people are free to submit a document to the ISO separately without submitting to UTC, it has always been the practice to my knowledge to get a character approved by the UTC first. That is a preferred process, but doesn't always occur. The most obvious exception is that large new CJK repertoire additions are developed by the IRG and often go into ballot in ISO before the UTC takes a formal decision to approve them. CJK Extension G has now been approved for 13.0 by the UTC, but the entire block was listed in the pipeline for some time as "not accepted by UTC, but in active ISO technical ballot" once Extension G went into CD balloting. --Ken
Re: HEAVY EQUALS SIGN
On 12/20/2019 7:17 AM, wjgo_10...@btinternet.com via Unicode wrote: It is indeed interesting that the Notice of Non-Approval itself uses italics for emphasis in two places. That text, at the present time, cannot be expressed in Unicode plain text with the emphasis that the Notice of Non-Approval includes. ... which was /precisely/ the point. I'm glad you noticed. --Ken
Re: New Public Review on QID emoji
On 10/30/2019 10:41 AM, wjgo_10...@btinternet.com via Unicode wrote: At present I have a question to which I cannot find the answer. Is the QID emoji format, if approved by the Unicode Technical Committee going to be sent to the ISO/IEC 10646 committee for consideration by that committee? No. As the QID emoji format is in a Unicode Technical Standard and does not include the encoding of any new _atomic_ characters, I am concerned that the answer to the above question may well be along the lines of "No" maybe with some reasoning as to why not. As you surmised. Yet will a QID emoji essentially be _de facto_ a character even if not _de jure_ a character? That distinction is effectively meaningless. There are any number of entities that end users perceive as "characters", which are not represented by a single code point in the Unicode Standard (or 10646) -- and this has been the case now for decades. Yet if QID emoji are implemented by Unicode Inc. without also being implemented by ISO/IEC 10646 then that could lead to future problems, notwithstanding any _de jure_ situation that QID emoji are not characters, because they will be much more than Private Use characters yet less than characters that are in ISO/IEC 10646. What you are missing is that *many* emoji are already represented by sequences of characters. See emoji modifier sequences, emoji flag sequences, emoji ZWJ sequences. *None* of those are specified in 10646, have not been for years now, and never will be. And yet, there is no de jure standardization crisis here, or any interoperability issue for emoji arising from that situation. I am in favour of the encoding of the QID emoji mechanism and its practical application. However I wonder about what are the consequences for interoperability and communication if QID emoji become used - maybe quite widely - and yet the tag sequences are not discernable in meaning from ISO/IEC 10646 or any related ISO/IEC documents. 
There may well be interoperability concerns specifically for the QID emoji mechanism, but that would be an issue pertaining to the architecture of that mechanism specifically. It isn't anything to do with the relationship between the Unicode Standard (and UTS #51) and ISO/IEC 10646. --Ken
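The three kinds of sequences named above can be sketched in a few lines. This is an illustration under stated assumptions, not part of the original reply: the helper name `flag` is hypothetical, and whether a given sequence renders as a single glyph depends on the font and platform.

```python
def flag(region: str) -> str:
    """Build an emoji flag sequence from a two-letter region code."""
    base = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
    return "".join(chr(base + ord(ch) - ord("A")) for ch in region.upper())


japan = flag("JP")                             # emoji flag sequence
waving_hand = "\U0001F44B\U0001F3FD"           # emoji modifier sequence
man_technologist = "\U0001F468\u200D\U0001F4BB"  # emoji ZWJ sequence

for name, seq in [("flag", japan), ("modifier", waving_hand),
                  ("zwj", man_technologist)]:
    print(name, [f"U+{ord(c):04X}" for c in seq])
```

Each of these is perceived by end users as one "character" while being represented by several code points.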
Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?
On 10/12/2019 3:15 AM, Fred Brennan via Unicode wrote: There seems to be no conscionable reason for such a long delay after the approval. If that's just how things are done, fine, I certainly can't change the whole system. But imagine if you had to wait two years to even have a chance of using a letter you desperately need to write your language? Imagine if the letter "Q" was unencoded and Noto refused to add it for two more years? Well, as long as we are imagining things, then consider a scenario where the UTC is presented a proposal for encoding a writing system which is reported as an historic artifact of the 18th century, "fallen out of normal use", yet encodes it anyway based on the proposal provided in 1999: https://www.unicode.org/L2/L1999/n1933.pdf and publishes it in Unicode 3.2 in 2002: https://www.unicode.org/standard/supported.html Then imagine that a community works to revive use of that script (now known as Baybayin) and extends character use in it based on similar characters in related, more contemporaneous scripts, but that the first time the UTC actually formally hears about that extension is on July 18, 2019: https://www.unicode.org/L2/L2019/19258r-baybayin-ra.pdf And then imagine that despite a 17 year gap before this supposedly urgent defect in an encoding is reported to the UTC, that the UTC in fact approves encoding of U+170D TAGALOG LETTER RA at its very *first* opportunity, eight days later, on July 26, 2019. Further imagine that the UTC immediately publishes what amounts to a "letter of intent" to publish this character when it can: https://www.unicode.org/alloc/Pipeline.html#future It may then be understandable that some UTC participants might be puzzled to be accused of unconscionable delays in this case. I understand the frustration that you are expressing, but it simply isn't feasible for every proposal's advocates to get their particular candidates pushed to the front of the line for publication. 
Unicode 13.0 is creaking down the track towards its March 10, 2020 publication, but it already is contending with 5930 new characters (as well as additional emoji sequences beyond that), every one of which was approved by the UTC *prior* to July 26, 2019 and all of which are already in some advanced stage of ISO ballot consideration. In the meantime, Baybayin users are inconvenienced, sure, but it is unlikely that the interim solutions will just break, because nobody is opposed to U+170D TAGALOG LETTER RA, and it is exceedingly unlikely that that code point would be moved before its eventual publication in the standard in March, 2021. --Ken
Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?
Sorry about the typo there. I meant "the published Version 13.0 next March" --Ken On 10/11/2019 10:17 AM, Ken Whistler wrote: then eventually in the published Version 13.0 next month:
Re: Will TAGALOG LETTER RA, currently in the pipeline, be in the next version of Unicode?
Short answer is no. The characters in the pipeline section labeled "Characters Accepted for Version 13.0" are what will be in the beta review for 13.0 (look for that sometime next month), and then eventually in the published Version 13.0 next month: https://www.unicode.org/alloc/Pipeline.html#planned_next_version Characters listed in the "Characters for Future Versions" table: https://www.unicode.org/alloc/Pipeline.html#future are not yet targeted for any particular version. Many of them, including the Tagalog letter RA, will end up published in Unicode 14.0, but the detailed decisions on what makes it into Unicode 14.0 won't happen until sometime next summer. Production of new versions of the Unicode Standard is a ponderous and lengthy operation, involving 4 UTC meetings, uncounted subcommittee meetings, dozens of specifications, hundreds of character properties, thousands of characters, hundreds of fonts, and intricate charts and QA process. It doesn't happen at the drop of a hat, which is why we schedule a full year for each new major release. So, in general, no, you can *never* assume that once the UTC has just approved a new character that it will be in the next version of Unicode. --Ken On 10/11/2019 4:35 AM, Fred Brennan via Unicode wrote: Many users are asking me and I'm not sure of the answer (nor how to find it out). The UTC approved it, so it will be in the next version of Unicode, right? We sure hope so...it is a character needed to write a script in current use. Although only a minority of people care about it, that minority is dedicated! Best, Fred Brennan
Re: On the lack of a SQUARE TB glyph
Fred, 2 hours and 33 minutes from now (today). But you don't need to try to synch a proposal like this to a particular script ad hoc meeting. That group meets roughly once a month, and any new proposal coming in right now wouldn't be on the Unicode 13.0 train, even if the UTC immediately agreed to it. So there isn't an immediately urgent deadline for new proposals. --Ken On 9/26/2019 10:15 PM, Fred Brennan via Unicode wrote: When does the Script Ad Hoc meet next?
Re: On the lack of a SQUARE TB glyph
On 9/26/2019 4:21 AM, Fred Brennan via Unicode wrote: There is a clear demand for a SQUARE TB. In the font SMotoya Sinkai W55 W3, which is ©2008 株式会社 モトヤ, the glyph is unencoded and accessed via the Discretionary Ligatures (`dlig`) OpenType feature. It has name `T_B.dlig`. Aye, there's the rub. Despite the subject of this thread, the problem is not the lack of a "glyph". This and many other particular squared forms may exist in Japanese fonts. The question then devolves to whether there is a *character* encoding issue here. What data representation and interchange issue is being raised here that requires an atomic character encoding, when the *presentation* issue can just be handled with OpenType features and already existing characters? If the concern is about future-proofing the standard, then clearly, instead of indefinitely extending various groups of squared combinations for SI values, other technical values, etc., etc., the generative and scaleable way forward is simply to let Japanese squared sequence coinages be handled with OpenType features, rather than insisting that each one come back to the UTC for one-by-one character encoding. Note that there is a certain, systemic similarity here to the problem of extensibility of emoji, where encoding of multiple flags, of multiple skin tones, or of multiple gender representations, etc., is handled more generally by specifying how fonts need to map specified sequences into single glyphs, rather than by insisting that every meaningful combination end up encoded as an atomic character. --Ken
Re: PUA (BMP) planned characters HTML tables
On 8/14/2019 4:32 PM, James Kass via Unicode wrote: If a character gets deprecated, can its decomposition type be changed from canonical to compatibility? Simple answer: No. --Ken
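The distinction the question turns on can be inspected with Python's standard `unicodedata` module. This is a small illustrative sketch, not part of the original exchange; the helper name `decomposition_kind` is hypothetical. A compatibility mapping carries a tag in angle brackets, while a canonical mapping does not.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Classify a character's decomposition mapping."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    return "compatibility" if mapping.startswith("<") else "canonical"


print(decomposition_kind("\u00e9"))  # e with acute: canonical
print(decomposition_kind("\ufb01"))  # fi ligature: compatibility
print(decomposition_kind("a"))       # no decomposition

# Canonical normalization leaves the ligature alone;
# compatibility normalization replaces it.
print(unicodedata.normalize("NFD", "\ufb01") == "\ufb01")
print(unicodedata.normalize("NFKD", "\ufb01"))
```

Changing a mapping from one kind to the other would change the results of normalizing existing text, which gives some sense of why the answer is a flat no.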
Re: New website
Your helpful suggestions will be passed along to the people working on the new site. In the meantime, please note that the link to the "Unicode Technical Site" has been added to the left column of quick links in the page bottom banner, so it is easily available now from any page on the new site. --Ken On 7/22/2019 9:54 AM, Zachary Carpenter wrote: It seems that many of the concerns expressed here could be resolved with a menu link to the “Unicode Technical Site” on the left-hand menu bar
Re: Akkha script (used by Eastern Magar language) in ISO 15924?
See the entry for "Magar Akkha" on: http://linguistics.berkeley.edu/sei/scripts-not-encoded.html Anshuman Pandey did preliminary research on this in 2011. http://www.unicode.org/L2/L2011/11144-magar-akkha.pdf It would be premature to assign an ISO 15924 script code, pending the research to determine whether this script should be separately encoded. --Ken On 7/22/2019 9:16 AM, Philippe Verdy via Unicode wrote: According to Ethnologue, the Eastern Magar language (mgp) is written in two scripts: Devanagari and "Akkha". But the "Akkha" script does not seem to have any ISO 15924 code. The Ethnologue currently assigns a private use code (Qabl) for this script. Was the addition delayed due to lack of evidence (even if this language is official in Nepal and India)? Did the editors of Ethnologue submit an addition request for that script (e.g. for the code "Akkh" or "Akha")? Or is it considered unified with another script that could explain why it is not coded? If this is a variant it could have its own code (like Nastaliq in Arabic). Or maybe this is just a subset of another (Sino-Tibetan) script?
Access to the Unicode technical site (was: Re: Unicode's got a new logo?)
On 7/18/2019 11:50 AM, Steffen Nurpmeso via Unicode wrote: I also decided to enter /L2 directly from now on. For folks wishing to access the UTC document register, Unicode Consortium standards, and so forth, all of those links will be permanently stable. They are not impacted by the rollout of the new home page and its related content. If you need access to the more technical information from the UTC, CLDR-TC, ICU-TC, etc., feel free to bookmark such pages as: https://www.unicode.org/L2/ for the UTC document register. https://www.unicode.org/charts/ for the Unicode code charts index, https://www.unicode.org/versions/latest/ for the latest version of the Unicode Standard, and so forth. All such technical links are stable on the site, and will continue to be stable. For general access to the technical content on the Unicode website, see: https://www.unicode.org/main.html which provides easy link access to all the technical content areas and to the ongoing technical committee work. --Ken
Re: ISO 15924 : missing indication of support for Syriac variants
On 7/17/2019 4:54 PM, Philippe Verdy via Unicode wrote: then the Unicode version (age) used for Hieroglyphs should also be assigned to Hieratic. It is already. In fact the ligatures system for the "cursive" Egyptian Hieratic is so complex (and may also have its own variants showing its progression from Hieroglyphs to Demotic or Old Coptic), that probably Hieratic should no longer be considered "unified" with Hieroglyphs, and its existing ISO 15924 code is then not represented at all in Unicode. It *is* considered unified with Egyptian hieroglyphs, until such time as anyone would make a serious case that the Unicode Standard (and students of the Egyptian hieroglyphs, in both their classic, monumental forms and in hieratic) would be better served by a disunification. Note that *many* cursive forms of scripts are not easily "supported" by out-of-the-box plain text implementations, for obvious reasons. And in the case of Egyptian hieroglyphs, it would probably be a good strategy to first get some experience in implementations/fonts supporting the Unicode 12.0 controls for hieroglyphs, before worrying too much about what does or doesn't work to represent hieratic texts adequately. (Demotic is clearly a different case.) For now ISO 15924 still does not consider Egyptian Hieratic to be "unified" with Egyptian Hieroglyphs; this is not indicated in its descriptive names given in English or French with a suffix like "(cursive variant of Egyptian Hieroglyphs)", *and it has no "Unicode Age" version given, as if it was still not encoded at all by Unicode*, That latter part of that statement (highlighted) is false, as is easily determined by simple inspection of the Egyh entry on: https://www.unicode.org/iso15924/iso15924-codes.html --Ken
Re: Unicode "no-op" Character?
On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote: Is my idea impossible, useless, or contradictory? Not at all. What you are proposing is in the realm of higher-level protocols. You could develop such a protocol, and then write processes that honored it, or try to convince others to write processes to honor it. You could use PUA characters, or non-characters, or existing control codes -- the implications for use of any of those would be slightly different, in practice, but in any case would be an HLP. But your idea is not a feasible part of the Unicode Standard. There are no "discardable" characters in Unicode -- *by definition*. The discussion of "ignorable" characters in the standard is nuanced and complicated, because there are some characters which are carefully designed to be transparent to some, well-specified processes, but not to others. But no characters in the standard are (or can be) ignorable by *all* processes, nor can a "discardable" character ever be defined as part of the standard. The fact that there are a myriad of processes implemented (and distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) conversion to/from UTF-16 by integral type conversion is a simple existence proof that U+000F is never, ever, ever, ever going to be defined to be "discardable" in the Unicode Standard. --Ken
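The "conversion by integral type conversion" mentioned above can be sketched as follows. This is a hypothetical illustration, not code from any particular implementation: each 8-bit unit is simply widened to a code point with the same numeric value, so a byte 0x0F necessarily arrives as U+000F and cannot be silently dropped.

```python
def widen_latin1(data: bytes) -> str:
    """Convert 7-bit ASCII or 8859-1 bytes by widening each byte."""
    return "".join(chr(b) for b in data)


raw = b"A\x0fB"
text = widen_latin1(raw)

print([f"U+{ord(c):04X}" for c in text])
# The widened result matches Python's own ISO 8859-1 decoder.
print(text == raw.decode("latin-1"))
# In UTF-16 (little-endian) each code unit is the byte plus a zero byte.
print(text.encode("utf-16-le"))
```

A process written this way has no step at which a "discardable" character could be recognized, which is the existence proof the reply refers to.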
Re: acute-macron hybrid?
On 4/30/2019 12:45 AM, Julian Bradfield via Unicode wrote: What is its appropriate Unicode representation? A macron. --Ken
Re: Variation Sequences (and L2-11/059)
On 3/13/2019 2:42 AM, Janusz S. Bień via Unicode wrote: Hi! On Mon, Jul 16 2018 at 7:07 +02, Janusz S. Bień via Unicode wrote: FAQ (http://unicode.org/faq/vs.html) states: For historic scripts, the variation sequence provides a useful tool, because it can show mistaken or nonce glyphs and relate them to the base character. It can also be used to reflect the views of scholars, who may see the relation between the glyphs and base characters differently. Also, new variation sequences can be added for new variant appearances (and their relation to the base characters) as more evidence is discovered. I'm proof-reading a paper where I quote the above fragment and to my surprise I noticed it's no longer present in the FAQ. That text is, in fact, still present on the FAQ page in question: https://www.unicode.org/faq/vs.html#18 So my question are: 1. Does the change mean the change of the official policy of the Consortium? Your premise here, however, is mistaken. The FAQ pages do *not*, and never have represented official policy of the Unicode Consortium. The individual FAQ entries are contributed by many people -- some attributed, and some not. They are updated or added to periodically by various editors, in response to feedback, or as old entries grow out-dated, or new issues arise. Those updates are editorial, and do not reflect any official decision process by Unicode technical committees or officers. The FAQ main page itself points out that "The FAQs are contributed by many people," and invites the public to submit possible new entries for editing and addition to the list of FAQs. For official technical content, refer to the published technical specifications themselves, which are carefully controlled, versioned, and archived. For official policies of the Unicode Consortium, refer to the Unicode Consortium policies page, which is also carefully controlled: https://www.unicode.org/policies/policies.html 2. Are the archival versions of the FAQ available somewhere? 
https://web.archive.org/web/*/https://www.unicode.org/faq/ 3. Are the changes to the FAQ documented somehow (a version control system?)? No. --Ken
Re: Bidi paragraph direction in terminal emulators
Egmont, On 2/9/2019 11:48 AM, Egmont Koblinger via Unicode wrote: Are there any (non-CJK) scripts for which crossword puzzles don't exist? There are crossword puzzles for Hindi (in the Devanagari script). Just do an image search for "Hindi crossword puzzle". But the conventions for these break up words into syllables fitting into the boxes, and the rules for that are complex. You have to allow for the placement of dependent vowels, which may take up extra space left or right, as well as consonant clusters, which would be expressed often as conjuncts in Sanskrit, but which in Hindi are more commonly rendered as dead consonant sequences. So the "stuff in a box" is: 1. Inherently proportional width. 2. Inherently multi-character in content. (underlying 1 to 3 or more characters per cell) This is the kind of compromise you would have to make for almost any Indic script, to enable a rational approach to building crossword puzzles that make sense. And in a terminal context, you probably would not get acceptable behavior for Hindi if you tried to just take all the "stuff in a box" chunks and tried to lay them out directly in a line, as if the script behaved more like CJK. The existence proof of techniques to cut up text into syllables that enable crossword puzzle building, is not the same as a determination that the script, ipso facto, would work in a terminal context without dealing with additional complex script issues. At any rate, this is once again straying over into the issue of whether terminals can be adapted for the requirements of shaping rules for complex scripts -- rather than the nominal subject of the thread, which has to do with bidi text layout in terminals. --Ken
Re: Proposal for BiDi in terminal emulators
Richard, On 2/1/2019 1:30 PM, Richard Wordingham via Unicode wrote: Language tagging is already available in Unicode, via the tag characters in the deprecated plane. Recte: 1. Plane 14 is not a "deprecated plane". 2. The tag characters in Tag Character block (U+E0000..U+E007F) are not deprecated. (They are used, for example, by UTS #51 to specify emoji tag sequences.) 3. However, the use of U+E0001 LANGUAGE TAG and the mechanism of using tag characters for spelling out language tags are explicitly deprecated by the standard. See: "Deprecated Use for Language Tagging" in Section 23.9 Tag Characters. https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf#G30427 and PropList.txt: E0001 ; Deprecated # Cf LANGUAGE TAG As I stated earlier: language tags should use BCP 47, and belong in the markup level, not in the plain text stream. --Ken
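A short sketch of the facts in the three points above, using Python's standard `unicodedata` module. The helper name `is_tag_character` is hypothetical, and the block range is hard-coded as an assumption (the Tags block occupies U+E0000..U+E007F). Python's `unicodedata` does not expose the Deprecated property, so the sketch only confirms the name and general category of U+E0001.

```python
import unicodedata

TAGS_BLOCK = range(0xE0000, 0xE0080)  # U+E0000..U+E007F, in Plane 14


def is_tag_character(ch: str) -> bool:
    """Return True if `ch` lies in the Tags block."""
    return ord(ch) in TAGS_BLOCK


language_tag = "\U000E0001"
cancel_tag = "\U000E007F"

print(unicodedata.name(language_tag), unicodedata.category(language_tag))
print(unicodedata.name(cancel_tag), unicodedata.category(cancel_tag))
print(is_tag_character(language_tag), is_tag_character("a"))
```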
Re: Proposal for BiDi in terminal emulators
On 1/31/2019 1:41 AM, Egmont Koblinger via Unicode wrote: I mean, for example we can introduce control characters that specify the language. That is a complete non-starter for the Unicode Standard. And if the terminal implementation introduces such as one-off hacks, they will fail completely for interoperability. https://en.wikipedia.org/wiki/IETF_language_tag That belongs to the markup level, not to the plain text stream. --Ken
Re: A last missing link for interoperable representation
James, On 1/8/2019 1:11 PM, James Kass via Unicode wrote: But we're still using typewriter kludges to represent stress in Latin script because there is no Unicode plain text solution. O.k., that one needs a response. We are still using kludges to represent stress in the Latin script because *orthographies* for most languages customarily written with the Latin script don't have clear conventions for indicating stress as a part of the orthography. When an orthography has a well-developed convention for indicating stress, then we can look at how that convention is represented in the plain text representation of that orthography. An obvious case is notational systems for the representation of pronunciation of English words in dictionaries. Those conventions *do* then have plain text representations in Unicode, because, well, they just have various additional characters and/or combining marks to clearly indicate lexical stress. But standard written English orthography does *not*. (BTW, that is in part because marking stress in written English would usually *decrease* legibility and the usefulness of the writing, rather than improving it.) Furthermore, there is nothing inherent about *stress* per se in the Latin script (or any other script, for that matter). Lexical stress is a phonological system, not shared or structured the same way in all languages. And there are *thousands* of languages written with the Latin script -- with all kinds of phonological systems associated with them. Some have lexical tones, some do not. Some have other kinds of phonological accentuation systems that don't count as lexical stress, per se. And there are differences between lexical stress (and its indication), and other kinds of "stress". Contrastive stress, which is way more interesting to consider as a part of writing, IMO, than lexical stress, is a *prosodic* phenomenon, not a lexical one. 
(And I have been using the email convention of asterisks here to indicate contrastive stress in multiple instances.) And contrastive stress is far from the only kind of communicatively significant pitch phenomenon in speech that typically isn't formally represented in standard orthographies. There are numerous complex scoring systems for linguistic prosody that have been developed by linguists interested in those phenomena -- which include issues of pace and rhythm, and not merely pitch contours and loudness. It isn't the job of the Unicode Consortium or the Unicode Standard to sort that stuff out or to standardize characters to represent it. When somebody brings to the UTC written examples of established orthographies using character conventions that cannot be clearly conveyed in plain text with the Unicode characters we already have, *then* perhaps we will have something to talk about. --Ken
Re: The encoding of the Welsh flag
Michael, On 11/21/2018 9:38 AM, Michael Everson via Unicode wrote: What really annoys me about this is that there is no flag for Northern Ireland. The folks at CLDR did not think to ask either the UK or the Irish representatives to SC2 about this. Neither CLDR-TC nor SC2 has any jurisdiction here, so this is rather non sequitur. If you or Andrew West or anyone else is interested in pursuing an emoji tag sequence for an emoji flag for Northern Ireland, then that should be done by submitting a proposal, with justification, to the Emoji Subcommittee, which *does* have jurisdiction. https://unicode.org/emoji/proposals.html See in particular, Section M of the selection criteria. --Ken
Re: The encoding of the Welsh flag
On 11/21/2018 8:00 AM, William_J_G Overington via Unicode wrote: Yet the interoperability does not derive from an International Standard. The interoperability that enabled your mail to be delivered to me derives in part from the MIME standard (RFC 2045 et seq.) which is not an International Standard, but is instead maintained by the Networking Working Group of IETF. The interoperability that enabled me to read the content of your mail derives from the HTML standard, which is not an International Standard, but is instead maintained by the W3C (a consortium). The interoperability of any flag emoji embedded in that content derives from Unicode Technical Standard #51, which is not an International Standard, but is instead maintained by the Unicode Consortium. These standards are all widely used *internationally*, but they are not an International Standard, which is effectively a moniker claimed by ISO for itself and its standards. But in this day and age, expecting all technology, including technology related to computational processing, distribution, interchange, and rendering of text, to wait around for any related standard to be canonized as an International Standard is just silly. The world of technology does not work that way, and frankly, folks should be damn glad that it doesn't. --Ken
Re: The encoding of the Welsh flag
On 11/20/2018 12:57 PM, William_J_G Overington via Unicode wrote: quote A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. end quote My questions are as follows please. Is that encoding for the Welsh flag included in both The Unicode Standard and ISO/IEC 10646 or is it only encoded in The Unicode Standard or is it in neither The Unicode Standard nor ISO/IEC 10646? Neither. A flag emoji is represented via a character sequence -- in this particular case by an emoji tag sequence, as specified in UTS #51. The representation of flag emoji via emoji tag sequences is *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. If you find that hard to understand, consider another example. The spelling of the word "emoji" as the sequence of Unicode characters <0065, 006D, 006F, 006A, 0069> is also *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. Neither standard specifies English spelling rules; nor does either standard specify flag emoji "spelling rules". Unless the answer is the first listed possibility, how does that work as regards interoperability of sending and receiving a Welsh flag on an electronic communication system? One declares conformance to UTS #51 and declares the version of emoji that one's application supports -- including the RGI (recommended for general interchange) list of emoji one has input and display support for. If the declaration states support for the flags of England, Scotland, and Wales, then one must do so via the specified emoji tag sequences. Your interoperability derives from that. --Ken
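As an illustration of the emoji tag sequence mentioned above, here is a minimal sketch in Python. It assumes the UTS #51 layout of a flag tag sequence (U+1F3F4 as the base, tag characters that mirror ASCII at an offset of U+E0000, and U+E007F CANCEL TAG as the terminator); the helper name `subdivision_flag` is hypothetical and not part of any library.

```python
# Sketch only: assemble an emoji tag sequence for a subdivision flag.
BLACK_FLAG = "\U0001F3F4"  # U+1F3F4 WAVING BLACK FLAG, the tag base
CANCEL_TAG = "\U000E007F"  # U+E007F CANCEL TAG, ends the sequence
TAG_OFFSET = 0xE0000       # tag characters mirror ASCII at this offset


def subdivision_flag(subdivision_code: str) -> str:
    """Hypothetical helper: 'gbwls' -> the tag sequence for the flag of Wales."""
    code = subdivision_code.lower()
    if not (code.isascii() and code.isalnum()):
        raise ValueError("expected ASCII letters and digits only")
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in code)
    return BLACK_FLAG + tags + CANCEL_TAG


wales = subdivision_flag("gbwls")
print(" ".join(f"U+{ord(c):04X}" for c in wales))
```

Whether the sequence displays as a flag depends on the receiving application's declared emoji support, as described in the message above; the code only builds the character sequence.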
Re: UCA unnecessary collation weight 0000
On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: I was replying not about the notational representation of the DUCET data table (using [....] unnecessarily) but about the text of UTR#10 itself. Which remains highly confusive, and contains completely unnecessary steps, and just complicates things with absolutely no benefit at all by introducing confusion about these "0000". Sorry, Philippe, but the confusion that I am seeing introduced is what you are introducing to the unicode list in the course of this discussion. UTR#10 still does not explicitly state that its use of "0000" does not mean it is a valid "weight", it's a notation only No, it is explicitly a valid weight. And it is explicitly and normatively referred to in the specification of the algorithm. See UTS10-D8 (and subsequent definitions), which explicitly depend on a definition of "A collation weight whose value is zero." The entire statement of what are primary, secondary, tertiary, etc. collation elements depends on that definition. And see the tables in Section 3.2, which also depend on those definitions. (but the notation is used for TWO distinct purposes: one is for presenting the notation format used in the DUCET It is *not* just a notation format used in the DUCET -- it is part of the normative definitional structure of the algorithm, which then percolates down into further definitions and rules and the steps of the algorithm. itself to present how collation elements are structured, the other one is for marking the presence of a possible, but not always required, encoding of an explicit level separator for encoding sort keys). That is a numeric value of zero, used in Section 7.3, Form Sort Keys. 
It is not part of the *notation* for collation elements, but instead is a magic value chosen for the level separator precisely because zero values from the collation elements are removed during sort key construction, so that zero is then guaranteed to be a lower value than any remaining weight added to the sort key under construction. This part of the algorithm is not rocket science, by the way! UTR#10 is still needlessly confusive. O.k., if you think so, you then know what to do: https://www.unicode.org/review/pri385/ and https://www.unicode.org/reporting.html Even the example tables can be made without using these "0000" (for example in tables showing how to build sort keys, it can present the list of weights split in separate columns, one column per level, without any "0000"). The implementation does not necessarily have to create a buffer containing all weight values in a row, when separate buffers for each level are far superior (and even more efficient as it can save space in memory). The UCA doesn't *require* you to do anything particular in your own implementation, other than come up with the same results for string comparisons. That is clearly stated in the conformance clause of UTS #10. https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance The step "S3.2" in the UCA algorithm should not even be there (it is made in favor of a specific implementation which is not even efficient or optimal), That is a false statement. Step S3.2 is there to provide a clear statement of the algorithm, to guarantee correct results for string comparison. Section 9 of UTS #10 provides a whole lunch buffet of techniques that implementations can choose from to increase the efficiency of their implementations, as they deem appropriate. You are free to implement as you choose -- including techniques that do not require any level separators. 
You are, however, duly warned in: https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators that "While this technique is relatively easy to implement, it can interfere with other compression methods." it complicates the algorithm with absolutely no benefit at all); you can ALWAYS remove it completely and this still generates equivalent results. No you cannot ALWAYS remove it completely. Whether or not your implementation can do so, depends on what other techniques you may be using to increase performance, store shorter keys, or whatever else may be at stake in your optimization. If you don't like zeroes in collation, be my guest, and ignore them completely. Take them out of your tables, and don't use level separators. Just make sure you end up with conformant results for comparison of strings when you are done. And in the meantime, if you want to complain about the text of the specification of UTS #10, then provide carefully worded alternatives as suggestions for improvement to the text, rather than just endlessly ranting about how the standard is confusive because the collation weight is "unnecessary". --Ken
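The sort key construction discussed above (zero weights dropped, a zero level separator between levels) can be sketched in Python as follows. This is a simplified illustration of the idea, not the normative text of UTS #10; the function name `form_sort_key` is hypothetical, and the weights in the example are made up for illustration rather than taken from the DUCET.

```python
# Sketch only: level-by-level sort key formation with a zero level separator.
LEVEL_SEPARATOR = 0x0000


def form_sort_key(collation_elements, levels=3):
    """Build a sort key from (primary, secondary, tertiary) weight tuples."""
    key = []
    for level in range(levels):
        if level != 0:
            key.append(LEVEL_SEPARATOR)  # separates one level from the next
        for element in collation_elements:
            weight = element[level]
            if weight != 0:              # zero weights are not copied into the key
                key.append(weight)
    return key


# Illustrative weights only (assumed values, not DUCET data).
LETTER_A = (0x1C47, 0x0020, 0x0002)
COMBINING_ACUTE = (0x0000, 0x0024, 0x0002)  # primary weight is zero

key_a = form_sort_key([LETTER_A])
key_a_acute = form_sort_key([LETTER_A, COMBINING_ACUTE])
print(key_a < key_a_acute)
```

Because zero never survives from a collation element into the key, the separator compares lower than any weight that follows it, which is what makes the shorter key sort first here. An implementation that keeps one buffer per level, as suggested in the thread, would need to produce the same comparison results.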
Re: A sign/abbreviation for "magister"
On 10/30/2018 2:32 PM, James Kass via Unicode wrote: but we can't seem to agree on how to encode its abbreviation. For what it's worth, "mgr" seems to be the usual abbreviation in Polish for it. --Ken
Re: A sign/abbreviation for "magister"
On 10/29/2018 8:06 PM, James Kass via Unicode wrote: could be typed on old-style mechanical typewriters. Quintessential plain-text, that. Nope. Typewriters were regularly used for underscoring and for strikethrough, both of which are *styling* of text, and not plain text. The mere fact that some visual aspect of graphic representation on a page of paper can be implemented via a mechanical typewriter does not, ipso facto, mean that particular feature is plain text. The fact that I could also implement superscripting and subscripting on a mechanical typewriter via turning the platen up and down half a line, also does not make *those* aspects of text styling plain text, either. The same reasoning applies to handwriting, only more so. --Ken
Re: Dealing with Georgian capitalization in programming languages
Martin, On 10/9/2018 12:47 AM, Martin J. Dürst via Unicode wrote: - Using the 'capitalize' method to (try to) get the titlecase property of a MTAVRULI character. (There's no other way currently in Ruby to get the titlecase property.) There may be others. If you have some ideas, I'd appreciate knowing about them. This makes me wonder why the UTC didn't simply declare the titlecase property of MTAVRULI to be mkhedruli. Was this considered or not? The way things are currently set up, there seems to be no benefit of MTAVRULI being its own titlecase, because in actual use, that requires additional processing. Titlecasing for Georgian was not completely thought through before Mtavruli was added. As I noted in my earlier comment on this thread, the titlecase mapping values for Mkhedruli were added late in the process, when it became clear that not doing so would result in inappropriate outcomes for existing Mkhedruli text. I don't think there is a fully-worked out position on this, but adding a Simple_Titlecase mapping for Mtavruli to Mkhedruli would, I suspect, just further muddy waters for implementers, because it would be in effect saying that an uppercase letter titlecases by shifting to its lowercase mapping. A headscratcher, at the very least. Note that with the current mappings as they are, Changes_When_Titlecased is False for all Mkhedruli and for all Mtavruli characters, which I think is the desired state of affairs. A titlecasing string operation of Mtavruli that does something other than just leave the string alone should, IMO, be documented as doing something extra and *should* have to do additional processing. --Ken
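The mappings described above can be inspected from Python's built-in string methods, as in the following sketch. It assumes a Python whose bundled Unicode data is version 11.0 or later (older builds know nothing about Mtavruli), and it uses `str.title()` on a single character as a stand-in for the titlecase mapping; the helper name `case_report` is hypothetical.

```python
# Sketch only: look at the case mappings of one Mkhedruli/Mtavruli pair.
import unicodedata

MKHEDRULI_AN = "\u10D0"  # GEORGIAN LETTER AN
MTAVRULI_AN = "\u1C90"   # GEORGIAN MTAVRULI CAPITAL LETTER AN


def case_report(ch: str) -> dict:
    """Hypothetical helper: the lower/upper/title results for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "lower": f"U+{ord(ch.lower()):04X}",
        "upper": f"U+{ord(ch.upper()):04X}",
        "title": f"U+{ord(ch.title()):04X}",
    }


for ch in (MKHEDRULI_AN, MTAVRULI_AN):
    print(case_report(ch))
```

With Unicode 11.0 data, both characters should report themselves as their own titlecase form, which is the "Changes_When_Titlecased is False" state of affairs described in the message.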
Re: Dealing with Georgian capitalization in programming languages
On 10/2/2018 12:45 AM, Martin J. Dürst via Unicode wrote: capitalize: uppercase (or title-case) the first character of the string, lowercase the rest When I say "cause problems", I mean producing mixed-case output. I originally thought that 'capitalize' would be fine. It is fine for lowercase input: It stays lowercase because Unicode Data indicates that titlecase for lowercase Georgian letters is the letter itself. But it will produce the apparently undesirable Mixed Case for ALL UPPERCASE input. My questions here are: - Has this been considered when Georgian Mtavruli was discussed in the UTC? Not explicitly, that I recall. The whole issue of titlecasing came up very late in the preparation of case mapping tables for Mtavruli and Mkhedruli for 11.0. But it seems to me that the problem you are citing can be avoided if you simply rethink what your "capitalize" means. It really should be conceived of as first lowercasing the *entire* string, and then titlecasing the *eligible* letters -- i.e., usually the first letter. (Note that this allows for the concept that titlecasing might then be localized on a per-writing-system basis -- the issue would devolve to determining what the rules are for "eligible" letters.) But the simple default would just be to titlecase the initial letter of each "word" segment of a string. Note that conceived this way, for the Georgian mappings, where the titlecase mapping for Mkhedruli is simply the letter itself, this approach ends up with:
capitalize(mkhedrulistring) --> mkhedrulistring
capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> mkhedrulistring
Thus avoiding any mixed case. --Ken
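A minimal Python sketch of the "lowercase everything first, then titlecase the eligible letter" reading of capitalize follows. The function names are hypothetical, "eligible" is simplified to "the first character", and the Georgian results assume a Python with Unicode 11.0 or later data.

```python
# Sketch only: two ways to define "capitalize" for a single word.
def capitalize_naive(word: str) -> str:
    """Uppercase the first character, lowercase the rest."""
    return word[:1].upper() + word[1:].lower()


def capitalize_lower_then_title(word: str) -> str:
    """Lowercase the whole string first, then titlecase the first character."""
    lowered = word.lower()
    return lowered[:1].title() + lowered[1:]


MKHEDRULI = "\u10D0\u10D1\u10D2"  # three lowercase Georgian letters
MTAVRULI = "\u1C90\u1C91\u1C92"   # the same letters in Mtavruli

for text in ("hELLO", MKHEDRULI, MTAVRULI):
    print(ascii(capitalize_naive(text)), ascii(capitalize_lower_then_title(text)))
```

The naive version yields an Mtavruli letter followed by Mkhedruli letters, which is the mixed case the thread wants to avoid; the second version leaves Georgian text entirely in Mkhedruli while still producing "Hello" for Latin input.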
Re: UCD in XML or in CSV?
On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote: For codepoints.net I use that data to stuff everything in a MySQL database. Well, for some sense of "everything", anyway. ;-) People having this discussion should keep in mind a few significant points. First, the UCD proper isn't "everything", extensive as it is. There are also other significant sets of data that the UTC maintains about characters in other formats, as well, including the data files associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping, etc.), UTS #10 (collation), UTR #25 (a set of math-related property values), and UTS #51 (emoji-related). The emoji-related data has now strayed into the CLDR space, so a significant amount of the information about emoji characters is now carried as CLDR tags. And then there is various other information about individual characters (or small sets of characters) scattered in the core spec -- some in tables, some not, as well as mappings to dozens of external standards. There is no actual definition anywhere of what "everything" actually is. Further, it is a mistake to assume that every character property just associates a simple attribute with a code point. There are multiple types of mappings, complex relational and set properties, and so forth. The UTC attempts to keep a fairly clear line around what constitutes the "UCD proper" (including Unihan.zip), in part so that it is actually possible to run the tools that create the XML version of the UCD, for folks who want to consume a more consistent, single-file format version of the data. But be aware that that isn't everything -- nor would there be much sense in trying to keep expanding the UCD proper to actually represent "everything" in one giant DTD. Second, one of the main obligations of a standards organization is *stability*. People may well object to the ad hoc nature of the UCD data files that have been added over the years -- but it is a *stable* ad-hockery. 
The worst thing the UTC could do, IMO, would be to keep tweaking formats of data files to meet complaints about one particular parsing inconvenience or another. That would create multiple points of discontinuity between versions -- worse than just having to deal with the ongoing growth in the number of assigned characters and the occasional addition of new data files and properties to the UCD. Keep in mind that there is more to processing the UCD than just "latest". People who just focus on grabbing the very latest version of the UCD and updating whatever application they have are missing half the problem. There are multiple tools out there that parse and use multiple *versions* of the UCD. That includes the tooling that is used to maintain the UCD (which parses *all* versions), and the tooling that creates UCD in XML, which also parses all versions. Then there is tooling like unibook, to produce code charts, which also has to adapt to multiple versions, and bidi reference code, which also reads multiple versions of UCD data files. Those are just examples I know off the top of my head. I am sure there are many other instances out there that fit this profile. And none of the applications already built to handle multiple versions would welcome having to permanently build in tracking particular format anomalies between specific versions of the UCD. Third, please remember that folks who come here complaining about the complications of parsing the UCD are a very small percentage of a very small percentage of a very small percentage of interested parties. Nearly everybody who needs UCD data should be consuming it as a secondary source (e.g. for reference via codepoints.net), or as a tertiary source (behind specialized API's, regex, etc.), or as an end user (just getting behavior they expect for characters in applications). 
Programmers who actually *need* to consume the raw UCD data files and write parsers for them directly should actually be able to deal with the format complexity -- and, if anything, slowing them down to make them think about the reasons for the format complexity might be a good thing, as it tends to put the lie to the easy initial assumption that the UCD is nothing more than a bunch of simple attributes for all the code points. --Ken
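For readers who do need to read the raw files, here is a hedged Python sketch of a parser for the common "code point or range ; value # comment" line layout used by files such as PropList.txt. The function name `parse_ucd_lines` is hypothetical, the sample lines are written in that layout for illustration rather than copied verbatim from any UCD version, and not every UCD file follows this layout (UnicodeData.txt, for instance, marks ranges with First/Last line pairs instead).

```python
# Sketch only: parse "XXXX..YYYY ; field ; field # comment" style lines.
from typing import Iterable, Iterator, List, Tuple


def parse_ucd_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (first_code_point, last_code_point, fields) for each data line."""
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue                         # blank or comment-only line
        fields = [field.strip() for field in data.split(";")]
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        yield start, end, fields[1:]


SAMPLE = """\
# Illustrative lines in the PropList.txt layout (not a verbatim excerpt)
200C..200D    ; Join_Control # Cf   [2] ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
0BCD          ; Diacritic # Mn       TAMIL SIGN VIRAMA
""".splitlines()

for start, end, fields in parse_ucd_lines(SAMPLE):
    print(f"{start:04X}..{end:04X}", fields)
```

A parser like this only covers simple code-point-to-value properties; mappings, set-valued properties, and the per-file and per-version differences mentioned above need their own handling.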
Re: Private Use areas
On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: Is there a block of RTL PUA also? No. Perhaps there should be? This is a periodic suggestion that never goes anywhere--for good reason. (You can search the email archives and see that it keeps coming up.) Presuming that this question was asked in good faith... What about designating a part of the PUA to have a specific property? The problem with that is that assigning *any* non-default property to any PUA code point would break existing implementations' assumptions about PUA character properties and potentially create havoc with existing use. Only certain properties matter enough: That is an un-demonstrated assertion that I don't think you have thought through sufficiently. * wide * RTL RTL is not some binary counterpart of LTR. There are 23 values of Bidi_Class, and anyone who wanted to implement a right-to-left script in PUA might well have to make use of multiple values of Bidi_Class. Also, there are two major types of strong right-to-leftness: Bidi_Class=R and Bidi_Class=AL. Should a "RTL PUA" zone favor Arabic type behavior or non-Arabic type behavior? * combining Also not a binary switch. Canonical_Combining_Class is a numeric value, and any value but ccc=0 for a PUA character would break normalization. Then for the General_Category, there are three types of "marks" that count as combining: gc=Mn, gc=Mc, gc=Me. Which of those would be favored in any PUA assignment? as most others are better represented in the font itself. Really? Suppose someone wants to implement a bicameral script in PUA. They would need case mappings for that, and how would those be "better represented in the font itself"? Or how about digits? Would numeric values for digits be "better represented in the font itself"? How about implementation of punctuation? 
Would segmentation properties and behavior be "better represented in the font itself"? This could be done either by parceling one of existing PUA ranges: planes 15 and 16 are virtually unused thus any damage would be negligible; That is simply an assertion -- and not the kind of assertion that the UTC tends to accept on spec. I rather suspect that there are multiple participants on this email list, for example, who *do* have implementations making extensive use of Planes 15/16 PUA code points for one thing or another. or perhaps by allocating a new range elsewhere. See: https://www.unicode.org/policies/stability_policy.html The General_Category property value Private_Use (Co) is immutable: the set of code points with that value will never change. That guarantee has been in place since 1996, and is a rule that binds the UTC. So nope, sorry, no more PUA ranges. Meow! Grrr! ;-) As I see it, the only feasible way for people to get specialized behavior for PUA ranges involves first ceasing to assume that somehow they can jawbone the UTC into *standardizing* some ranges for some particular use or another. That simply isn't going to happen. People who assume this is somehow easy, and that the UTC are a bunch of boneheads who stand in the way of obvious solutions, do not -- I contend -- understand the complicated interplay of character properties, stability guarantees, and implementation behavior baked into system support libraries for the Unicode Standard. The way forward for folks who want to do this kind of thing is:
1. Define a *protocol* for reliable interchange of custom character property information about PUA code points.
2. Convince more than one party to actually *use* that protocol to define sets of interchangeable character property definitions.
3. Convince at least one implementer to support that protocol to create some relevant interchangeable *behavior* for those PUA characters.
And if the goal for #3 is to get some *system* implementer to support the protocol in widespread software, then before starting any of #1, #2, or #3, you had better start instead with: 0. Create a consortium (or other ongoing organization) with a 10-year time horizon and participation by at least one major software implementer, to define, publicize, and advocate for support of the protocol. (And if you expect a major software implementer to participate, you might need to make sure you have a business case defined that would warrant such a 10-year effort!) --Ken
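The default property values that existing implementations assume for PUA code points can be inspected with Python's standard `unicodedata` module, as in this sketch. The three ranges listed are the BMP private use area and the private use code points of planes 15 and 16; the helper name `pua_defaults` is hypothetical.

```python
# Sketch only: report default properties at the edges of each PUA range.
import unicodedata

PUA_RANGES = [
    (0xE000, 0xF8FF),      # BMP Private Use Area
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]


def pua_defaults(code_point: int) -> tuple:
    """Return (General_Category, Bidi_Class, Canonical_Combining_Class)."""
    ch = chr(code_point)
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )


for first, last in PUA_RANGES:
    print(f"U+{first:04X}", pua_defaults(first), f"U+{last:04X}", pua_defaults(last))
```

Every code point in these ranges should report the same triple, which is the uniform baseline that a "RTL PUA" or "combining PUA" zone would disturb. Custom behavior for particular PUA characters has to come from an agreement outside the standard, along the lines of steps 0 through 3 above.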
Re: Private Use areas
On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: Is there a block of RTL PUA also? No. --Ken
Re: Tales from the Archives
Steffen noted: On 8/20/2018 3:22 PM, Steffen Nurpmeso via Unicode wrote: It was just that i have read on one of the mailing-lists i am subscribed to a cite of a Unicode statement that i have never read of anything on the Unicode mailing-list. It is very awkward, but i_again_ cannot find what attracted my attention, even with the help of a search machine. I think "faith alone will reveal the true name of shuruq" (1997-07-18). --steffen Fortunately, since I collect everything, this one has not been lost to the mists of history yet. So here you go, another "tale from the archives", aka "every character has a story". --Ken === From kenw Thu Sep 18 14:23 PDT 1997 Date: Thu, 18 Sep 1997 14:20:29 -0700 From: kenw (Kenneth Whistler) Message-Id: <9709182120.aa16...@birdie.sybase.com> To: unicode@unicode.org Subject: War over 'shuruq' narrowly averted Cc: kenw Dateline: Geneva, Thursday, September 18, 1997 The ISOnominalists and the SInominalists met today at the bargaining table in their long-running dispute over whether the correct name of U+05BC should be: HEBREW POINT DAGESH OR MAPIQ (shuruq) or HEBREW POINT DAGESH OR MAPIQ OR SHURUQ After considerable posturing and threats by both sides, opposing camps reluctantly agreed that a compromise solution was preferable to open flamewar. Unnamed sources state that the new name to be revealed in a press conference this evening is: HEBREW POINT DAGESH OR MAPIQ (or shuruq) Both sides have also now agreed to focus their attention jointly at countering the antinomianist camp, which claims that no names can be imposed by human moral strictures, and that faith alone will reveal the true name of shuruq. =
Re: Tales from the Archives
Steffen, Are you looking for the Unicode list email archives? https://www.unicode.org/mail-arch/ Those contain list content going back all the way to 1994. --Ken On 8/20/2018 6:08 AM, Steffen Nurpmeso via Unicode wrote: I have the impression that many things which have been posted here some years ago are now only available via some Forums or other browser based services. What is posted here seems to be mostly a duplicate of the blog only.
Re: UAX #9: applicability of higher-level protocols to bidi plaintext
On 7/18/2018 6:43 AM, philip chastney via Unicode wrote: there are also contexts where "Hello World!" can be read as the function "Hello", applied to the factorial value of "World" even though such a move wouldn't necessarily remove all ambiguity, the easiest solution is to declare that formal notations cannot be "plain" text Of course they can -- and (usually) should be, as they are designed that way. To state otherwise would just create headaches for designing parsers for formal notations. I think you are confusing ambiguity of *interpretation* of bits of formal notation, taken out of context, with ambiguity of *display* of formal notations in contexts where one does not know and control the paragraph directionality. The easiest (and correct) solution, when displaying formal notation for visual interpretation by human readers, is to use tools where one knows and can rely on the paragraph directionality explicitly, so that Unicode bidi doesn't add an out-of-left-field set of display conundrums, as it were, for bidi edge cases that can result in *mis*interpretation by the reader. In other words, if I am trying to read C program text or regex expressions, I expect that my tooling is not going to silently assume a RTL paragraph directional context and present me with visual garbage to interpret, forcing me to reverse engineer the bidi algorithm in my head, just to read the text. Why would I put up with that? --Ken
Re: UAX #9: applicability of higher-level protocols to bidi plaintext
On 7/16/2018 3:51 PM, Shai Berger via Unicode wrote: And I should add, in response to the other points raised in this thread, from the same page in the core standard: "If the same plain text sequence is given to disparate rendering processes, there is no expectation that rendered text in each instance should have the same appearance. Instead, the disparate rendering processes are simply required to make the text legible according to the intended reading." That paragraph ends with the following summary, emphasized in the source: Plain text must contain enough information to permit the text to be rendered legibly, and nothing more. The last answer in http://www.unicode.org/faq/bidi.html violates this dictum, as I have shown here with different examples. As long as it stands, the Unicode standard fails its own criteria. I've been trying to follow your reasoning in this long thread, but am still not finding much to convince me that there is anything wrong in the #bidi8 FAQ entry that you keep claiming is wrong. First, for your "Hello, world!" example, in a rendering that imposes a RTL directional context, the correct, conformant display of that string is: !Hello, world as you cited in your earlier example. To do otherwise, would represent a *non*-conformant implementation of the UBA. So your complaint seems to boil down to the claim that if you transmit "Hello, world!" to a process which then renders it conformantly according to the Unicode Standard (including UBA), then that process must somehow know *and honor* your intent that it display in a LTR directional context. That information, however, is explicitly *not* contained in the plain text string there, and has to be conveyed by means of a higher-level protocol. (E.g. HTML markup as dir="ltr", etc.) If the receiving process, by whatever means, has raised its hand and says, effectively, "I assume a RTL context for all text display", that is its right. You can't complain if it displays your "Hello, world!" 
as shown above. Well, you *can* complain, but you wouldn't be correct. Basically, you and the receiving process do not share the same assumptions about the higher-level protocol involved which specifies paragraph direction. So as I see it, you are either wanting the plain text to somehow contain and enforce upon the renderer your assumption about the directional context that it should be displayed in, OR, you are just unhappy about the bidirectional rendering conundrums of some edge cases for the UBA. In either case, the remedy is the application of LTR characters to provide context (or directional isolate controls, or explicit higher-level markup). --Ken
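The remedies named above (adding LTR characters for context, or directional isolate controls) can be sketched in Python as plain string operations. The helper names are hypothetical; the code only inserts the control characters and does not render anything, so the visual result still depends on a conformant bidi implementation at the receiving end.

```python
# Sketch only: give a string explicit left-to-right context with bidi controls.
LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_ltr(text: str) -> str:
    """Wrap text so it is laid out as one left-to-right isolated unit."""
    return f"{LRI}{text}{PDI}"


def with_trailing_lrm(text: str) -> str:
    """Append LRM so trailing punctuation sits between two strong LTR characters."""
    return text + LRM


greeting = "Hello, world!"
for variant in (greeting, with_trailing_lrm(greeting), isolate_ltr(greeting)):
    print(ascii(variant))
```

In a paragraph that a renderer treats as right-to-left, the unmodified string leaves the final "!" to take the paragraph direction, which is the "!Hello, world" display discussed above. The other two variants carry the sender's intent inside the text itself instead of relying on a shared higher-level protocol.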
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On 5/29/2018 12:49 AM, Richard Wordingham via Unicode wrote: How would one know that they are misapplied? And what if the author of the text has broken your rules? Are such texts never to be transcribed to pukka Unicode? Applying Tamil -ii (0BC0, Script=Tamil) to the Latin letter a (0061, Script=Latin) doesn't automatically make the Tamil vowel "inherit" the Latin script property value, nor should it. That said, if someone decides they want that sequence, and their text has "broken my rules", so be it. I'm just not going to assume anything particular about that text. Note that in terms of trying to determine whether such a string is (naively) alphabetic, such a sequence doesn't interfere with the determination. On the other hand, a process concerned about text runs, script assignment, validity for domains, or other such issues *will* be sensitive to such a boundary -- and should not be overruled by some generic determination that combining marks inherit all the properties of their base. Even without knowing exactly what is wanted, it looks to me as though it isn't. If he wants to allow as a substring, which he should, then that fails because there is no overlap between \p{extender} and \p{gc=Cf} or between \p{diacritic} and \p{gc=Cf}. Yes, so if you are working with strings for Indic scripts (or for that matter, Arabic), you add Join_Control to the mix: Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control gets you a decent approximation of what is (naively) expected to fall within an "alphabetic" string for most scripts. For those following along, Alphabetic is roughly meant to cover the ABC, かきくけこ,... plus ideographic elements of most scripts. Diacritic picks up most of the applied combining marks, including nuktas, viramas, and tone marks. Extender picks up spacing elements that indicate length, reduplication, iteration, etc. And joiners are, well, joiners. 
If one wants finer categorization specifically for Indic scripts, then I would suggest turning to the Indic_Syllabic_Category property instead of a union of PropList.txt properties and/or some twiddling with General_Category values. --Ken
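The union Alphabetic ∪ Diacritic ∪ Extender ∪ Join_Control can be sketched in Python as below. Python's standard library does not expose these properties directly, so this sketch makes two loud assumptions: Alphabetic is approximated from General_Category (L* and Nl) plus a tiny hand-picked Other_Alphabetic subset, and the Diacritic, Extender, and Join_Control sets are equally tiny hand-picked subsets. A real implementation would load the full sets from PropList.txt and DerivedCoreProperties.txt. All names here are hypothetical.

```python
# Sketch only: naive "is this string alphabetic?" test using a property union.
import unicodedata

# Hand-picked subsets for illustration; the real properties are far larger.
OTHER_ALPHABETIC_SAMPLE = {0x0BBE, 0x0BBF, 0x0BC0}  # some Tamil vowel signs
DIACRITIC_SAMPLE = {0x093C, 0x094D, 0x0BCD}  # Devanagari nukta, virama; Tamil pulli
EXTENDER_SAMPLE = {0x00B7, 0x0640, 0x3005}   # middle dot, tatweel, iteration mark
JOIN_CONTROL = {0x200C, 0x200D}              # ZWNJ, ZWJ

LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTRA = DIACRITIC_SAMPLE | EXTENDER_SAMPLE | JOIN_CONTROL


def is_alphabetic_approx(ch: str) -> bool:
    """Approximate the Alphabetic property (assumption, see lead-in)."""
    return (unicodedata.category(ch) in LETTER_CATEGORIES
            or ord(ch) in OTHER_ALPHABETIC_SAMPLE)


def is_naively_wordlike(text: str) -> bool:
    """True if every character falls in the union of the four sets."""
    return bool(text) and all(
        is_alphabetic_approx(ch) or ord(ch) in EXTRA for ch in text
    )


tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # the word "Tamil", ending in a pulli
print(all(is_alphabetic_approx(ch) for ch in tamil), is_naively_wordlike(tamil))
```

The Alphabetic-only check rejects the word because of the final pulli, while the union accepts it. For anything beyond a naive test, the Indic_Syllabic_Category property or a word break iterator is the better tool.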
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On 5/28/2018 9:44 PM, Asmus Freytag via Unicode wrote: One of the general principles is that combining marks inherit the property of their base character. Normally, "inherited" should be the only property value for combining marks. There have been some deviations from this over the years, for various reasons, and there are some properties (such as general category) where it is necessary to recognize the character as combining, but the general principle still holds. Therefore, if you are trying to see whether a string is alphabetic, combining marks should be "transparent" to such an algorithm. Generally, good advice. But there are clear exceptions. For example, the enclosing combining marks for symbols are intended (basically) to make symbols of a sort. And many combining marks have explicit script assignments, so they cannot simply willy-nilly inherit the script of a base letter if they are misapplied, for example. This is why I recommend simply adding the Diacritic property into the mix for testing a string. That is a closer approximation to the kind of naive "Is this string alphabetic?" question that SundaraRaman was asking about -- it picks up the correct subset of combining marks to union with the set of actual isAlphabetic characters, to produce more expected results. (Including, of course, the correct classification of all the viramas, stackers, and killers, as well as picking up all the nuktas.). Folks, please examine the sets of characters for Diacritic and for Extender in: http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt to see what I'm talking about. The stuff you are looking for is already there. --Ken P.S. And please don't start an argument about the fact that a "virama" isn't really a "diacritic". We know that, too. ;-)
Re: Why is TAMIL SIGN VIRAMA (pulli) not Alphabetic?
On 5/28/2018 9:23 PM, Martin J. Dürst via Unicode wrote: Hello Sundar, On 2018/05/28 04:27, SundaraRaman R via Unicode wrote: Hi, In languages like Ruby or Java (https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)), functions to check if a character is alphabetic do that by looking for the 'Alphabetic' property (defined true if it's in one of the L categories, or Nl, or has 'Other_Alphabetic' property). When parsing Tamil text, this works out well for independent vowels and consonants (which are in Lo), and for most dependent signs (which are in Mc or Mn but have the 'Other_Alphabetic' property), but the very common pulli (VIRAMA) is neither in Lo nor has 'Other_Alphabetic', and so leads to concluding any string containing it to be non-alphabetic. This doesn't make sense to me since the Virama “◌்” is as much of an alphabetic character as any of the "Dependent Vowel" characters which have been given the 'Other_Alphabetic' property. Is there a rationale behind this difference, or is it an oversight to be corrected? I suggest submitting an error report via https://www.unicode.org/reporting.html. I haven't studied the issue in detail (sorry, just no time this week), but it sounds reasonable to give the VIRAMA the 'Other_Alphabetic' property. Please don't. This is not an error in the Unicode property assignments, which have been stable in scope for Alphabetic for some time now. The problem is in assuming that the Java or Ruby isAlphabetic() APIs, which simply report the Unicode property value Alphabetic for a character, suffice for identifying a string as somehow "wordlike". They don't. The approximation you are looking for is to add Diacritic to Alphabetic. That will automatically pull in all the nuktas and viramas/killers for Brahmi-derived scripts. It also will pull in the harakat for Arabic and similar abjads, which are also not Alphabetic in the property values. And it will pull in tone marks for various writing systems. 
For good measure, also add Extender, which will pick up length marks and iteration marks. Please do not assume that the Alphabetic property just automatically equates to "what I would write in a word". Or that it should be adjusted to somehow make that happen. It would be highly advisable to study *all* the UCD properties in more depth, before starting to report bugs in one or another simply because using a single property doesn't produce the string classification one assumes should be correct in a particular case. Of course, to get a better approximation of what actually constitutes a "word" in a particular writing system, instead of using raw property API's, one should be using a WordBreak iterator, preferably one tailored for the language in question. --Ken I'd recommend to mention examples other than Tamil in your report (assuming they exist). BTW, what's the method you are using in Ruby? If there's a problem in Ruby (which I don't think; it's just using Unicode data), then please make a bug report at https://bugs.ruby-lang.org/projects/ruby-trunk, I should be able to follow up on that. Regards, Martin.
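The "Alphabetic plus Diacritic plus Extender" approximation suggested above can be roughly sketched in Python. Python's standard unicodedata module does not expose those UCD binary properties, so this sketch substitutes General_Category values as a stand-in: letters (L*), letter numbers (Nl), and marks (Mn, Mc). That substitution is an assumption made here for illustration and is coarser than the real properties; the function names are hypothetical.

```python
import unicodedata

def is_wordlike_char(ch: str) -> bool:
    # ASSUMPTION: General_Category stands in for Alphabetic + Diacritic +
    # Extender, which the stdlib does not expose. L* and Nl cover the letters;
    # Mn and Mc cover dependent vowel signs, viramas, nuktas, and tone marks.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat in ("Nl", "Mn", "Mc")

def is_wordlike(text: str) -> bool:
    # A string counts as "wordlike" only if it is non-empty and every
    # character passes the per-character check.
    return len(text) > 0 and all(is_wordlike_char(ch) for ch in text)

# The word "Tamil" written in Tamil; it ends in U+0BCD TAMIL SIGN VIRAMA.
tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"
print(unicodedata.category("\u0BCD"))  # General_Category of the pulli
print(is_wordlike(tamil))
```

This is still only a per-character filter. As the message notes, a word break iterator tailored for the language is the better tool for finding actual words.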
Re: Major vendors changing U+1F52B PISTOL depiction from firearm to squirt gun
On 5/23/2018 8:53 AM, Abe Voelker via Unicode wrote: As a user I find it troublesome because previous messages I've sent using this character on these platforms may now be interpreted differently due to the changed representation. That aspect has me wondering if this change is in line with Unicode standard conformance requirements. The Unicode Standard publishes only *text presentation* (black and white) representative glyphs for emoji characters. And those text presentation glyphs have been quite stable in the standard. For U+1F52B PISTOL, the glyph currently published in Unicode 10.0 (and the one which will be published imminently in Unicode 11.0) is precisely the same as the glyph that was initially published nearly 8 years ago in Unicode 6.0. Care to check up on that? Unicode 6.0: https://www.unicode.org/charts/PDF/Unicode-6.0/U60-1F300.pdf Unicode 11.0: https://www.unicode.org/charts/PDF/Unicode-11.0/U110-1F300.pdf What vendors do for their colorful *emoji presentation* glyphs is basically outside the scope of the Unicode Standard. Technically, it is outside the scope even of the separate Unicode Technical Standard #51, Unicode Emoji, which specifies data, behavior, and other mechanisms for promoting interoperability and valid interchange of emoji characters and emoji sequences, but which does *not* try to constrain vendors in their emoji glyph designs. Now, sure, nobody wants their emoji for an avocado to willy-nilly turn into a completely unrelated emoji for a crying face. But many emoji are deliberately vague in their scope of denotation and connotation, and the vendors have a lot of leeway to design little images that they like and their customers like. And the Unicode Standard does not now and probably never will try to define and enforce precise semantics and usage rules for every single emoji character. Basically, it is a fool's game to be using emoji as if they were a well-defined and standardized pictographic orthography with unchanging semantics. 
If you want stable presentation of content, use a pdf document or an image. If you want stable and accurate conveyance of particular meaning -- well, write it out in the standard orthography of a particular language. If you want playful and emotional little pictographs accompanying text, well, then don't expect either stability of the images or the meaning, because that isn't how emoji work. Case in point: if you are using U+1F351 PEACH for its well-known resemblance to a bum, well, don't complain to the Unicode Consortium if a phone vendor changes the meaning of your message by redesigning its emoji glyph for U+1F351 to a cut peach slice that more resembles a smile. --Ken
Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp symbols
On 5/15/2018 2:46 PM, Markus Scherer via Unicode wrote: I am proposing the addition of 2 new characters to the Musical Symbols table: - the half-flat sign (lowers a note by a quarter tone) - the half-sharp sign (raises a note by a quarter tone) In an actual proposal, I would expect a discussion of whether you are proposing to encode established symbols, or whether you are proposing new symbols to be adopted by the community (in which case Unicode would probably wait & see if they get established). A proposal should also show evidence of usage and glyph variations. And should probably refer to the relationship between these signs and the existing: U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT which are also half-sharp or half-flat accidentals. The wiki on flat signs shows this flat with a crossbar, as well as a reversed flat symbol, to represent the half-flat. And the wiki on sharp signs shows this sharp minus one vertical bar to represent the half-sharp. So there may be some use of these signs in microtonal notation, outside of an Arabic context, as well. See: https://en.wikipedia.org/wiki/Accidental_(music)#Microtonal_notation --Ken
Re: Is the Editor's Draft public?
Henri, There is no formal concept of a public "Editor's Draft" for the Unicode core specification. This is mostly the result of the tools used for editing the core specification, which is still structured more like a book than the usual online internet specification. Currently the Unicode editors are finishing up the 11.0 core specification editing -- and the chapters for that will be available in June, 2018, as noted on the current draft of the Unicode 11.0 page. There is no Version 12.0 "Editor's Draft" right now; instead, work on the 12.0 core specification will start once the 11.0 chapters have been frozen and published. If you have feedback on the core specification, the best thing to do is simply to submit it now as part of the current 11.0 beta review, referring to the published 10.0 core specification text. If it is a small item, such as a typo, there is always the possibility that it has already been reported and fixed, of course -- but it won't hurt to report and check. Suggestions for larger changes in the text will be added to the pile for future consideration by the UTC and the editors, and likely would be taken up for the 12.0 core specification. --Ken On 4/20/2018 3:14 AM, Henri Sivonen via Unicode wrote: Thank you. I checked this review announcement (I should have said so in my email; sorry), but it leads me to https://unicode.org/versions/Unicode11.0.0/ which says the chapters will be "Available June 2018". But even if the 11.0 chapters were available, I'd expect there to exist an Editor's Draft that's now in a post-11.0 but pre-12.0 state. I guess I should just send my comments and take the risk of my concerns already having been addressed.
Re: Fwd: RFC 8369 on Internationalizing IPv6 Using 128-Bit Unicode
On 4/2/2018 7:02 PM, Philippe Verdy via Unicode wrote: We're missing the definition of "ymojis", a safer alternative to "umojis" (unknown), but that "you" can create yourself for use by yourself Not to mention "əmojis", as in "Uh, Moe! Jeez, why are we still talking about this?!" --Ken
Re: Unicode Emoji 11.0 characters now ready for adoption!
On 3/9/2018 9:29 AM, via Unicode wrote: Documented increase such as scientific terms for new elements, flora and fauna, would seem to be not more one or two dozen a year. Indeed. Of the "urgently needed characters" added to the unified CJK ideographs for Unicode 11.0, two were obscure place name characters needed to complete mapping for the Japanese IT mandatory use of the Moji Joho collection. The other three were newly standardized Chinese characters for superheavy elements that now have official designations by the IUPAC (as of December 2015): Nihonium (113), Tennessine (117) and Oganesson (118). The Chinese characters coined for those 3 were encoded at U+9FED, U+9FEC, and U+9FEB, respectively. Oganesson, in particular, is of interest, as the heaviest known element produced to date. It is the subject of 1000's of hours of intense experimentation and of hundreds of scientific papers, but: ... since 2005, only five (possibly six) atoms of the nuclide ^294 Og have been detected. But we already have a Chinese character (pronounced ào) for Og, and a standardized Unicode code point for it: U+9FEB. Next up: unobtanium and hardtofindium --Ken
Re: Translating the standard
On 3/9/2018 6:58 AM, Marcel Schneider via Unicode wrote: As of translating the Core spec as a whole, why did two recent attempts crash even before the maintenance stage, while the 3.1 project succeeded? Essentially because both the Japanese and the Chinese attempts were conceived of as commercial projects, which ultimately did not cost out for the publishers, I think. Both projects attempted limiting the scope of their translation to a subset of the core spec that would focus on East Asian topics, but the core spec is complex enough that it does not abridge well. And I think both projects ran into difficulties in trying to figure out how to deal with fonts and figures. The Unicode 3.0 translation (and the 3.1 update) by Patrick Andries was a labor of love. In this arena, a labor of love is far more likely to succeed than a commercial translation project, because it doesn't have to make financial sense. By the way, as a kind of annotation to an annotated translation, people should know that the 3.1 translation on Patrick's site is not a straight translation of 3.1, but a kind of interpreted adaptation. In particular, it incorporated a translation of UAX #15, Unicode Normalization Forms, Version 3.1.0, as a Chapter 6 of the translation, which is not the actual structure of Unicode 3.1. And there are other abridgements and alterations, where they make sense -- compare the resources section of the Preface, for example. This is not a knock on Patrick's excellent translation work, but it does illustrate the inherent difficulties of trying to approach a complete translation project for *any* version of the Unicode Standard. --Ken
Re: Unicode Emoji 11.0 characters now ready for adoption!
On 3/7/2018 1:12 PM, Philippe Verdy via Unicode wrote: Shouldn't we create a variant of IDS, using combining joiners between Han base glyphs (then possibly augmented by variant selectors if there are significant differences on the simplification of rendered strokes for each component) ? What is really limiting us to do that ? Ummm ambiguity, lack of precision, complexity of model, pushback by stakeholders, likely failure of uptake by most implementers, duplication of representation, ... Do you think combining models of Han weren't already thought of years ago? They predated the original encoding of unified CJK in Unicode in 1992. They weren't viable then, and they aren't viable now, either, after 26 years of Unicode implementation of unified CJK as atomic ideographs. --Ken
Translating the standard (was: Re: Fonts and font sizes used in the Unicode)
On 3/5/2018 9:03 AM, suzuki toshiya via Unicode wrote: I have a question; if some people try to make a translated version of Unicode And to add to Asmus' response, folks on the list should understand that even with the best of effort, the concept of a "translated version of Unicode" is a near impossibility. In fairly recent times, two serious efforts to translate *just* the core specification -- one in Japanese, and a somewhat later attempt for Chinese -- crashed and burned, for a variety of reasons. The core specification is huge, contains a lot of very specific technical terminology that is difficult to translate, along with a large collection of script- and language-specific detail, also hard to translate. Worse, it keeps changing, with updates now coming out once every year. Some large parts are stable, but it is impossible to predict what sections might be impacted by the next year's encoding decisions. That is not including the fact that "the Unicode Standard" now also includes 14 separate HTML (or XHTML) annexes, all of which are also moving targets, along with the UCD data files, which often contain important information in their headers that would also require translation. And then, of course, there are the 2000+ pages of the formatted code charts, which require highly specific and very complicated custom tooling and font usage to produce. It would require a dedicated (and expensive) small army of translators, terminologists, editors, programmers, font designers, and project managers to replicate all of this into another language publication -- and then they would have to do it again the next year, and again the next year, in perpetuity. Basically, given the current situation, it would be a fool's errand, more likely to introduce errors and inconsistencies than to help anybody with actual implementation. 
People who want accessibility to the Unicode Standard in other languages need to scale down their expectations considerably, and focus on preparing reasonably short and succinct introductions to the terminology and complexity involved in the full standard. Such projects are feasible. But a full translation of "the Unicode Standard" simply is not. --Ken
CJK Ideograph Encoding Velocity (was: Re: Unicode Emoji 11.0 characters now ready for adoption!)
John, I think this may be giving the list a somewhat misleading picture of the actual statistics for encoding of CJK unified ideographs. The "500 characters a year" or "1000 characters a year" limits are administrative limits set by the IRG for national bodies (and others) submitting repertoire to the "working set" that the IRG then segments into chunks for processing to prepare new increments for actual encoding. In point of fact, if we take 1991 as the base year, the *average* rate of encoding new CJK unified ideographs now stands at 3379 per annum (87,860 as of Unicode 10.0). By "encoding" here, I mean, final, finished publication of the encoded characters -- not the larger number of potentially unifiable submissions that eventually go into a publication increment. There is a gradual downward drift in that number over time, because of the impact on the stats of the "big bang" encoding of 42,711 ideographs for Extension B back in 2001, but recently, the numbers have been quite consistent with an average incremental rate of about 3000 new ideographs per year: 5762 added for Extension E in 2015 7463 added for Extension F in 2017 ~ 4934 to be added for Extension G, probably to be published in 2020 If you run the average calculation including Extension G, assuming 2020, you end up with a cumulative per annum rate of 3200, not much different than the calculation done as of today. And as for the implication that China, in particular, is somehow limited by these numbers, one should note that the vast majority of Extension G is associated with Chinese sources. Although a substantial chunk is formally labeled with a "UK" source this time around, almost all of those characters represent a roll-in of systematic simplifications, of various sorts, associated with PRC usage. (People who want to check can take a look at L2/17-366R in the UTC document registry.) --Ken On 3/5/2018 7:13 AM, via Unicode wrote: Dear All, to simplify discussion I have split the points.
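The per-annum figures quoted above can be reproduced with a few lines of arithmetic. The totals, the 1991 base year, and the 2020 date for Extension G come from the message; treating 2017 (the publication year of Unicode 10.0) as the "as of today" end year is an assumption made here, chosen because it matches the quoted 3379. The function name is hypothetical.

```python
BASE_YEAR = 1991            # base year stated in the message
TOTAL_AS_OF_10_0 = 87_860   # CJK unified ideographs as of Unicode 10.0
EXT_G_ESTIMATE = 4_934      # approximate size of Extension G per the message

def per_annum(total: int, end_year: int, base_year: int = BASE_YEAR) -> float:
    # Cumulative average: total encoded ideographs divided by elapsed years.
    return total / (end_year - base_year)

# ASSUMPTION: 2017 is used as the end year for the Unicode 10.0 figure.
rate_now = per_annum(TOTAL_AS_OF_10_0, 2017)
rate_with_g = per_annum(TOTAL_AS_OF_10_0 + EXT_G_ESTIMATE, 2020)

print(round(rate_now))     # 3379
print(round(rate_with_g))  # 3200
```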
Re: Bidi edge cases in Hangul and Indic
David, On 2/22/2018 7:21 PM, David Corbett via Unicode wrote: My confusion stems from Unicode’s online bidi utility. That bidi utility has known defects in it. It is not yet conformant with changes to UBA 6.3, let alone later changes to UBA. And the mapping of memory position to display position in that utility does not take into account complex mapping that has to occur in the layout engines and fonts in real applications. --Ken
Re: IDC's versus Egyptian format controls
On 2/16/2018 11:00 AM, Asmus Freytag via Unicode wrote: On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote: That doesn't square well with, "An implementation *may* render a valid Ideographic Description Sequence either by rendering the individual characters separately or by parsing the Ideographic Description Sequence and drawing the ideograph so described." (TUS 10.0 p704, in Section 18.2) Emphasis on the "may". In point of fact, no widespread layout engine or set of fonts does parse IDS'es to turn them into single ideographs for display. That would be a highly specialized display. Should we ask to make the default behavior (visible IDS characters) more explicit? Ask away. --Ken I don't mind allowing the other as an option (it's kind of the reverse of the "show invisible" mode, which we also allow, but for which we do have a clear default).
Re: IDC's versus Egyptian format controls
On 2/16/2018 8:22 AM, Ken Whistler wrote: The Egyptian quadrat controls, on the other hand, are full-fledged Unicode format controls. One more point of distinction: The (gc=So) IDC's follow a syntax that uses Polish notation order for the descriptive operators (inherited from the intended use in GB 18030, where these came from in the first place). That order minimizes ambiguity of representation without requiring bracketing, but it has the disadvantage of being hard for humans to interpret easily in complicated cases. The Egyptian format controls use an infix notation, instead. That follows current Egyptologists' practice of representing quadrats with MdC conventions. It is also a better order for the layout engine processing. The disadvantage is that it requires a bracketing notation to deal with ambiguities of operator precedence in complicated cases. --Ken
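To illustrate why Polish (prefix) notation needs no bracketing, here is a minimal sketch of a recursive parser for Ideographic Description Sequences. It assumes the Unicode 10.0 set of IDCs (U+2FF0..U+2FFB), where U+2FF2 and U+2FF3 take three operands and the rest take two, and it treats any other character as an operand. It ignores length limits and other constraints in the standard's IDS syntax; the function names are hypothetical.

```python
# Arity of each ideographic description character (IDC), as of Unicode 10.0.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # left to middle and right
IDC_ARITY["\u2FF3"] = 3  # above to middle and below

def parse_ids(text: str, pos: int = 0):
    """Parse one IDS starting at text[pos]; return (tree, next_pos)."""
    if pos >= len(text):
        raise ValueError("IDS ended before all operands were supplied")
    ch = text[pos]
    pos += 1
    if ch not in IDC_ARITY:
        return ch, pos  # a plain ideograph or radical is a leaf
    operands = []
    for _ in range(IDC_ARITY[ch]):
        # Each operator knows how many operands follow, so nesting is
        # unambiguous without brackets.
        operand, pos = parse_ids(text, pos)
        operands.append(operand)
    return (ch, *operands), pos

def parse_whole_ids(text: str):
    tree, pos = parse_ids(text)
    if pos != len(text):
        raise ValueError("trailing characters after a complete IDS")
    return tree

# Illustrative sequence: U+2FF1 (above/below) applied to U+6728 and to a
# nested U+2FF0 (left/right) of two more U+6728.
tree = parse_whole_ids("\u2FF1\u6728\u2FF0\u6728\u6728")
print(tree)
```

An infix scheme, by contrast, would need an explicit grouping mechanism to say which operator binds first in a nested case like this one, which is the trade-off described in the message.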
IDC's versus Egyptian format controls (was: Re: Why so much emoji nonsense?)
On 2/16/2018 8:00 AM, Richard Wordingham via Unicode wrote: A more portable solution for ideographs is to render an Ideographic Description Sequences (IDS) as approximations to the characters they describe. The Unicode Standard carefully does not prohibit so doing, and a similar scheme is being developed for blocks of Egyptian Hieroglyphs, and has been proposed for Mayan as well. A point of clarification: The IDC's (ideographic description characters) are explicitly *not* format controls. They are visible graphic symbols that sit visibly in text. There is a specified syntax for stringing them together into sequences with ideographic characters and radicals to *suggest* a specific form of CJK (or other ideographic) character assembled from the pieces in a certain order -- but there is no implication that a generic text layout process *should* attempt to assemble that described character as a single glyph. IDC's are a *description* methodology. IDC's are General_Category=So. The Egyptian quadrat controls, on the other hand, are full-fledged Unicode format controls. They do not just describe hieroglyphic quadrats -- they are intended to be implemented in text format software and OpenType fonts to actually construct and display fully-formed quadrats on the fly. They will be General_Category=Cf. Mayan will work in a similar manner, although the specification of the sign list and exact required set of format controls is not yet as mature as that for Egyptian. --Ken
Re: Why so much emoji nonsense?
On 2/15/2018 2:24 PM, Philippe Verdy via Unicode wrote: And it's in the mission of Unicode, IMHO, to promote litteracy Um, no. And not even literacy, either. ;-) https://en.wikipedia.org/wiki/Category:Organizations_promoting_literacy --Ken
Re: Why so much emoji nonsense?
On 2/14/2018 12:49 PM, Philippe Verdy via Unicode wrote: RCLLTHTWHNLPHBTSWRFRSTNVNTDPPLWRTTXTLKTHS ! [ ... lots to say about the history of writing ... ] And the use (or abuse) of emojis is returning us to the prehistory when people draw animals on walls of caverns: this was a very slow communication, not giving a rich semantic, full of ambiguities about what is really meant, and in fact a severe loss of knowledge where people will not communicate easily and rapidly. =-O Perhaps Philippe was missing my point about how and why emoji are actually used. --Ken
Re: Why so much emoji nonsense?
On 2/14/2018 12:53 AM, Erik Pedersen via Unicode wrote: Unlike text composed of the world’s traditional alphabetic, syllabic, abugida or CJK characters, emoji convey no utilitarian and unambiguous information content. I think this represents a misunderstanding of the function of emoji in written communication, as well as a rather narrow concept of how writing systems work and why they have evolved. RECALLTHATWHENALPHABETSWEREFIRSTINVENTEDPEOPLEWROTETEXTLIKETHIS The invention and development of word spacing, punctuation, and casing, among other elements of typography, represent the addition of meta-level information to written communication that assists in legibility, helps identify lexical and syntactic units, conveys prosody, and other information that is not well conveyed by simply setting down letters of an alphabet one right after the other. Emoticons were invented, in large part, to fill another major hole in written communication -- the need to convey emotional state and affective attitudes towards the text. This is the kind of information that face-to-face communication has a huge and evolutionarily deep bandwidth for, but which written communication typically fails miserably at. Just adding a little happy face :-) or sad face :-( to a short email manages to convey some affect much more easily and effectively than adding on entire paragraphs trying to explain how one feels about what was just said. Novelists have the skill to do that in text without using little pictographic icons, but most of us are not professional writers! Note that emoticons were invented almost as soon as people started communicating in digital mediums like email -- so long predate anything Unicode came up with. 
Other kinds of emoji that we've been adding recently may have a somewhat more uncertain trajectory, but the ones that seem to be most successful are precisely those which manage to connect emotionally with people, and which assist them in conveying how they *feel* about what they are writing. So I would suggest that people not just dismiss (or diss) this ongoing phenomenon. Emoji are widely used for many good reasons. And of course, like any other aspect of writing, get mis-used in various ways, as well. But you can be sure that their impact on the evolution of world writing is here to stay and will be the topic of serious scholastic papers by scholars of writing for decades to come. ;-) --Ken
Re: Word_Break for Hieroglyphs
Gentlemen, On 12/14/2017 6:53 AM, Mark Davis ☕️ via Unicode wrote: Thus I would like people who are both knowledgeable about hieroglyphs /and/ Unicode properties to weigh in. I know that people like Andrew Glass are on this list, who satisfy both criteria. And what constitutes a cluster? This entire discussion is premature. The model for Egyptian is in flux right now. What constitutes a "quadrat", which is significantly relevant to any determination of how other segmentation properties should work for Egyptian hieroglyphics, will depend on the details of the model and how quadrat formation interacts with the exact set of format controls eventually agreed upon. See: http://www.unicode.org/L2/L2017/17112r-quadrat-encoding.pdf (And please note that that has a reference list of 13 *other* documents. This is not simple stuff.) When we get closure on the Egyptian model, *then* will be the time to make suggestions for how Egyptian values for GCB, WB, and LB might be adjusted for possible better default behavior. --Ken
Re: Armenian Mijaket (Armenian colon)
Asmus, On 12/5/2017 12:35 PM, Asmus Freytag via Unicode wrote: I don't know the history of this particular "unification" Here are some clues to guide further research on the history. The annotation in question was added to a draft of the NamesList.txt file for Unicode 4.1 on October 7, 2003. The annotation was not yet in the Unicode 4.0 charts, published in April, 2003. That should narrow down the search for everybody. I can't find specific mention of this in the UTC minutes from the relevant 2003 window. But I strongly suspect that the catalyst for the change was the discussion that took place regarding PRI #12 re terminal punctuation: http://www.unicode.org/review/pr-12.html That document, at least, does mention "Armenian" and U+2024, although not in the same breath. That PRI was discussed and closed at UTC #96, on August 25, 2003: http://www.unicode.org/L2/L2003/03240.htm I don't find any particular mention of U+2024 in my own notes from that meeting, so I suspect the proximal cause for the change to the annotation for U+2024 on October 7 will have to be dug out of an email archive at some point. --Ken
Re: implicit weight base for U+2CEA2
On 9/27/2017 2:19 PM, Markus Scherer via Unicode wrote: On Wed, Sep 27, 2017 at 1:49 PM, James Tauber via Unicode wrote: I recently updated pyuca[1], my pure Python implementation of the Unicode Collation Algorithm to work with 8.0.0, 9.0.0, and 10.0.0 but to get all the tests to work, I had to special case the implicit weight base for U+2CEA2. The spec seems to suggest the base should be FB80 but I had to override just that code point to have a base of FBC0 for the tests to pass. Is this a known issue with the spec or something I've missed? 2CEA2..2CEAF are unassigned code points for which the UCA+DUCET uses a base of FBC0. markus And you may have a range error in Extension E to account for the test problem. The relevant section of CollationTest_SHIFTED_SHORT.txt has tests that will pass only if: 2B735 < 2B81E < 2CEA2 < 2EBE1 < 2FFFE (that is, Ext C < Ext D < Ext E < Ext F < noncharacter). Those are *unassigned* code points just past the assigned ranges but still in the blocks in each of those CJK extensions. So if you have a range error for assigned characters in Extension E, you'd get a failure at that point in the test cases. --Ken
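The range issue can be made concrete with a small sketch of the implicit weight computation from UTS #10 (AAAA = base + (CP >> 15), BBBB = (CP & 0x7FFF) | 0x8000). The range table below is deliberately partial: it lists only the assigned ranges of CJK Extensions C through F as of Unicode 10.0, with end points taken from the code points named in the message, and it omits the FB40 base and every other range a real implementation needs. It is not pyuca's code; the names are hypothetical.

```python
# ASSUMPTION: assigned ranges for Extensions C-F as of Unicode 10.0 only.
# The last assigned code point is one less than each "just past" code point
# cited in the message (2B735, 2B81E, 2CEA2, 2EBE1).
EXTENSION_RANGES = [
    (0x2A700, 0x2B734),  # Extension C
    (0x2B740, 0x2B81D),  # Extension D
    (0x2B820, 0x2CEA1),  # Extension E (ends at 2CEA1, not at the block end)
    (0x2CEB0, 0x2EBE0),  # Extension F
]

def implicit_base(cp: int) -> int:
    # Assigned extension ideographs get FB80; everything else in this
    # partial sketch, including unassigned code points, gets FBC0.
    if any(lo <= cp <= hi for lo, hi in EXTENSION_RANGES):
        return 0xFB80
    return 0xFBC0

def implicit_weights(cp: int) -> tuple:
    aaaa = implicit_base(cp) + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb

probes = [0x2B735, 0x2B81E, 0x2CEA2, 0x2EBE1, 0x2FFFE]
for cp in probes:
    aaaa, bbbb = implicit_weights(cp)
    print(f"U+{cp:X}: {aaaa:04X} {bbbb:04X}")
```

If the Extension E range were mistakenly extended to cover 2CEA2, that code point would get an FB85 lead weight and sort before 2B81E (FBC5), which is the kind of failure described above.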
Re: IBM 1620 invalid character symbol
Ken, On 9/27/2017 11:10 AM, Ken Shirriff via Unicode wrote: The IBM type catalog might be of interest. It describes in great detail the character sets of the IBM typewriters and line printers and the custom characters that can be ordered for printer chains and Selectric type balls. Link: http://bitsavers.org/pdf/ibm/serviceForConsultants/Service_For_Consultants_198312_Complete/15_Type_Catalog.pdf That is a very interesting source, though from a much later era (1983). In particular, the "Special Character Nomenclature" (p. 11 of the pdf) provides a good list of what the IBM typographers at the time thought was the range of special symbols they were working with in this overall collection. Note the presence of the group mark, the record mark, and the segment mark. And in the realm of potential "tofu" indicators, there is the open box and the OCR blob, but nothing like the 1620 symbol(s) we've been talking about. On another point, the "pillow" noted for the invalid character in the IBM 1620-2 (using the Selectric instead of the older IBM typewriter model) was almost certainly also not an actual punch on the Selectric type ball, but instead implemented by an overstrike of "[" and "]". See, e.g., the Pica 72 type style in the catalog noted above, which looks like some of the very earliest Selectric type. Its use could well have been occasioned by the fact that the slab serif typewriter font would have created a muddy blob if you tried to overstrike an "X" and an "I" for this output symbol. --Ken
Re: IBM 1620 invalid character symbol
Asmus, On 9/27/2017 10:02 AM, Asmus Freytag via Unicode wrote: In that context it's worth remembering that while you could say for most typewriters that "the typewriter is the font", there were noted exceptions. The IBM Selectric, for example, had exchangeable type balls which allowed both a font and / or encoding change. (Encoding understood here as association of character to key). That technology was then only two years in the future. And in some sense, not even... ;-) By the 1950's (and probably earlier), enterprising linguists and other special users were conspiring with skilled typewriter repair experts to customize their manual typewriter keyboards and key strikers with custom fonts. I have an example sitting in my office -- an old Olympia manual typewriter with custom-cast type replacing the standard punches on some of the key strikers, and with custom engraved key caps added to the keyboard, to add schwa, eng, open-o, etc. to the typewriter. It also has the bottom dot of the colon *filed off* to create a middle dot key. Typing an actual colon on that machine requires an "input method" consisting of 3 key presses: {period, backspace, middledot} A couple of the keys that have raised accents on them were modified so as to disable the platen advance, thereby becoming permanent "dead keys" -- effectively emulating the encoding of combining marks. There are probably thousands of such customized manual typewriters still sitting around, over and beyond the various standard manufactured models. --Ken
Re: IBM 1620 invalid character symbol
Leo, On 9/26/2017 9:00 PM, Leo Broukhis via Unicode wrote: The next time I'm at the Mountain View CHM, I'll try to ask. However, assuming it was an overstrike of an X and an I, then where does the "Eris"-like glyph come from? Was there ever an IBM font with a double-semicircular X like )( ? The reason for focusing on the hardware is that during operation of an IBM 1620, that is what would have been printed on paper by the actual machines, and what people would have seen in core dumps, or whatever. The question of what was printed in the *documentation* is a different issue, really. That involves figuring out what the editors/typesetters of the manuals were doing to represent a symbol generated by overstriking by the hardware, for which they had no convenient type to use, by whatever word processing and printing technology they were using circa 1959. I suspect that both the "Zhe"-like glyph and the "Eris"-like glyph we have seen in the printed copies of the manual are themselves typesetter substituted glyphs for whatever the 1620 tofu glyph was that they were trying to represent. Where they got those glyphs, I dunno -- and it might be pretty difficult to track down, because almost all the folks who would have known what IBM manual typesetting practices were circa 1959 will have passed on by now. I don't know of any *standard* IBM glyph for this "Eris"-like thingie seen in the scanned bit of manual that started this thread -- but my documentation is from the 1980's era listings of standardized glyph identifiers. Who knows what was going on circa 1959, which predated most of the IBM efforts to standardize large glyph sets and large numbers of character sets? Back then, "fonts" consisted of what were cast on the typebars of typewriters, or on the strikers of line printers, or the physical type that typesetters used. Look at the archival pictures of the IBM 1620. Do you see any display font anywhere? 
That console is a Star-Trek style computer console -- all register lights and bit switches and rows of power station style light-up buttons. Not a font anywhere. The only font on that machine can be found by feeling the key strikers in the typewriter. --Ken
Re: IBM 1620 invalid character symbol
Philippe, Those aren't negative digits, per se. The usage in the manual is with an overline (or macron) to indicate the flag bit. It does occur over a zero, and in explanation in the text of floating point operations, it is also shown over letters (X, M, E) representing digits of the exponent and mantissa. See p. 27 (31 of the pdf) in that same manual, for an extensive discussion with lots of examples in the text: http://www.bitsavers.org/pdf/ibm/1620/A26-5706-3_IBM_1620_CPU_Model_1_Jul65.pdf The Unicode representation of the text material printed on that page would best be done with a combining macron, I think. --Ken On 9/26/2017 6:34 AM, Philippe Verdy via Unicode wrote: But what is interesting is the use of negative digits (-1 to -9, with the minus sign above the digit; I've not seen a case of minus 0, not needed apparently by the described operations) How do you encode these negative decimal digits in Unicode ? with a macron diacritic ?
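As a small sketch of the combining-macron representation suggested above, the flagged digits could be built by appending U+0304 COMBINING MACRON to the base character. The helper name and the sample digits are hypothetical; the choice of U+0304 is one reading of "combining macron" in the message.

```python
import unicodedata

COMBINING_MACRON = "\u0304"  # U+0304 COMBINING MACRON

def flagged(ch: str) -> str:
    # Represent a digit (or placeholder letter) carrying the 1620 flag bit
    # as base character + combining macron.
    return ch + COMBINING_MACRON

sample = "".join(flagged(d) for d in "07")
print(sample)
print([f"U+{ord(c):04X}" for c in sample])
print(unicodedata.name(COMBINING_MACRON))
```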
Re: IBM 1620 invalid character symbol
Leo, Yeah, I know. My point was that by examining the physical typewriter keys (the striking head on the typebar, not the images on the keypads), one could see what could be generated *by* overstriking. I think Philippe's suggestion that it was simply an overstrike of "X" with an "I" is probably the simplest explanation for the actual operation. And the typeset manuals just grabbed some type that looked similar. Note that the typewriters in question didn't have a vertical bar or backslash, apparently. But adding an annotation for similar-looking symbols that could be used for this is, I agree, probably better than looking for a proposal to encode some new symbol for this oddball construction. If it really is an overstrike, then technically, it could probably also be represented as the sequence <0058, 20D2>, just to represent the data. --Ken On 9/25/2017 11:34 PM, Leo Broukhis wrote: If it was implemented as an overprint, either )^H|^H( or \^H|^H/ and was intended to signify an invalid character (for example, in the text part of core dumps, where a period is used by hexdump -C), then there would not be a physical key to generate it.
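The sequence <0058, 20D2> mentioned above can be inspected with a short sketch. It only looks up the two code points in Python's bundled copy of the Unicode Character Database; whether a given font renders the overlay acceptably is a separate question that this does not address.

```python
import unicodedata

# <0058, 20D2>: capital X followed by a combining vertical-line overlay.
overstruck_x = "\u0058\u20D2"

for ch in overstruck_x:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
```

The second code point is a nonspacing mark, so the pair is meant to display as a single overstruck symbol while remaining two code points in the data.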
Re: IBM 1620 invalid character symbol
The 1620 manual accessed from the Wiki page shows the same information but with a different glyph (which looks more like the capital zhe, and is presumably the source of the glyph cited in the Wiki page itself). See: http://www.bitsavers.org/pdf/ibm/1620/A26-5706-3_IBM_1620_CPU_Model_1_Jul65.pdf p. 52 of the document (56/99 of the pdf). So there was some significant glyph variation in the 1620 documentation. My guess is that the invalid character tofu was implemented as an overprint symbol on the 1620 console typewriter (since the overlines and the strikethroughs clearly were). The whole system was basically using only a 50-character character set. But to verify exactly what was going on, somebody would presumably have to examine the physical keys of a 1620 console typewriter to see what they could generate on paper. I'm guessing the Computer History Museum ( http://www.computerhistory.org/ ) would have one sitting around. --Ken On 9/25/2017 9:48 PM, Leo Broukhis via Unicode wrote: Wikipedia (https://en.wikipedia.org/wiki/IBM_1620#Invalid_character) describes the "invalid character" symbol (see attachment) as a Cyrillic Ж which it obviously is not. But what is it? Does it deserve encoding, or is it a glyph variation of an existing codepoint?
Re: Rendering variants of U+3127 Bopomofo Letter I
Albrecht, See TUS, Section 18.3, Bopomofo, p. 707: http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf#G22553 --Ken On 8/24/2017 12:19 AM, Dreiheller, Albrecht via Unicode wrote: Hello Chinese experts, The Letter I in the Bopomofo alphabet (U+3127) has two rendering variants, a vertical bar and a horizontal bar. Can anyone please tell me the context criteria: when should which variant be used? Is it PR China using the vertical form (as in the font SimSun) and Taiwan using the horizontal form (as in the font PMingLiU)? Thanks Albrecht
Re: emoji props in the ucdxml ?
Manuel, I suspect that such a link may already be in the works for the /Public/emoji/ data directory. But if you want to make sure your suggestion is reviewed by the UTC, you should submit it via the contact form: http://www.unicode.org/reporting.html --Ken On 7/5/2017 12:37 PM, Manuel Strehl via Unicode wrote: but are there any plans to integrate the data in the ucdxml [2] (possibly as separate files) ? No. Not unless and until they become formally part of the UCD. In this context: Would it be possible for the maintainers of the TR #51 data files to add a symlink "latest" under unicode.org/Public/emoji/latest like there is for the UCD? That would be a tremendous time saver, at least for me, having a constant URL to fetch the latest Emoji data from. Who should I ask for such a link? Cheers, Manuel
Re: emoji props in the ucdxml ?
On 7/5/2017 10:01 AM, Daniel Bünzli via Unicode wrote: I know the emoji properties [1] are not formally part of the UCD (not sure exactly why though), Because they are maintained as part of an independent standard now (UTS #51), which is still on track to have a faster turnaround -- and hence faster data updates -- not synched with the annual versions of the Unicode Standard. Hence they cannot be formally a part of the UCD -- unless the entire Unicode Standard were going to be churned on a faster cycle as well. but are there any plans to integrate the data in the ucdxml [2] (possibly as separate files) ? No. Not unless and until they become formally part of the UCD. --Ken
Re: Announcing The Unicode® Standard, Version 10.0
I wonder IF 9 times suffice, But IF more are required, I'll tweet ILY, tweet it twice -- Since spelling's been retired. On 6/21/2017 8:37 AM, William_J_G Overington via Unicode wrote: Here is a mnemonic poem, that I wrote on Monday 20 February 2017, now published as U+1F91F is now officially in The Unicode Standard. One eff nine one eff Is the code number to say In one symbol A very special message To a loved one far away In an email Or a message of text
Re: Running out of code points, redux (was: Re: Feedback on the proposal...)
On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote: TUS Section 3 is like the Augean Stables. It is a complete mess as a standards document, That is a matter of editorial taste, I suppose. imputing mental states to computing processes. That, however, is false. The rhetorical turn in the Unicode Standard's conformance clauses, "A process shall interpret..." and "A process shall not interpret..." has been in the standard for 21 years, and seems to have done its general job in guiding interoperable, conformant implementations fairly well. And everyone -- well, perhaps almost everyone -- has been able to figure out that such wording is a shorthand for something along the lines of "Any person implementing software conforming to the Unicode Standard in which a process does X shall implement it in such a way that that process when doing X shall follow the specification part Y, relevant to doing X, exactly according to that specification of Y...", rather than a misguided assumption that software processes are cognitive agents equipped with mental states that the standard can "tell what to think". And I contend that the shorthand works just fine. Table 3-7 for example, should be a consequence of a 'definition' that UTF-8 only represents Unicode Scalar values and excludes 'non-shortest forms'. Well, Definition D92 does already explicitly limit UTF-8 to Unicode scalar values, and explicitly limits the form to sequences of one to four bytes. The reason why it doesn't explicitly include the exclusion of "non-shortest form" in the definition, but instead refers to Table 3-7 for the well-formed sequences (which, btw explicitly rule out all the non-shortest forms), is because that would create another terminological conundrum -- trying to specify an air-tight definition of "non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is terminologically cleaner to let people *derive* non-shortest form from the explicit exclusions of Table 3-7. 
Instead, the exclusion of the sequence is presented as a brute definition, rather than as a consequence of 0xD800 not being a Unicode scalar value. Likewise, 0xFC fails to be legal because it would define either a 'non-shortest form' or a value that is not a Unicode scalar value. Actually 0xFC fails quite simply and unambiguously, because it is not in Table 3-7. End of story. Same for 0xFF. There is nothing architecturally special about 0xF5..0xFF. All are simply and unambiguously excluded from any well-formed UTF-8 byte sequence. The differences are a matter of presentation; the outcome as to what is permitted is the same. The difference lies rather in whether the rules are comprehensible. A comprehensible definition is more likely to be implemented correctly. Where the presentation makes a difference is in how malformed sequences are naturally handled. Well, I don't think implementers have all that much trouble figuring out what *well-formed* UTF-8 is these days. As for "how malformed sequences are naturally handled", I can't really say. Nor do I think the standard actually requires any particular handling to be conformant. It says thou shalt not emit them, and if you encounter them, thou shalt not interpret them as Unicode characters. Beyond that, it would be nice, of course, if people converged their error handling for malformed sequences in cooperative ways, but there is no conformance statement to that effect in the standard. I have no trouble with the contention that the wording about "best practice" and "recommendations" regarding the handling of U+FFFD has caused some confusion and differences of interpretation among implementers. I'm sure the language in that area could use cleanup, precisely because it has led to contending, incompatible interpretations of the text. 
As to what actually *is* best practice in use of U+FFFD when attempting to convert ill-formed sequences handed off to UTF-8 conversion processes, or whether the Unicode Standard should attempt to narrow down or change practice in that area, I am completely agnostic. Back to the U+FFFD thread for that discussion. --Ken
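The point that well-formedness falls straight out of Table 3-7 can be sketched as a table-driven check. This is an illustrative sketch, not text from the standard: the byte ranges are transcribed from Table 3-7 (Well-Formed UTF-8 Byte Sequences), and the names `TABLE_3_7` and `is_well_formed_utf8` are made up for the example.

```python
# Each row: (range of the first byte, [allowed range for each following byte]).
# Ranges transcribed from Table 3-7 of the Unicode Standard.
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of byte sequences listed in Table 3-7."""
    i = 0
    while i < len(data):
        for (lo, hi), trail in TABLE_3_7:
            if lo <= data[i] <= hi:
                break
        else:
            # First byte is in no row: C0, C1, F5..FF, or a stray 80..BF.
            return False
        tail = data[i + 1 : i + 1 + len(trail)]
        if len(tail) != len(trail):
            return False  # sequence is truncated
        if not all(tlo <= b <= thi for b, (tlo, thi) in zip(tail, trail)):
            return False  # a following byte is outside its allowed range
        i += 1 + len(trail)
    return True
```

Nothing in the function mentions surrogates or non-shortest forms. ED A0 80 fails because A0 is outside 80..9F in the ED row, C0 80 fails because C0 starts no row, and FC and FF fail for the same reason as C0. The check says nothing about how a converter should recover from a failure, which matches the observation that the standard imposes no conformance requirement there.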
Re: Running out of code points, redux (was: Re: Feedback on the proposal...)
On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote: By definition D39b, either sequence of bytes, if encountered by a conformant UTF-8 conversion process, would be interpreted as a sequence of 6 maximal subparts of an ill-formed subsequence. ("D39b" is a typo for "D93b".) Sorry about that. :) Conformant with what? There is no mandatory *requirement* for a UTF-8 conversion process conformant with Unicode to have any concept of 'maximal subpart'. Conformant with the definition of UTF-8. I agree that nothing forces a conversion *process* to care anything about maximal subparts, but if *any* process using a conformant definition of UTF-8 then goes on to have any concept of "maximal subpart of an ill-formed subsequence" that departs from definition D93b in the Unicode Standard, then it is just making s**t up. I don't see a good reason to build in special logic to treat FC 80 80 80 80 80 as somehow privileged as a unit for conversion fallback, simply because *if* UTF-8 were defined as the Unix gods intended (which it ain't no longer) then that sequence *could* be interpreted as an out-of-bounds scalar value (which it ain't) on spec that the codespace *might* be extended past 10 at some indefinite time in the future (which it won't). Arguably, it requires special logic to treat FC 80 80 80 80 80 as an invalid sequence. That would be equally true of FF FF FF FF FF FF. Which was my point, actually. FC is not ASCII, True, of course. But irrelevant. Because we are talking about UTF-8 here. And just because some non-UTF-8 character encoding happened to include 0xFC as a valid (or invalid) value, might not require any special case processing. A simple 8-bit to 8-bit conversion table could be completely regular in its processing of 0xFC for a conversion. and has more than one leading bit set. It has the six leading bits set, True, of course. and therefore should start a sequence of 6 characters. 
That is completely false, and has nothing to do with the current definition of UTF-8. The current, normative definition of UTF-8, in the Unicode Standard, and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot start a sequence of anything identifiable as UTF-8. --Ken Richard.
Re: Running out of code points, redux (was: Re: Feedback on the proposal...)
On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote: You were implicitly invited to argue that there was no need to handle 5 and 6 byte invalid sequences. Well, working from the *current* specification: FC 80 80 80 80 80 and FF FF FF FF FF FF are equal trash, uninterpretable as *anything* in UTF-8. By definition D93b, either sequence of bytes, if encountered by a conformant UTF-8 conversion process, would be interpreted as a sequence of 6 maximal subparts of an ill-formed subsequence. Whatever your particular strategy for conversion fallbacks for uninterpretable sequences, it ought to treat either one of those trash sequences the same, in my book. I don't see a good reason to build in special logic to treat FC 80 80 80 80 80 as somehow privileged as a unit for conversion fallback, simply because *if* UTF-8 were defined as the Unix gods intended (which it ain't no longer) then that sequence *could* be interpreted as an out-of-bounds scalar value (which it ain't) on spec that the codespace *might* be extended past 10 at some indefinite time in the future (which it won't). --Ken
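One way to see the "equal trash" point in a running implementation is a short sketch using CPython's built-in UTF-8 decoder with the `replace` error handler. It assumes a Python 3 interpreter whose decoder substitutes one U+FFFD per maximal subpart; other converters are free to make different fallback choices.

```python
REPLACEMENT = "\uFFFD"

old_style_six_byte = bytes([0xFC, 0x80, 0x80, 0x80, 0x80, 0x80])
plain_garbage = bytes([0xFF] * 6)

for raw in (old_style_six_byte, plain_garbage):
    decoded = raw.decode("utf-8", errors="replace")
    # Each byte is its own ill-formed subsequence, so each gets its own U+FFFD.
    print(raw.hex(" "), "->", decoded.count(REPLACEMENT), "replacement characters")
```

Under that assumption neither sequence is treated as a privileged six-byte unit. Both come out as six replacement characters.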
Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote: The link provided about the PRI doesn't lead to the comments. PRI #121 (August, 2008) pre-dated the practice of keeping all the feedback comments together with the PRI itself in a numbered directory with the name "feedback.html". But the comments were collected together at the time and are accessible here: http://www.unicode.org/L2/L2008/08282-pubrev.html#pri121 Also there was a separately submitted comment document: http://www.unicode.org/L2/L2008/08280-pri121-cmt.txt And the minutes of the pertinent UTC meeting (UTC #116): http://www.unicode.org/L2/L2008/08253.htm The minutes simply capture the consensus to adopt Option #2 from PRI #121, and the relevant action items. I now return the floor to the distinguished disputants to continue litigating history. ;-) --Ken
Re: Comparing Raw Values of the Age Property
Richard, On 5/23/2017 1:48 PM, Richard Wordingham via Unicode wrote: The object is to generate code *now* that, up to say Unicode Version 23.0, can work out, from the UCD files DerivedAge.txt and PropertyValueAliases.txt, whether an arbitrary code point was included by some Unicode version identified by a value of the property Age. Ah, but keep in mind, if projecting out to Version 23.0 (in the year 2030, by our current schedule), there is a significant chance that particular UCD data files may have morphed into something entirely different. Recall how at one point Unihan.txt morphed into Unihan.zip with multiple subpart files. Even though the maintainers of the UCD data files do our best to maintain them to be as stable as possible, their content and sometimes their formats do morph gradually from release to release. Just don't expect *any* parser to be completely forward proofed against what *might* happen in the UCD in some future version. On the other hand, for the property Age, even in the absence of normative definitions of invariants for the property values, given recent practice, it is pretty damn safe to assume: A. Major versions will continue to have two digits, incremented by one for each subsequent version: 10, 11, 12, ... 99. B. Minor versions will mostly (if not entirely) consist of the value "0", and will never require two digits. Assumption A will get you through this century, which by my estimation should well exceed the lifetime of any code you might be writing now that depends on it. BTW, unlike many actual products, the version numbering of the Unicode Standard is not really driven by marketing concerns. So there is very little chance of some version sequence for Unicode that ends up fitting a pattern like: 3.0, 3.1, 95 or NT, 98, 2000, XP, Vista, 7, 8, 8.1, 10 ... 
;-) What TUS 9.0, its appendices and annexes is lacking is a clear statement such as, "The short values for the Age property are of the form "m.n", with the first field corresponding to the major version, and the second field corresponding to the minor version. There is no need for a third version field, because new characters are never assigned in update versions of the standard." I think the UTC and the editors had just been assuming that the pattern was so obvious that it needed no explaining. But the lack of a clear description of Age had become apparent, which is why I wrote that text to add to UAX #44 for the upcoming version. Conveniently, this almost true statement is included in Section 5.14 of the proposed update to UAX #44 (in Draft 12, to be precise). It's not quite true, for there is also the short value NA for Unassigned. Is there any way of formally recording this oversight? Yes. You could always file another piece of feedback using the contact form. However, in this case, you already have the attention of the editors of UAX #44. So my advice would be to simply wait now for the publication of Version 10.0 of UAX #44 around the 3rd week of June. --Ken
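Given the "m.n" form described above, comparing raw Age values comes down to comparing integer pairs rather than strings. The following is a minimal sketch under that assumption, with NA treated as "never assigned". The function names `parse_age` and `included_by` are hypothetical, and reading DerivedAge.txt itself is left out.

```python
from typing import Optional, Tuple


def parse_age(value: str) -> Optional[Tuple[int, int]]:
    """Turn a short Age value such as "10.0" into (major, minor); "NA" becomes None."""
    if value == "NA":
        return None
    major, minor = value.split(".")
    return int(major), int(minor)


def included_by(char_age: str, version: str) -> bool:
    """True if a code point whose Age is char_age is assigned as of the given version."""
    target = parse_age(version)
    if target is None:
        raise ValueError("version must be of the form m.n")
    age = parse_age(char_age)
    return age is not None and age <= target


print(included_by("6.1", "10.0"))   # assigned in 6.1, so present in 10.0
print(included_by("10.0", "9.0"))   # assigned in 10.0, so absent from 9.0
print(included_by("NA", "10.0"))    # unassigned
```

The integer comparison is what keeps "6.1" ordered before "10.0". A plain string comparison would put them the other way round once major versions reach two digits.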
Re: English flag (from Re: How to Add Beams to Notes)
On 5/3/2017 3:20 AM, William_J_G Overington via Unicode wrote: Surely a single code point could be found. Single code points are being found for various emoji items on a continuing basis. Why pull up the ladder on encoding some flags each with a single code point? Yes, a single code point for an English flag please. And one for a Welsh flag too please. And one for a Scottish flag too please. And some others please, if that is what end users want. I suggest the following: 10BEDE for an English flag (reminding one of Bede the Venerable) 10CADF for a Welsh flag (harking to Cadfan ap Iago, King of Gwynedd) 10A1BA for a Scottish flag (for Alba, of course) Surely those would work for you! --Ken