Re: [indic] Indian Rupee symbol
On 2010/07/16 16:34, Michael Everson wrote:
> A proposal to add the character to the Unicode Standard and ISO/IEC 10646 was published yesterday. See http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3862.pdf
> The shape of the currency sign has been specified as “an amalgam” of the DEVANAGARI LETTER RA, and the LATIN CAPITAL LETTER RA

LATIN CAPITAL LETTER RA? Shouldn't that be LATIN CAPITAL LETTER R?

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: Reasonable to propose stability policy on numeric type = decimal
On 2010/07/26 4:37, Asmus Freytag wrote:
> PPS: a very hypothetical tough case would be a script where letters serve both as letters and as decimal place-value digits, and with modern living practice.

Well, there actually is such a script, namely Han. The digits (一、二、三、四、五、六、七、八、九、〇) are used both as letters and as decimal place-value digits, and they are scattered widely, and of course there is a lot of modern living practice.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: Reasonable to propose stability policy on numeric type = decimal
On 2010/07/28 0:36, John Dlugosz wrote:
> I can imagine supporting national representations for numbers for outputting reports, but I don't imagine anyone writing in a programming language would be compelled to type 四佰六十 instead of 560.

Well, indeed, I hope nobody would do that. 四佰六十 would be 460, and 560 would be 五佰六十 :-).

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
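P.S.: For illustration, here is a minimal Ruby sketch (my own; the helper names are invented) of the two numeral readings in play here: plain place-value digit strings versus the system that spells out powers of ten, where 四佰六十 = 4×100 + 6×10 = 460:

  # Minimal sketch of the two Han numeral systems discussed in this thread.
  DIGITS = { '〇'=>0, '一'=>1, '二'=>2, '三'=>3, '四'=>4,
             '五'=>5, '六'=>6, '七'=>7, '八'=>8, '九'=>9 }
  POWERS = { '十'=>10, '百'=>100, '佰'=>100, '千'=>1000 }  # 佰 = formal 百

  # Place-value: each character is simply a decimal digit.
  def place_value(s)
    s.chars.reduce(0) { |acc, c| acc * 10 + DIGITS.fetch(c) }
  end

  # Powers-of-ten system: a digit multiplies the power character after it.
  def power_system(s)
    total, digit = 0, 0
    s.chars.each do |c|
      if DIGITS.key?(c)
        digit = DIGITS[c]
      else
        total += (digit.zero? ? 1 : digit) * POWERS.fetch(c)
        digit = 0
      end
    end
    total + digit
  end

  puts place_value("二〇一〇")    # => 2010
  puts power_system("四佰六十")   # => 460
  puts power_system("五佰六十")   # => 560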
Re: Reasonable to propose stability policy on numeric type = decimal
On 2010/07/29 13:33, karl williamson wrote:
> Asmus Freytag wrote:
>> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
>>> Well, there actually is such a script, namely Han. The digits (一、二、三、四、五、六、七、八、九、〇) are used both as letters and as decimal place-value digits, and they are scattered widely, and of course there is a lot of modern living practice.
>> The situation is worse than you indicate, because the same characters are also used as elements in a system that doesn't use place-value, but uses special characters to show powers of 10.
> Is it the case that a sequence of just these characters, without any intervening characters, and not adjacent to the special characters you mention, always means a place-value decimal number?

No. Sequences of numeric Kanji are also used in names and word-plays, and as sequences of individual small numbers. But the same applies to our digits. A very simple example is to use them as a ruler in plain text:

         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: High dot/dot above punctuation?
On 2010/07/29 19:51, Juanma Barranquero wrote:
> On Thu, Jul 29, 2010 at 10:15, Khaled Hosny <khaledho...@eglug.org> wrote:
>> Also, I don't buy in Unicode idea of encoding different sets of decimal digits separately, they are all different graphical presentations of the same thing.
> Not in a document where the author is discussing the differences between them, for example.

The fact that the author is discussing the differences doesn't help in deciding whether to encode one or two characters. A document may discuss the roman and italic versions of a character, or the Times and Palatino versions of a character, or different versions of Times fonts for the same character, and so on. It's very clear that we would get nowhere if we wanted to encode all these.

In simpler words, you cannot use the needs of discussions about encoding (the meta-level) to determine encodings.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: High dot/dot above punctuation?
Hello Juanma,

On 2010/07/30 12:05, Juanma Barranquero wrote:
> On Fri, Jul 30, 2010 at 04:52, Martin J. Dürst <due...@it.aoyama.ac.jp> wrote:
>> It's very clear that we would get nowhere if we wanted to encode all these.
> The comment I responded to talked about characters that are already encoded.

Sorry, I didn't get that.

>> In simpler words, you cannot use the needs of discussions about encoding (the meta-level) to determine encodings.
> Discussing arabic versus latin numerals is not more meta-level than talking about upper vs. lowercase.

Yes indeed. If these distinctions were only necessary when talking *about* these characters (meta-level) rather than when just using them (non-meta), then I would indeed agree that there is no reason to encode them separately.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: Most complete (free) Chinese font?
Hello Michael,

I hope you still remember that I am one of the (apparently very few) people who paid for Everson Mono. That was more than ten years ago.

On 2010/08/03 1:02, Michael Everson wrote:
> On 2 Aug 2010, at 13:10, Leonardo Boiko wrote:
>> When did I say there was something shameful about non-freeness? I only said, and I quote, that it’s not my thing.
> I find the term non-free to smack of élitism and a view that commerce is undesirable. And I'm not even very good at being a merchant.

Instead of criticising a term, would you mind proposing a different term?

>> It’s much simpler, for me, to stick to an automated system that guarantees freedom.
> Indeed? Let us weep for those benighted folks who shackled themselves to the world of pecuniary transaction by choosing to render a shareware fee for Everson Mono

Nobody has to weep for me. I actually haven't used Everson Mono much, I'm not even sure whether I ever used it, but at the time I found the idea that somebody was working on a font that covered Unicode really worthy of support.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Results of public Review Issues (in particular #121)
Dear Unicode Experts,

In a discussion about a new protocol, there was some issue about how to replace illegal bytes in UTF-8 with U+FFFD. That reminded me that there was once a Public Review Issue about this, and that as a result, I added something to the Ruby (programming language) codebase. I traced this back to the method test_public_review_issue_121 added at http://svn.ruby-lang.org/cgi-bin/viewvc.cgi/trunk/test/ruby/test_transcode.rb?r1=18291&r2=18290&pathrev=18291, and from there to http://www.unicode.org/review/pr-121.html.

What I now would like to know is what became of the UTC tentative preference for option #2, where this is documented, and if possible, which other programming languages and libraries use or don't use this preference.

On a higher level, this also suggests that it would be very good to add a bit more (meta)data to these review issues, such as date opened, date closed, and resolution. After manipulating the URI a bit, I got to http://www.unicode.org/review/ and from there to http://www.unicode.org/review/resolved-pri-100.html, where I can find: "Resolution: Closed 2008-08-29. The UTC decided to adopt option 2 of the PRI." This should be directly linked from http://www.unicode.org/review/pr-121.html (or just put that information on that page). Also, I'm still interested in where the result of this resolution is nailed down (a new version of the standard, with chapter and verse, or a TR or some such).

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
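P.S.: For concreteness, here is a minimal Ruby sketch (not the actual test from test_transcode.rb) of the kind of replacement being discussed, using the transcoding machinery that went into Ruby 1.9:

  # Repairing ill-formed UTF-8 by transcoding with invalid: :replace
  # (the default replacement string for Unicode targets is U+FFFD).
  broken   = "ab\x80cd".force_encoding('UTF-8')  # \x80 is an illegal byte here
  repaired = broken.encode('UTF-16BE', invalid: :replace).encode('UTF-8')
  puts repaired  # => "ab<U+FFFD>cd", i.e. one U+FFFD between ab and cd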
Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters
On 2010/08/05 2:56, Asmus Freytag wrote:
> On 8/2/2010 5:04 PM, Karl Pentzlin wrote:
>> I have compiled a draft proposal: "Proposal to add Variation Sequences for Latin and Cyrillic letters". The draft can be downloaded at: http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB). The final proposal is intended to be submitted for the next UTC starting next Monday (August 9). Any comments are welcome. - Karl Pentzlin
> This is an interesting proposal to deal with the glyph selection problem caused by the unification process inherent in character encoding. When Unicode was first contemplated, the web did not exist and the expectation was that it would nearly always be possible to specify the font to be used for a given text and that selecting a font would give the correct glyph.

The Web may finally get to solve this problem, although it may still take some time to be fully deployed. Please see http://www.w3.org/Fonts/ for more details and pointers.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: Proposed Update Unicode Technical Standard #46 (Unicode IDNA Compatibility Processing)
On 2010/09/23 5:10, Markus Scherer wrote:
> No mistake here: The 63-octet limitation only applies when generating a string for the DNS lookup, that is, in the ToASCII operation. It makes no sense to count DNS octets in a ToUnicode result. The test file has the appropriate error code for the ToASCII result, and the normal string result for ToUnicode.

Yes indeed. For some actual examples of very long URIs (which actually resolve), see tests 121 (single long label) and 122 at http://www.w3.org/2004/04/uri-rel-test.html.

Also, for a discussion of potential length limits in IDNAbis on Unicode strings (which, for many good reasons, were ultimately rejected), please see the discussion around http://lists.w3.org/Archives/Public/public-iri/2009Sep/0064.html.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: First posting to list: Unicode.org: unicode - punycode converter tool?
On 2010/10/30 9:17, Markus Scherer wrote:
> On Fri, Oct 29, 2010 at 3:57 PM, JP Blankert (thuis PC based) <jpblank...@zonnet.nl> wrote:
>> Dear unicode.org interested, I discovered at least 1 flaw in the converter tools I used so far (as Verisign's IDN to punycode converter): none of the ones I checked recognises the German character ß (the sz, as from 'Straße') correctly, the sign is always dissolved in "ss".
> This is standard IDNA2003 behavior.

Yes.

> It is usually desirable

It is desirable in searching, but it wasn't desirable in domain names. The reason it got into IDNA2003 is because the IETF was looking for data to do case mapping beyond ASCII, and the data available from the Unicode consortium included the 'ß' -> "ss" mapping, and the IETF didn't want to change it because they feared that might start all kinds of discussions on all kinds of (essentially unrelated) issues.

> because a) many German speakers are unsure about when exactly to use ß vs. ss,

Yes, but for many names, it's either one or the other. Essentially, no rules.

> b) the spelling reform a few years ago changed the rules,

Yes. They got way easier and more straightforward.

> and c) Switzerland does not use ß at all in German.

Yes. But that's no reason to take it away from those who use it. (At least I, being Swiss, don't think so.)

> This means that for most purposes it is counter-productive (and can be a security risk) to distinguish ß and ss.

Well, it can be a security risk to distinguish between 'i' and 'l' and '1', and so on, and nevertheless, it's being done for good reasons all the time.

> IDNA2008, an incompatible update, by itself does not map characters.

What's more important, IDNA2008 allows the 'ß' as is.

> UTS #46 provides a compatibility bridge for both IDNA2003 and IDNA2008, and the ß behavior is an option there.

Yes. The basic idea in TR #46 is that in a first phase, 'ß' is mapped to "ss" for lookup, to give registries with German clients a chance for their clients to register true 'ß' where necessary. After that, the mapping can be dropped, so as in the (somewhat distant) future to allow for cases where a name with 'ß' and a name with "ss" are resolved differently.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: Utility to report and repair broken surrogate pairs in UTF-16 text
There is charlint (http://www.w3.org/International/charlint/), which is based on UTF-8. It may be possible to adapt it to UTF-16/32.

Regards,   Martin.

On 2010/11/04 4:37, Jim Monty wrote:
> Is there a utility, preferably open source and written in C, that inspects UTF-16/UTF-16BE/UTF-16LE text and identifies broken surrogate pairs and illegal characters? Ideally, the utility can both report illegal code units and repair them by replacing them with U+FFFD.
> Jim Monty

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Re: Utility to report and repair broken surrogate pairs in UTF-16 text
On 2010/11/05 2:46, Markus Scherer wrote:
> 16-bit Unicode is convenient in that when you find an unpaired surrogate (that is, it's not well-formed UTF-16) you can usually just treat it like a surrogate code point which normally has default properties much like an unassigned code point or noncharacter. It case-maps to itself, normalizes to itself, has default Unicode property values (except for the general category), etc.

Well, yes, you can handle it that way, but that's pretty much GIGO (garbage in, garbage out) and dumping the problem on the next person/software downstream in the datastream. Also, while some things might still work, much stuff won't, e.g. when you try to find a word (with some lone surrogate hidden in some place) by searching for the same word (but with the lone surrogate hidden in another place, or no such surrogate at all).

> In other words, when you process 16-bit Unicode text it takes no effort to handle unpaired surrogates, other than making sure that you only assemble a supplementary code point when a lead surrogate is really followed by a trail surrogate. Hence little need for cleanup functions -- but if you need one, it's trivial to write one for UTF-16.

For some processing this is true, but it's rather short-sighted.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
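P.S.: To make the point concrete, a tiny Ruby sketch (my own illustration) showing that a lone lead surrogate makes a UTF-16BE string ill-formed, which is easy to detect before passing it downstream:

  # An unpaired lead surrogate (D800) makes the string ill-formed UTF-16,
  # which valid_encoding? reports directly.
  s = "\x00a\xD8\x00\x00b".force_encoding('UTF-16BE')  # "a", lone D800, "b"
  p s.valid_encoding?  # => false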
Re: Utility to report and repair broken surrogate pairs in UTF-16 text
On 2010/11/05 8:30, Markus Scherer wrote:
> If the conversion libraries you are using do not support this (I don't know), then you could ask for such options. Or use conversion libraries that do support such options (like ICU and Java).

The encoding conversion library in Ruby 1.9 also supports this. Here's an example:

  utf16_borken = "\x00a\x00b\xD8\x00\x00c\x00d".force_encoding('UTF-16BE')
  utf8_clean = utf16_borken.encode('UTF-8', invalid: :replace, replace: '')
  puts utf8_clean   # prints abcd

In general, and in particular for Unicode Encoding Forms, it's a bad idea to just replace with nothing, because of the security implications this might have. I guess that's the reason Perl doesn't allow this. But if you are sure there are no security implications, then there is no reason not to remove lone surrogates.

Regards,   Martin.

P.S.: Why would you use Ruby for conversion when programming in Perl? You could just as well program in Ruby, it's much more fun!

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
Fwd: RFC 6082 on Deprecating Unicode Language Tag Characters: RFC 2482 is Historic
FYI.   Regards,   Martin.

-------- Original Message --------
Subject: RFC 6082 on Deprecating Unicode Language Tag Characters: RFC 2482 is Historic
Date: Sun, 7 Nov 2010 21:50:44 -0800 (PST)
From: rfc-edi...@rfc-editor.org
To: ietf-annou...@ietf.org, rfc-d...@rfc-editor.org
CC: rfc-edi...@rfc-editor.org

A new Request for Comments is now available in online RFC libraries.

        RFC 6082
        Title:      Deprecating Unicode Language Tag Characters:
                    RFC 2482 is Historic
        Author:     K. Whistler, G. Adams, M. Duerst,
                    R. Presuhn, Ed., J. Klensin
        Status:     Informational
        Stream:     IETF
        Date:       November 2010
        Mailbox:    k...@sybase.com, gl...@skynav.com,
                    due...@it.aoyama.ac.jp, randy_pres...@mindspring.com,
                    john+i...@jck.com
        Pages:      4
        Characters: 6633
        Obsoletes:  RFC 2482
        I-D Tag:    draft-presuhn-rfc2482-historic-02.txt
        URL:        http://www.rfc-editor.org/rfc/rfc6082.txt

RFC 2482, "Language Tagging in Unicode Plain Text", describes a mechanism for using special Unicode language tag characters to identify languages when needed without more general markup such as that provided by XML. The Unicode Consortium has deprecated that facility and strongly recommends against its use. RFC 2482 has been moved to Historic status to reduce the possibility that Internet implementers would consider that system an appropriate mechanism for identifying languages. This document is not an Internet Standards Track specification; it is published for informational purposes.

INFORMATIONAL: This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

This announcement is sent to the IETF-Announce and rfc-dist lists. To subscribe or unsubscribe, see
  http://www.ietf.org/mailman/listinfo/ietf-announce
  http://mailman.rfc-editor.org/mailman/listinfo/rfc-dist

For searching the RFC series, see http://www.rfc-editor.org/rfcsearch.html. For downloading RFCs, see http://www.rfc-editor.org/rfc.html. Requests for special distribution should be addressed to either the author of the RFC in question, or to rfc-edi...@rfc-editor.org. Unless specifically noted otherwise on the RFC itself, all RFCs are for unlimited distribution.

The RFC Editor Team
Association Management Solutions, LLC
Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?
On 2010/11/11 6:28, Mark Davis ☕ wrote:
> That is actually not the case. There are superset relations among some of the CJK character sets, and also -- practically speaking -- between some of the windows and ISO-8859 sets. I say "practically speaking" because in general environments, the C1 controls are really unused, so where a non-ISO-8859 set is the same except for 80..9F you can treat it pragmatically as a superset.

Yes, except that the terms superset/subset (and set in general) shouldn't be used unless you really strictly speak about the repertoire of characters, and not the encoding itself. So e.g. the repertoire of iso-8859-1 is a subset of the repertoire of UTF-8. However, iso-8859-1 is not a subset of UTF-8, not because you can't label some text encoded as iso-8859-1, but because subset relationships among the encodings themselves don't make sense. Also, US-ASCII is not a subset of UTF-8, because when you just use the names of the character encodings, you mean the character encodings, and character encodings don't have subset relationships.

It may as well be possible to use (create?) the term "sub-encoding", saying that an encoding A is a sub-encoding of encoding B if all (legal) byte sequences in encoding A are also legal byte sequences in encoding B and are interpreted as the same characters in both cases. In this sense, US-ASCII is clearly a sub-encoding of UTF-8, as well as a sub-encoding of many other encodings. You can also say that iso-8859-1 is a sub-encoding of windows-1252 if the former is interpreted as not including the C1 range.

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
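P.S.: For the US-ASCII/UTF-8 case, the sub-encoding relation can even be checked exhaustively, since all legal US-ASCII byte sequences are built from 128 single bytes. A quick Ruby check (my own illustration):

  # All 128 US-ASCII byte values are valid UTF-8 and decode to the same
  # characters, so US-ASCII is a sub-encoding of UTF-8 in the above sense.
  ok = (0x00..0x7F).all? do |b|
    ascii = b.chr.force_encoding('US-ASCII')
    utf8  = b.chr.force_encoding('UTF-8')
    ascii.valid_encoding? && utf8.valid_encoding? && ascii == utf8
  end
  p ok  # => true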
Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)
On 2011/07/15 18:51, Michael Everson wrote:
> On 15 Jul 2011, at 09:47, Andrew West wrote:
>> If you want a font to display a visible glyph for a format or space character then you should just map the glyph to its character in the font, as many fonts already do for certain format characters.
> Sometimes I might want to show a dotted box for NBSP and sometimes a real NBSP. Or many other characters. Or show a RTL and LTR override character without actually overriding the text. You'd need a picture for that, because just putting in a glyph for it would also override the text.

I understand the need. But then what happens is that we need a picture in the standard for the character that depicts an RLO (but isn't actually one). And then you need another character to show that picture, and so on ad infinitum. This doesn't scale. If we take the needs of character encoding experts when they write *about* characters to decide what to make a character, then we get many too many characters encoded.

That's similar to the needs of typographers when they talk about different character shapes. If we had encoded a Roman 'a' and an Italic 'a' separately just because the distinction shows up explicitly in some texts on typography, that would have been a mistake (the separation is now available for IPA, but that's a separate issue).

Regards,   Martin.
Re: [bidi] Re: PRI 185 Revision of UBA for improved display of URL/IRIs
Hello Mark, others,

On 2011/07/28 5:01, Mark Davis ☕ wrote:
> Just to remind people: posting to this list does *not* mean submitting to the UTC. If you want to discuss a proposal here, not a problem, but just remember that if you want any action you have to submit to the UTC.
> Unicode members via: http://www.unicode.org/members/docsubmit.html
> Others via: http://www.unicode.org/reporting.html

[I'll copy this text to the i...@ietf.org mailing list (the mailing list of the EAI (Email Address Internationalization) WG) to have a public record, because that's the mailing list where, as far as I'm aware, most of the discussion about this draft in the IETF happened.]

Context
=======
I'm an individual Unicode member, but I'll paste this into the reporting form because that's easier. Please make a 'document' out of it (or more than one, if that helps to better address the issues raised here). I apologize for being late with my comments.

Substantive Comments
====================
On substance, while I don't agree with every detail of what Jonathan Rosenne, Behdad Esfahbod, Aharon Lanin and others have said, I agree with them in general. If their documents/messages are not properly submitted, I include them herewith by reference.

The proposal is an enormous change to the Bidi algorithm, changing its nature in huge ways. Whatever the details eventually look like, it won't be possible to get everything right in one step, and probably countless tweaks will follow (not that they will necessarily make things better, though). Also, dealing with IRIs will increase the appetite/pressure for dealing with various other syntactical constructs in text. The introduction of the new algorithm will create numerous compatibility issues (and attack surfaces for phishing, the main thing the proposal tries to address) for a long period of time. Given that the Unicode Consortium has been working hard to address (compared to this issue) even extremely minor compatibility issues re. IDNs in TR46, it's difficult for me to see how this fits together.

Taking One Step Back
====================
As one of the first people involved with what's now called IDNs and IRIs, I know that the problem of such Bidi identifiers is extremely hard. The IETF, as the standards organization responsible for (Internationalized) Domain Names and for URIs/IRIs, has taken some steps to address it (there's a Bidi section in RFC 3987 (http://tools.ietf.org/html/rfc3987#section-4), and for IDNs, there is http://tools.ietf.org/html/rfc5893). I don't think these are necessarily sufficient or anything. And I don't think that the proposal at hand is completely useless. However, the proposal touches many aspects (e.g. recognizing IRIs in plain text, ...) that are vastly more adequate for definition in another standards organization, or where high-bandwidth coordination with such an organization is crucial (roughly speaking, first on the feasibility of various approaches, then on how to split up the work between the relevant organizations, then on coordination of details). Without such a step back and high-bandwidth coordination, there is a strong chance of producing something highly suboptimal.

(Side comment on detail: It would be better for the document to use something like http://tools.ietf.org/html/rfc3987#section-2.2 rather than the totally obscure and no longer maintained http://rfc-ref.org/RFC-TEXTS/3987/chapter2.html, in the same way the Unicode Consortium would probably prefer to have its own Web site referenced for its work rather than some third-party Web site.)

Taking Another Step Back
========================
I mention 'high-bandwidth' above. The Unicode Public Review process is definitely not suited for this. It has various problems:
- The announcements are often very short, formalistic, and cryptic. (I can dig up examples if needed.)
- The announcements go to a single list; forwarding them to other relevant places is mostly a matter of chance. This should be improved by identifying the relevant parties and contacting them directly.
- To find the Web form, one has to traverse several links.
- The submission is via a Web form, without any confirmation that the comment has been received.
- The space for comments on the form is very small.
- There is no way to make a comment public (except by publishing it separately).
- There is no official response to a comment submitted via the Web form. One finds out about what happened by chance or not at all. (Compare the W3C process, where WGs are required to address each comment formally, and most comments, including the responses, are public.)
- The turnaround is slow. Decisions get made (or postponed) at UTCs only.

Overall, from an outsider's point of view, the review process and the review form feel like the eye of a needle connected to a black hole. [I very much understand that part of the reason the UTC works the way it works is because of
Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?
On 2011/09/10 9:32, Stephan Stiller wrote:
> Actually, I *was* talking about purely typographic/aesthetic ligatures as well. I'm aware that which di-/trigraphs need to be considered from a font design perspective is language-dependent.

And this language-dependence is not only a question of letter combination frequency, but also of aesthetic preference. What I have heard very often is that French has a preference for using many ligatures, whereas Italian uses almost none.

> But the point is that I observe that: (a) aesthetic ligatures are not frequently seen in modern German print and (b) the absence of such ligatures doesn't offend me (in modern German print).

I think part of that comes from the fact that with modern DTP, lots of fonts are used across languages without any particular adjustments with respect to ligatures. (This may not be the case for high-end made-to-order fonts used by publishing houses, but it's certainly true for the run-of-the-mill Times Roman, Helvetica, and so on used on PCs.) Typography is always an interplay between designer, reader, and technology. So what probably happened is that the technology-induced use of the same fonts across languages led to designs with fewer language-specific ligatures (essentially lowest common denominators in terms of ligatures) and to an adjustment of the designs so that this infrequency of ligatures would be less visible. Also, you and other readers got used to these designs.

Regards,   Martin.

> It could be - and a quick visual check confirms this - that the fonts used for printing of {novels, school textbooks, tech/science books, ...} and the associated kerning tables don't necessitate ligatures or have traditionally (fwiw) not been seen as necessitating them. Enough professional publishing houses I _think_ don't use aesthetic ligatures, so that, whenever I do see them in German text, they stand out to me. So /de facto/ usage of aesthetic ligatures seems a bit like a locale parameter to me. That said - if I'm really factually wrong (and ligatures in modern German text are just so subtle and pervasive that I never took notice), people on the list please feel free to correct me.
> Stephan
>
> On 9/9/2011 4:14 PM, Kent Karlsson wrote:
>> I was talking about purely typographic ligatures, in particular ligatures used because the glyphs (normally spaced) would otherwise overlap in an unpleasing manner. If the glyphs don't overlap (or there is extra spacing, which is quite ugly in itself if used in normal text), no need to use a (purely typographic) ligature. So it is a font design issue. (And then there are also ornamental typographic ligatures, like the st ligature, but those are outside of what I was talking about here.) But of course, which pairs of letters (or indeed also punctuation) are likely to occur adjacently is language dependent.
>> /Kent K
>>
>> On 2011-09-09 23:45, Stephan Stiller <sstil...@stanford.edu> wrote:
>>> Pardon my asking, as this is not my specialty:
>>>
>>>> There are several other ligatures that *should* be formed (automatically) by run-of-the-mill fonts: for instance the fj ligature, just to mention one that I find particularly important (and that does not have a compatibility code point).
>>>
>>> About the "should" - isn't this language-dependent? For example I recall that ordinary German print literature barely uses any ligatures at all these days (i.e. I'm not talking about historical texts). And, has anyone ever attempted to catalogue such ligature practices? (Is this suitable for CLDR?)
>>> (I also recall being taken aback by the odd look of ligatures in many LaTeX-typeset English scientific documents, but I suspect that's rather because some of the commonly used fonts there are lacking in aesthetic design.)
>>> Stephan
Re: continue: Glaring Mistake in nomenclature
Hello Delex,

On 2011/09/14 15:55, delex r wrote:
> The “Dark age of Assamese language” ran for about 37 years in this region, when there was an attempt to kill the language by vested interests with the help of British political powers imposing Bengali as the medium of instruction in schools and colleges and for all official purposes.

That sounds like a very sad story, but a long time ago. Please think about how you can affect the future, because you can't change the past.

> I think now naming the script as “Bengali”, that too by stealing two unique letters from the Assamese alphabet list and coloring them with Bengali hue, is part of that notorious linguistic invasion.

No, these letters clearly belong to the same script. That the script was named "Bengali" in the standard may be unfortunate, in particular from your viewpoint, but as far as the official standards are concerned, it can't be changed (as many others have already told you). Please note that you (and anybody else) can call this script whatever you think is most appropriate.

What I think you might be able to ask for is to have some annotation for the two letters in question, in the same way as e.g. the Arabic block has lots of annotations saying which language uses which character, for those characters that are not part of the base Arabic alphabet.

But why don't you look out for things you can change, and that would be much more productive in helping your goal of furthering the Assamese language? For example:

a) Check what problems (if any) there are with technologies such as CSS for styling, ..., to be able to use Assamese without problems on the Internet and the Web and elsewhere. (If you find something, please direct any comments to the relevant mailing lists, and not to this one.)

b) (this one is easier and requires more manpower): Contribute to the Assamese language by publishing content, contributing to Web sites such as Wikipedia, and so on. As an example, it looks as if the Wikipedia article on the Assamese language in the Assamese language (http://as.wikipedia.org/wiki/অসমীয়া_ভাষা) is still quite incomplete.

Regards,   Martin.
Re: Civil suit; ftp shutdown; mailing list shutdown
[By accident, I sent this only to Ken first; he recommended I send it to both Unicode and Unicore.]

I have sent a mail to a relevant IETF list (apps-disc...@ietf.org); the IETF was looking into taking this over, with http://tools.ietf.org/html/draft-lear-iana-timezone-database-04, but apparently, Unicode got alerted first.

In terms of practical matters, two points seem important to me:

First, to ask the judge for a temporary permission (there's a better legal term, but IANAL) to keep the database up until the lawsuit is settled (because the database is probably down now due to a temporary order from the judge to that effect), because of its high practical importance.

Second, what seems to be in dispute is data about old history. While this is important for some applications, in most applications, present and new data is much more important, so one way to avoid problems would be to publish only new data at some new place until the case is settled. That would mean that applications would have to be checked for whether they need the old data or not. Or to only publish diffs (which would be about new, present-day data not from the source under litigation).

Regards,   Martin.

On 2011/10/07 4:45, Ken Lunde wrote:
> Arle and others,
> The URL for the following blog post was tweeted a few minutes ago: http://blog.joda.org/2011/10/today-time-zone-database-was-closed.html
> -- Ken
>
> On Oct 6, 2011, at 9:45 AM, Arle Lommel wrote:
>> Is there any public information about the lawsuit? I was stunned to see the forwarded mail and want to understand the implications of this lawsuit, but I can't find any news about it other than Arthur’s rather telegraphic note. I understand that he may not be able to comment given pending litigation, but if we had any information at all about what the suit is, it might help clarify if there is any need for concern.
>> -Arle
>>
>>> It would be nice, but I don't think the Consortium can do that without first understanding if it gets exposed to its own lawsuit.
>>> Eric.
Re: Civil suit; ftp shutdown; mailing list shutdown
Unicode people: To follow this subject, I recommend looking through http://mm.icann.org/pipermail/tz/ or subscribing to that mailing list at https://mm.icann.org/mailman/listinfo/tz. In addition, please see http://www.ietf.org/mail-archive/web/apps-discuss/current/msg03374.html.

Regards,   Martin.

On 2011/10/07 14:14, Martin J. Dürst wrote:
> [By accident, I sent this only to Ken first; he recommended I send it to both Unicode and Unicore.]
> I have sent a mail to a relevant IETF list (apps-disc...@ietf.org); the IETF was looking into taking this over, with http://tools.ietf.org/html/draft-lear-iana-timezone-database-04, but apparently, Unicode got alerted first.
> [...]
Re: about P1 part of BIDI alogrithm
On 2011/10/10 21:10, Eli Zaretskii wrote:
>> Date: Mon, 10 Oct 2011 17:47:21 +0800
>> From: li bo <libo@gmail.com>
>> From section 3: "Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality, and Section 5.8, Newline Guidelines of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs."
>> I think only 'Enter' and '*Paragraph separator*' can do paragraph breaking.
> In addition to the Paragraph Separator, _any_ newline function (LF, CR+LF, CR, or NEL) can end a paragraph. Also U+2028, the LS character. See section 5.8 of the Unicode Standard cited above.

No, U+2028 (LS) is explicitly *not* a Paragraph Separator. It just indicates where to break a line (rather than leaving that to the implementation), but doesn't restart the Bidi algorithm.

Regards,   Martin.
Re: Solidus variations
On 2011/10/11 7:35, Philippe Verdy wrote:
> I've seen various interpretations, but the ASCII solidus is unambiguously used with a strong left-to-right associativity, and the same occurs in classical mathematics notations (the horizontal bar is another notation, but even where it is used, it also has the equivalent top-to-bottom associativity).

Horizontal bars surely work by using bars of differing length, with shorter bars having higher priority. Horizontal bars of equal length would be very weird.

Regards,   Martin.
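P.S.: For illustration (my example, not from the thread), nested fractions in LaTeX make the bar lengths, and hence the binding, explicit:

  % The shorter bar binds tighter, so these are two different numbers:
  \[ \frac{a}{\frac{b}{c}} = \frac{ac}{b}, \qquad
     \frac{\frac{a}{b}}{c} = \frac{a}{bc} \]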
Re: about P1 part of BIDI alogrithm
On 2011/10/11 10:29, Martin J. Dürst wrote:
> On 2011/10/10 21:10, Eli Zaretskii wrote:
>>> Date: Mon, 10 Oct 2011 17:47:21 +0800
>> In addition to the Paragraph Separator, _any_ newline function (LF, CR+LF, CR, or NEL) can end a paragraph. Also U+2028, the LS character. See section 5.8 of the Unicode Standard cited above.
> No, U+2028 (LS) is explicitly *not* a Paragraph Separator. It just indicates where to break a line (rather than leaving that to the implementation), but doesn't restart the Bidi algorithm.

I might add here that 'break a line' in the Bidi algorithm is done before actual reordering (which is done line by line), but after calculating all the levels. This is different from what you did in Emacs, which I'd call line-folding, i.e. cutting the line after a paragraph is laid out and reordered completely as a single (potentially very long) line. This makes some sense in Emacs, where the basic assumption is that lines should fit into the width of the view.

Regards,   Martin.
Re: about P1 part of BIDI alogrithm
On 2011/10/11 13:07, Eli Zaretskii wrote:
>> Date: Tue, 11 Oct 2011 10:53:39 +0900
>> From: Martin J. Dürst <due...@it.aoyama.ac.jp>
>> CC: li bo <libo@gmail.com>, unicode@unicode.org
>> This is different from what you did in Emacs, which I'd call line-folding, i.e. cutting the line after a paragraph is laid out and reordered completely as a single (potentially very long) line. This makes some sense in Emacs, where the basic assumption is that lines should fit into the width of the view.
> Sorry, I don't follow you. There's no such line-folding in the Emacs implementation of the UBA. A line that doesn't fit the window width is reordered as a whole. Conceptually, reordering is done before breaking a long line into continuation lines.

This is exactly what I meant. In Emacs, reordering is done before breaking a long line into smaller segments to fit into the width of the display window. I called this line-folding; you call it continuation lines. But in the bidi algorithm itself, line breaking (be it automatic due to a layout algorithm or explicit due to LS or something similar) is applied *before* reordering. This is very important, because otherwise, content that is logically earlier may appear on later lines, which would be very confusing for readers.

Regards,   Martin.
Re: about P1 part of BIDI alogrithm
Hello Eli,

There is absolutely no problem with treating the algorithm in UAX #9 as a set of requirements, and coming up with a totally different implementation that produces the same results. I think UAX #9 actually says so somewhere. But what is, strictly speaking, not allowed is to change the requirements. One requirement of the algorithm is that when lines are broken, logically earlier characters stay on earlier lines, and logically later characters move to later lines. In this respect, your implementation doesn't conform to UAX #9.

There's an external reason for this, and an internal one. The external reason is that continuation lines in Emacs are in general just an overflow device; text in Emacs isn't supposed to be broken into lines in the same way as e.g. word processors break lines to form paragraphs. I'm not sure how much this is true (line breaks often interfere e.g. with formatting in Japanese and other languages that don't use spaces between words and don't work well with the convention of converting a line break in the source to a space in the output), but I think to some extent it is true.

The internal reason is the one you describe below. It may indeed be a strong reason from an implementation perspective, but from a user perspective, it's a very weak reason. Also, I don't understand it fully. You say that the Emacs display engine examines each character in turn. Assuming these are in logical order, you would just examine them up to the point where you have about one line of glyphs. There would indeed be a bit of back and forth there because of the interaction between the bidi algorithm and glyph selection (but as far as I know, mirrored glyphs mostly have the same width as their originals). Anyway, that bit of back and forth seems to be much less of a problem than the back and forth that you get when you have to reorder over much larger distances because you're essentially considering a whole paragraph as a single line. But I'm not an expert in Emacs display engine details, so I can't say for sure.

Regards,   Martin.

On 2011/10/11 16:43, Eli Zaretskii wrote:
>> Date: Tue, 11 Oct 2011 10:53:39 +0900
>> From: Martin J. Dürst <due...@it.aoyama.ac.jp>
>> CC: li bo <libo@gmail.com>, unicode@unicode.org
>> I might add here that 'break a line' in the Bidi algorithm is done before actual reordering (which is done line-by-line), but after calculating all the levels.
> Please be aware that this separation of the UBA into phases makes no sense at all in the context of the Emacs display engine. The UBA is written from the POV of batch processing of a block of text -- you pass in a string in logical order, and receive a reordered string in return. The UBA describes the processing as a series of phases, each one of which is completed for all the characters in the block of text before the next phase begins.
> By contrast, the Emacs display engine examines the text to display one character at a time. For each character, it loads the necessary display and typeface information, and then decides whether it will fit the display line. Then it examines the next character, and so on. It should be clear that processing characters one by one completely disrupts the subdivision of the UBA into the phases that include examination of more than that single character, let alone decisions of where to break the line, because reordering can no longer be done line by line. Let me give you just one example: if the character should be mirrored, you cannot decide whether it fits the display line until _after_ you know what its mirrored glyph looks like. But mirroring is only resolved at a very late stage of reordering, so if you want to reorder _after_ breaking into display lines, you will have to back up and reconsider that decision after reordering, which will slow you down.
> Given these considerations, it is a small wonder that the UBA implementation inside Emacs is _very_ different from the description in UAX#9. Therefore, the subdivision into phases that are on the line and higher levels makes very little sense here, since the implementation needed to produce an identical result while performing significant surgery on the algorithm description. In effect, the UBA implementation in Emacs treated UAX#9 as a set of requirements, not as a high-level description of the implementation.
Re: about P1 part of BIDI alogrithm
Hello Kent,

I was also very much thinking that a mirrored glyph should be of the same width, but there might be subtle issues when you consider kerning. As a very basic example, think about kerning of the pair "K)", and then think about "K(".

Regards,   Martin.

On 2011/10/11 19:39, Kent Karlsson wrote:
> On 2011-10-11 09:43, Eli Zaretskii <e...@gnu.org> wrote:
>> Let me give you just one example: if the character should be mirrored, you cannot decide whether it fits the display line until _after_ you know what its mirrored glyph looks like. But mirroring is only resolved at a very late stage of reordering, so if you want to reorder _after_ breaking into display lines, you will have to back up and reconsider that decision after reordering, which will slow you down.
> Well, I think there is a silent (but reasonable, I would say) assumption that mirroring does not change the width of a glyph... I would think that if a font does not fulfill that, then you have a font problem (or mix-of-fonts problem), not a bidi problem. Glyphs for characters that may mirror do not normally form ligatures with other glyphs; and even if they do, the width of the ligature should not change relative to the total width of the pre-ligature glyphs involving glyphs for mirrorable characters (and if it does change anyway, you again have a font problem that may result in a somewhat ugly display that should be fixed by fixing the font, not a bidi problem). I'm not thinking about Emacs here, but in general. IMHO
> /Kent K
Wrong UTF-8 encoders still around?
I'm hoping to get some advice from people with experience with various Unicode/transcoding libraries.

RFC 3987 (the current IRI spec) has the following text:

   Note: Some older software transcoding to UTF-8 may produce illegal
   output for some input, in particular for characters outside the BMP
   (Basic Multilingual Plane). As an example, for the IRI with non-BMP
   characters (in XML Notation):
   "http://example.com/&#x10300;&#x10301;&#x10302;"
   which contains the first three letters of the Old Italic alphabet,
   the correct conversion to a URI is
   "http://example.com/%F0%90%8C%80%F0%90%8C%81%F0%90%8C%82"

We are thinking about removing this because we hope that software has improved in the meantime, but we would like to be sure about this. If anybody knows about software out there that still presents this problem, please tell us.

Thanks,   Martin.
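P.S.: As a quick Ruby illustration (mine) of the correct conversion, percent-encoding the UTF-8 bytes of the non-BMP characters directly; a broken transcoder that converts each UTF-16 surrogate separately would instead emit three-byte %ED%A0%80-style sequences:

  # Correct URI conversion percent-encodes the UTF-8 bytes (4 per character
  # here, since the characters are outside the BMP).
  iri_path = "\u{10300}\u{10301}\u{10302}"  # first three Old Italic letters
  uri_path = iri_path.bytes.map { |b| "%%%02X" % b }.join
  puts uri_path  # => %F0%90%8C%80%F0%90%8C%81%F0%90%8C%82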
Forum Problems
How can one use the Forum to comment on URI/IRI issues when one gets the message:

   Your message contains too many URLs. The maximum number of URLs allowed is 8.

I never liked this forum stuff too much, and this hasn't made things better :-(.

Regards,   Martin.
Default bidi ranges
I tried to find something like a normative description of the default bidi class of unassigned code points. UAX #9 says (http://www.unicode.org/reports/tr9/tr9-23.html#Bidirectional_Character_Types):

   Unassigned characters are given strong types in the algorithm. This
   is an explicit exception to the general Unicode conformance
   requirements with respect to unassigned characters. As characters
   become assigned in the future, these bidirectional types may change.
   For assignments to character types, see DerivedBidiClass.txt
   [DerivedBIDI] in the [UCD].

The DerivedBidiClass.txt file, as far as I understand, is mainly a condensation of bidi classes into character ranges (rather than giving them for each code point independently, as in UnicodeData.txt). I.e. it can at any moment be derived automatically from UnicodeData.txt, and is as such not normative. Why is it then that the default class assignments are only given in this file (unless I have overlooked something)? And why is it that they are only given in comments?

I'm trying to create a program that takes all the bidi assignments (including default ones) and creates the data part of a bidi algorithm implementation, but I don't feel confident coding against stuff that's in comments. Any advice? Is it possible that this could be fixed (making it more normative, and putting it in a form that's easier to process automatically)?

Regards,   Martin.
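P.S.: For what it's worth, the sketch below is the kind of workaround I mean: a Ruby fragment (my own; note that the exact comment conventions in the file have varied between versions) that scrapes both the data lines and the commented "@missing"-style default lines:

  # Collect explicit bidi-class ranges from data lines, and default ranges
  # from the "@missing" comment lines -- the part one shouldn't have to do.
  defaults, explicit = [], []
  File.foreach('DerivedBidiClass.txt') do |line|
    case line
    when /^#\s*@missing:\s*([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*(\w+)/
      defaults << [($1.hex..$2.hex), $3]    # only given in a comment!
    when /^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)/
      explicit << [($1.hex..($2 || $1).hex), $3]
    end
  end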
Re: missing characters: combining marks above runs of more than 2 base letters
On 2011/11/21 5:54, Asmus Freytag wrote:
> On 11/20/2011 8:00 AM, Joó Ádám wrote:
>> Leaving aside that CSS is presentation and not content, and is definitely not markup. HTML is a better candidate.
>> Á
> The details of the appearance of the mark would be presentation. The scoping, like for applying every other style feature, would have to be supplied via HTML, XML, you name it. I can see where you'd want something other than a generic span to provide that scoping.

I agree with Asmus here. It's important to point out that having it in CSS doesn't mean that it couldn't also go into HTML. But these days, anything presentational goes into CSS, and if there's markup with a default presentation, then HTML just mentions the markup, and for presentation defers to CSS. Putting it in CSS also means that it can be used from other kinds of markup (e.g. totally unrelated to HTML or even XML).

If you want to make serious progress, I propose checking what's in TEI (because that, and not HTML, is the markup of choice for these kinds of texts). If what's currently in TEI isn't sufficient in terms of markup, please work with them to improve the situation. Also, work with CSS to look into the presentation issues. In particular, look at what's already around from the presentation side of MathML. In HTML, it should always be possible to start generically, i.e. with a span and some class attributes.

Regards,   Martin.
Re: Unicode, SMS and year 2012
On 2012/04/28 4:26, Mark Davis ☕ wrote:
> Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values.

Because Punycode encodes differences between character numbers, not the character numbers themselves, it can indeed be quite efficient, in particular if the characters used are tightly packed (e.g. Greek, Hebrew, ...). For languages with Latin script and accented characters, the question is how close these accented characters are in Unicode.

However, Punycode also codes character positions. Because of this, it gets less efficient for longer text. [Because Punycode uses (circular) position differences rather than simple positions, this contribution is limited by the (rounded-up binary logarithm of the) weighted average distance between two occurrences of the same character in the text/language.]

My guess is therefore that Punycode won't necessarily be super-efficient for texts in the 100+ character range. It's difficult to test quickly because the Punycode converters on the Web limit the output to 63 characters, the maximum length of a label in a domain name.

Regards,   Martin.
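P.S.: In case somebody wants to experiment beyond what the Web converters allow, here is a minimal Ruby transcription of the RFC 3492 encoding algorithm (my own sketch, without the overflow checks a production encoder would need, and without any label-length limit):

  # Minimal RFC 3492 Punycode encoder: each non-ASCII character is encoded
  # as a delta that combines code point distance and (circular) position.
  PUNY_DIGITS = ('a'..'z').to_a + ('0'..'9').to_a

  def puny_adapt(delta, numpoints, firsttime)
    delta = firsttime ? delta / 700 : delta / 2   # damp = 700
    delta += delta / numpoints
    k = 0
    while delta > (35 * 26) / 2                   # (base - tmin) * tmax / 2
      delta /= 35
      k += 36
    end
    k + (36 * delta) / (delta + 38)               # skew = 38
  end

  def puny_encode(input)
    n, delta, bias = 128, 0, 72
    output = input.chars.select { |c| c.ord < 128 }.join
    h = b = output.length
    output += '-' if b > 0
    cps = input.codepoints
    while h < cps.length
      m = cps.select { |c| c >= n }.min
      delta += (m - n) * (h + 1)
      n = m
      cps.each do |c|
        delta += 1 if c < n
        next unless c == n
        q = delta
        k = 36
        loop do
          t = k <= bias ? 1 : (k >= bias + 26 ? 26 : k - bias)
          break if q < t
          output += PUNY_DIGITS[t + (q - t) % (36 - t)]
          q = (q - t) / (36 - t)
          k += 36
        end
        output += PUNY_DIGITS[q]
        bias = puny_adapt(delta, h + 1, h == b)
        delta = 0
        h += 1
      end
      delta += 1
      n += 1
    end
    output
  end

  puts puny_encode("bücher")   # => bcher-kva  (as in xn--bcher-kva)

With this, puny_encode("α" * 57) gives "mx" followed by 57 a's (the first α costs three digits, every repetition just one), i.e. 59 characters, or 63 with the "xn--" prefix.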
Re: Unicode, SMS and year 2012
On 2012/04/28 7:29, Cristian Secară wrote:
> On Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ wrote:
>> Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values.
> I suspect the Punycode goal is to take a wide character set into a restricted character set, without caring much about the resulting string length; if the original string happens to be in another character set than the target restricted character set, then the string length increases too much to be of interest in the SMS discussion.

Not exactly. Compression was very much a goal when designing Punycode. It won against a number of other algorithms as the choice for IDNs and is clearly very good for that purpose.

> Just do a test: write something in a non-Latin alphabetic script into this page here http://demo.icu-project.org/icu-bin/idnbrowser

Well, as a silly example, what about α repeated 57 times? The result is "xn--mxa" followed by 56 more a's, which is 63 characters long.

Regards,   Martin.
Re: Unicode, SMS and year 2012
On 2012/04/27 17:06, Cristian Secară wrote:
> It turned out that they (ETSI and its groups) created a way to solve the 70-character limitation, namely the “National Language Single Shift” and “National Language Locking Shift” mechanisms. This is described in the 3GPP TS 23.038 standard, and it was introduced with release 8. In short, it is about a character substitution table, per character or per message, defined per language. Personally I find this to be a stone-age-like approach,

Fully agreed.

> which in my opinion does not work at all if I enter the message from my PC keyboard via the phone's PC application (because the language cannot always be predicted, mainly if I am using dead keys). It is true that the actual SMS stream limit is not very generous, but I wonder if SCSU would have been a better approach in terms of i18n. I also don't know if SCSU requires a language to be declared beforehand, or if it simply guesses by itself the required window for each character.

The right approach in this case isn't to discuss clever compression techniques (I've indulged in this in my other mails, too, sorry), but to realize that the underlying mobile/wireless technology has advanced a lot. SMSes are simply a relict of outdated technology, sold at a horrendous price. For more information, see e.g. http://mobile.slashdot.org/comments.pl?sid=433536&cid=22219254 or http://gthing.net/the-true-price-of-sms-messages. That's even for the case of pure ASCII messages.

The solution is simply to stop using SMSes, and to upgrade to a better technology.

Regards,   Martin.
Re: Unicode, SMS and year 2012
On 2012/04/29 18:58, Szelp, A. Sz. wrote:
> While there are good reasons the authors of HTML5 brought to ignore SCSU or BOCU-1, having excluded UTF-32, which is the most direct, one-to-one mapping of Unicode codepoints to byte values, seems shortsighted.

Well, except that it's hopelessly inefficient and therefore essentially nobody is using it.

> We are talking about the whole of Unicode, not just the BMP.

Yes. For transmission, use UTF-8 (or maybe UTF-16).

Regards,   Martin.
Flag tags (was: Re: Unicode 6.2 to Support the Turkish Lira Sign)
On 2012/05/29 17:43, Asmus Freytag wrote:
> On 5/27/2012 5:52 PM, Michael Everson wrote:
>> Get over it. Please just get over it. It doesn't matter. It's a blort.
> Time to agree with Michael. "Get over it" is good advice here. Sovereign countries are free to decree currency symbols, whatever their motivation or the putative artistic or typographic merits of the symbol in question. Not for Unicode to judge.

I'd have to agree here.

On a slightly (although maybe only slightly) related matter, what if Unicode didn't judge how difficult it should be to display national flags? Creating a way to display flags from two-tag combinations, and then later realizing that a sequence of such tags doesn't locally parse and the whole thing has to be redone, doesn't seem like a very good alternative to just encoding these things (not that I think that just encoding them is a very good alternative either, though).

Regards,   Martin.
Re: Unicode 6.2 to Support the Turkish Lira Sign
On 2012/05/30 4:42, Roozbeh Pournader wrote:
> Just look what happened when the Japanese did their own font/character set hack. The backslash/yen problem is still with us, to this day...

To be fair, the Japanese yen at 0x5C was there long before Unicode, in the Japanese version of ISO 646. That it has remained as a font hack is very unfortunate, but for that, not only the Japanese, but also major international vendors are to blame.

Regards,   Martin.
Re: Too narrowly defined: DIVISION SIGN COLON
On 2012/07/11 4:37, Asmus Freytag wrote:
> I recall, with certainty, having seen the : in the context of elementary instruction in arithmetic, as in 4 : 2 = ?, but am no longer positive about seeing ÷ in the same context.

I remember this very well. In grade school, we had to learn two ways to divide, which were distinguished by using two symbols, ':' and '÷', and different verbs, the German equivalents of "divide" and "measure". I'll explain the difference with two examples:

a) There are 12 apples, and four kids. How many apples does each kid get? [answer: 3 apples]

b) There are 12 apples, and each kid gets 4 of them. For how many kids will that be enough? [answer: for 3 kids]

I think a) was called 'divide' and b) was called 'measure', but I can't remember which symbol was used for which. When we were learning this, I thought it was a bit silly, because the numbers were the same anyway. It seems to have been based on the observation that at a certain stage in the development of arithmetic skills, children may be able to do division (in the general, numeric sense) one way but not the other, or that they get confused about the units in the answer. But while such an observation may be true, I don't think such a stage lasts very long, definitely not as long as we had to keep the distinction (at least through second and third grade).

Also, I think this may have been a local phenomenon, both in place and time. But if one searches for "geteilt gemessen", one gets links such as this: http://www.niska198.de.tl/Gemessen-oder-Geteilt-f-.htm So maybe some of this is still in use.

Regards,   Martin.
Re: Too narrowly defined: DIVISION SIGN COLON
On 2012/07/11 10:35, Stephan Stiller wrote: About Martin Dürst's content re geteilt-gemessen: When I attended the German school system in approx the 1990s this distinction wasn't mentioned or taught. (I prefer to not give details about specific time and place for privacy reasons.) Sorry, but I forgot to mention that my experience was in Switzerland, in the late 1960s. Actually, given that the education system in Switzerland is handled by the Cantons, I should say that it was in the Canton of Zurich. Regards, Martin. From looking into textbooks and formula collections at that time I recall not having found any mention of or explanation for such a differentiation. Given that I also haven't seen many people use that symbol I would suspect that, for some time, this was an elementary school thing in Germany. For me, the symbol ÷ also only ever appeared on calculators. I don't think it ever appeared in primary or secondary school textbooks I've worked with, and it wasn't used for handwritten arithmetic at my schools either. Stephan PS: Thank you! You've just solved a mystery for me - something I've been told about a long time ago by an older person but couldn't find references for at the time.
Re: Sinhala naming conventions
On 2012/07/11 11:04, Mark E. Shoulson wrote: Ever start to feel that we would have been better off not to give official descriptive names at all? Or else really vague ones like LETTERLIKE THINGY NUMBER 5412? So much blood-pressure raised over the names... I've been feeling that way since about the mid-1990s, since I discovered that for CJK ideographs, there is a cop-out of CJK UNIFIED IDEOGRAPH-4E00 and so on. It's also the only place where numerals are allowed in character names. Regards, Martin.
Re: pre-HTML5 and the BOM
On 2012/07/13 0:12, Leif Halvard Silli wrote: Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600: and people who want to create or modify UTF-8 files which will be consumed by a process that is intolerant of the signature should not use Notepad. That goes for HTML (pre-5) pages [snip] HTML5-parsers MUST support UTF-8. They do not need to support any other encoding. Pre-HTML5-parsers are not required to support the UTF-8 encoding - or any other particular encoding. Up to here, that's indeed what the spec says, except for XHTML, which is XML and therefore includes UTF-8 (and UTF-16) support, but my guess is that you didn't include this. But when they do support the UTF-8 encoding, they are, however, not permitted to be 'intolerant' of the BOM. Where does it say so? Regards, Martin. Thus there is nothing special with regard to the UTF-8 BOM and pre-HTML5 HTML.
Re: pre-HTML5 and the BOM
On 2012/07/13 22:31, Jukka K. Korpela wrote: 2012-07-13 16:12, Leif Halvard Silli wrote: The kind of BOM intolerance I know about in user agents is that some text browsers and IE5 for Mac (abandoned) convert the BOM into a (typically empty) line at the start of the body element. I wonder if there is any evidence of browsers currently in use that have problems with BOM. I'd assume that so-called modern browsers don't have such problems. I suppose such browsers existed, though I can't be sure. They indeed did exist. In any case, for several years I haven't seen any descriptions of real-life observations, but there are rumors and warnings, and people get disturbed. Even reputable sites have instructions against using BOM: When the BOM is used in web pages or editors for UTF-8 encoded content it can sometimes introduce blank spaces or short sequences of strange-looking characters (such as ). For this reason, it is usually best for interoperability to omit the BOM, when given a choice, for UTF-8 content. http://www.w3.org/International/questions/qa-byte-order-mark This could be toned down a bit, but I still agree (and the Unicode consortium says the same): There may be good reasons to use a BOM, but if these reasons don't apply, then don't use it. Regards,Martin.
Re: pre-HTML5 and the BOM
On 2012/07/14 1:33, Philippe Verdy wrote: From: Jukka K. Korpelajkorp...@cs.tut.fi When the BOM is used in web pages or editors for UTF-8 encoded content it can sometimes introduce blank spaces or short sequences of strange-looking characters (such as ). For this reason, it is usually best for interoperability to omit the BOM, when given a choice, for UTF-8 content. http://www.w3.org/International/questions/qa-byte-order-mark This statement about maximum interoperability may have been true in the past, when Unicode support was not so universal and still not formally adopted for all newer developments in RFCs published by the IETF. But now the situation is reversed: maximum interoperability is offered when BOMs are present, not really to indicate the byte order itself, but to confirm that the content is Unicode encoded and extremely likely to be text content and not arbitrary binary content (which today almost always uses a distinctive leading signature). As you mention the IETF, what people in the IETF like most about UTF-8 is that it's upward-compatible with ASCII. Because the protocol/syntax-relevant part is usually ASCII only, that means that a lot of stuff can work just by making things 8-bit clean (which in this day and age may mean essentially no work in some cases). A BOM anywhere in a protocol therefore just removes the biggest advantage of UTF-8. While it's usually okay to use a BOM at the start of a whole file (or the file equivalent in transmission, which is a MIME entity), anywhere else (e.g. in small protocol fields), a BOM is a big no-no. Regards, Martin.
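A minimal Python sketch of the distinction made here between a BOM at the start of a whole file and a BOM in a protocol field: the stock 'utf-8-sig' codec tolerates and strips one leading U+FEFF when decoding, while plain 'utf-8' keeps it as an ordinary character. (The byte string is just an illustrative example.)

data = b'\xef\xbb\xbfHello'           # UTF-8 bytes with a leading BOM/signature

print(data.decode('utf-8'))           # '\ufeffHello' -- the BOM survives as a character
print(data.decode('utf-8-sig'))       # 'Hello'       -- the BOM is stripped

# For small protocol fields, no BOM should be present in the first place,
# so plain 'utf-8' with strict error handling is the safer choice there.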
Re: pre-HTML5 and the BOM
On 2012/07/17 17:22, Leif Halvard Silli wrote: And an argument was put forward on the WHATWG mailing list earlier this year/end of the previous year, that a page with strictly ASCII characters inside could still contain character entities/references for characters outside ASCII. Of course they can. That's the whole point of using numeric character references. I'm rather surprised that this was even discussed in the context of HTML5. For instance, early on in 'the Web', some appeared to think that all non-ASCII had to be represented as entities. Yes indeed. There's still some such stuff around. It's mostly unnecessary, but it doesn't hurt. Regards,Martin.
Re: pre-HTML5 and the BOM
Hello Leif, Sorry to be late with my answer. On 2012/07/13 20:44, Leif Halvard Silli wrote: Martin J. Dürst, Fri, 13 Jul 2012 18:17:05 +0900: On 2012/07/13 0:12, Leif Halvard Silli wrote: Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600: and people who want to create or modify UTF-8 files which will be consumed by a process that is intolerant of the signature should not use Notepad. That goes for HTML (pre-5) pages [snip] HTML5-parsers MUST support UTF-8. They do not need to support any other encoding. Pre-HTML5-parsers are not required to support the UTF-8 encoding - or any other particular encoding. Up to here, that's indeed what the spec says, except for XHTML, which is XML and therefore includes UTF-8 (and UTF-16) support, but my guess is that you didn't include this. Right. I meant pre-HTML5 HTML as text/html. Not pre-HTML5 HTML as XML. But when they do support the UTF-8 encoding, they are, however, not permitted to be 'intolerant' of the BOM. Where does it say so? What is 'it'? That pre-HTML5 (as text/html) browsers are not permitted to be 'intolerant' of the BOM. HTML5 tells how UAs should use the BOM to decide the encoding. By pre-HTML5, I meant the 'text/html' MIME space, though I gave much weight to HTML4 ... I see that HTML4 for UTF-8 points to RFC2279,[1] which was silent about the UTF-8 BOM. Only with RFC3629 from 2003 is the UTF-8 BOM described.[3] Yes exactly. In the RFC 2070 and HTML4 time-frame, nobody that I know was thinking about a BOM for UTF-8. Only later did BOMs at the start of HTML4 pages start to turn up, and browser makers were surprised. Roughly the same happened for XML. Early XML parsers didn't handle the BOM. When Windows Notepad started to use the BOM to distinguish between UTF-8 and ANSI (the local system legacy encoding), this BOM leaked into HTML, and was difficult to stop. So XML got updated, and parsers started to get updated, too. As for XML 1.0, revision 2 from the year 2000 appears to be the first time the XML spec describes the UTF-8 BOM.[4] The Appendix C 'profile' of XHTML 1.0 - which was issued in 2000 and revised in 2002 - is also part of the text/html MIME registration of June 2000.[5] The MIME registration contains a general mention of UTF-8 as preferred, but does not talk about the UTF-8 BOM. XHTML 1.0 itself strangely enough does not reflect much on whether XML's default encoding(s) apply with regard to serving XHTML as text/html.[6] Though, it does actually say, in appendix C: [7] Remember, however, that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16. Here it does sound as if XHTML, even when served according to appendix C, should subject itself to XML's encoding rules. So, given the age of the documents, neither HTML4 from 1999 nor the 'text/html' MIME registration permits anyone to be 'intolerant' of the UTF-8 BOM, but neither does either of them permit anyone to be 'tolerant' of it. They are silent on the issue. You read silence as not taking sides, which makes sense from your viewpoint. Knowing what implementations did (in a pre-1999 time-frame), the idea of a UTF-8 BOM just didn't really exist, so nobody thought about mentioning it. Regards, Martin. RFC3629 says that protocols may restrict usage of the BOM as a signature.[3] However, text/html does not offer any such restrictions. If one sees HTML4 as being as tied to RFC2279 as XML, up until and including the 4th revision, was tied to specific versions of Unicode, then this has not changed. 
But would it not be natural to consider that text/html user agents currently have to consider RFC3629 as more normative than RFC2279? At least, I do not think that user agents that want to be conforming pre-HTML5 user agents have any justification for ignoring the BOM. [1] http://www.w3.org/TR/html401/appendix/notes#h-B.2.1 [2] http://tools.ietf.org/html/rfc2279 [3] http://tools.ietf.org/html/rfc3629#section-6 [4] http://www.w3.org/TR/2000/WD-xml-2e-2814 [5] http://tools.ietf.org/html/rfc2854 [6] http://www.w3.org/TR/xhtml1/#C_9 [7] http://www.w3.org/TR/xhtml1/#C_1 Thus there is nothing special with regard to the UTF-8 BOM and pre-HTML5 HTML.
Re: pre-HTML5 and the BOM
Hello Leif, On 2012/07/18 4:35, Leif Halvard Silli wrote: But is the Windows Notepad really to blame? Pretty much so. There may have been other products from Microsoft that also did it, but with respect to forcing browsers and XML parsers to accept a UTF-8 BOM as a signature, Notepad was definitely the main cause, by far. OK, it was leading the way. But can we think of something that could have worked better, in practice? And, no, I don't mean 'better' as in 'not leaking the BOM into HTML'. I mean 'better' as in 'spreading UTF-8 to the masses'. UTF-8 is easy and cheap to detect heuristically. It takes a bit more work to scan the whole file than to just look at the first few bytes, but then I don't think anybody is/was editing 1MB files in Notepad. So the BOM/signature is definitely not the reason that UTF-8 spread on the Web and elsewhere. The spread of UTF-8 is due to its strict US-ASCII compatibility. Every US-ASCII character/byte represents the same character, and only that character, in UTF-8. A plain ASCII file is a UTF-8 file. If syntax-significant characters are ASCII, then (close to) nothing may need to change when moving from a legacy encoding to UTF-8. On top of that, character synchronization is very easy because leading bytes and trailing bytes have strictly separate values. From that viewpoint, the BOM is a problem rather than a solution. … snip … So, given the age of the documents, neither HTML4 from 1999 nor the 'text/html' MIME registration permits anyone to be 'intolerant' of the UTF-8 BOM, but neither does either of them permit anyone to be 'tolerant' of it. They are silent on the issue. You read silence as not taking sides, which makes sense from your viewpoint. Knowing what implementations did (in a pre-1999 time-frame), the idea of a UTF-8 BOM just didn't really exist, so nobody thought about mentioning it. It is interesting to think about this history. And the fact that it went unrealized. Maybe _that_ is due to the fact that, back then, one saw XML as the way forward - which meant that there was not the same need for the UTF-8 BOM due to XML's default to UTF-8. However, I think there are two ways to interpret Pre-HTML5: Historic, about 1998. Or current, about choices today: 'this browser is fully dedicated to HTML4 but does not intend to implement HTML5'. Pointing to HTML4 for lack of BOM implementation would be a very thin excuse. I think that a browser fully dedicated to HTML4 but not intending to implement HTML5 will eventually die out. If it exists today, it would indeed be reasonable to accept the BOM. But that's not because reading the spec(s) leads to that as the only conclusion, it's because there's content out there that starts with a BOM. Regards, Martin.
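The "easy and cheap to detect heuristically" point can be illustrated with a small Python sketch (a toy detector, not a full charset sniffer): non-ASCII text in a legacy encoding almost never decodes as strict UTF-8, precisely because UTF-8 lead bytes and trailing bytes occupy disjoint value ranges.

def looks_like_utf8(data: bytes) -> bool:
    # Toy heuristic: data that decodes as strict UTF-8 is almost
    # certainly UTF-8 (pure ASCII trivially qualifies as well).
    try:
        data.decode('utf-8', errors='strict')
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8('Dürst'.encode('utf-8')))    # True
print(looks_like_utf8('Dürst'.encode('latin-1')))  # False: stray 0xFC byte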
Re: UTF-8 BOM (Re: Charset declaration in HTML)
Hello Philippe, On 2012/07/18 3:37, Philippe Verdy wrote: 2012/7/17 Julian Bradfieldjcb+unic...@inf.ed.ac.uk: On 2012-07-16, Philippe Verdyverd...@wanadoo.fr wrote: I am also convinced that even Shell interpreters on Linux/Unix should recognize and accept the leading BOM before the hash/bang starting line (which is commonly used for filetype identification and runtime The kernel doesn't know or care about character sets. It has a little knowledge of ASCII (or possibly EBCDIC) hardwired, but otherwise it deals with 8-bit bytes. It has no concept of text file. Yes I know. But most tools and scripts should know which type of file they are operating on. Unfortunately the tools are agnostic as well and just rely on things that do not pass the transport protocols. Such as filename conventions. Just writing that you are convinced about something a shell should do doesn't change anything. Maybe you can create a patch (or a few patches, because there are quite a few tools out there in the Linux/Unix world) and see if you can convince the respective maintainers that it's indeed a good idea. [Like others with some amount of Linux/Unix background, I strongly doubt that for Linux/Unix, the BOM is a good idea.] Regards, Martin.
Re: pre-HTML5 and the BOM
Hello Jukka, On 2012/07/17 23:31, Jukka K. Korpela wrote: 2012-07-17 17:11, Leif Halvard Silli wrote: For instance, early on in 'the Web', some appeared to think that all non-ASCII had to be represented as entities. Yes indeed. There's still some such stuff around. It's mostly unnecessary, but it doesn't hurt. Actually, above I described an example where it did hurt ... The situation is comparable to the BOM issue. In a very general sense, probably yes. In the old days, it was considered (with good reasons presumably) safer to omit the BOM than to use it in UTF-8, Yes indeed. and it was considered safer to use entity references rather than direct non-ASCII data. Well, the 'considered' in the BOM case applies to everybody (including the W3C), but in the character references case, it applies only to people who didn't understand how things were working. In fact, although RFC 2070 and HTML4 clearly nailed down the interpretation of numeric character references to Unicode, there were implementations that got this wrong (the ones I know of were in the mobile space) even past 2000. To take a more modern example, the native e-mail client on my Android seems to systematically display character and entity references literally when displaying message headers with small excerpts of content, even though it correctly interprets them when displaying the message itself. The reason for this may simply be that email bodies can be in HTML, but that there is no way at all to use HTML in email header fields. Regards, Martin.
Re: UTF-8 BOM (Re: Charset declaration in HTML)
Hello Doug, On 2012/07/18 0:35, Doug Ewell wrote: For those who haven't had enough of this debate yet, here's a link to an informative blog (with some informative comments) from Michael Kaplan: Every character has a story #4: U+feff (alternate title: UTF-8 is the BOM, dude!) http://blogs.msdn.com/b/michkap/archive/2005/01/20/357028.aspx What should be interesting is that this blog dates to January 2005, seven and a half years ago, and yet includes the following: But every 4-6 months another huge thread on the Unicode List gets started Well, sometimes less and sometimes more than 4-6 months, but yes. about how bad the BOM is for UTF-8 and how it breaks UNIX tools that have been around and able to support UTF-8 without change for decades Yes indeed. The BOM and Unix/Linux tools don't work well together. and about how Microsoft is evil for shipping Notepad that causes all of these problems That's a bit overblown, but I guess for a Microsoft employee, it looks like this. and how neither the W3C nor Unicode would have ever supported a UTF-8 BOM if Microsoft did not have Notepad doing it, That's true, too. It was indeed Notepad that brought the UTF-8 BOM/signature to the attention of the W3C and the browser makers. The problem with the BOM in UTF-8 is that it can be quite helpful (for quickly distinguishing between UTF-8 and legacy-encoded files) and quite damaging (for programs that use the Unix/Linux model of text processing), and that's why it creates so much controversy. Regards, Martin.
Re: pre-HTML5 and the BOM
On 2012/07/18 16:35, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900: The best reason is simply that nobody should be using crutches as long as they can walk with their own legs. Crutches, in that sense, is only about authoring convenience. And, of course, it is a difference between using named and numeric character references for a single non-ASCII letter as opposed to using it for all of them. Nevertheless: I, as Web author, would perhaps skip that convenience if I knew that doing so could improve e.g. HTML5 browser's ability to sniff the encoding correctly when all other encoding info is lost. If such sniffing can be an alternative to the BOM, and the BOM is questionable, then why not mention it as a reason to avoid the crutches? I'm not sure there are many people for whom using named character entities or numeric character references is a convenience. But for those for whom it is a convenience, let them use it. Regards, Martin.
Re: pre-HTML5 and the BOM
Hello Leif, I think that more and more, we are on the wrong mailing list. Regards, Martin. On 2012/07/18 18:47, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 17:20:31 +0900: On 2012/07/18 16:35, Leif Halvard Silli wrote: Martin J. Dürst, Wed, 18 Jul 2012 11:00:42 +0900: The best reason is simply that nobody should be using crutches as long as they can walk with their own legs. Crutches, in that sense, is only about authoring convenience. […] Nevertheless: I, as Web author, would perhaps skip that convenience if I knew that doing so could improve e.g. HTML5 browser's ability to sniff the encoding correctly […] I'm not sure there are many people for whom using named character entities or numeric character references is a convenience. But for those for whom it is a convenience, let them use it. By all means: Let them. But the W3C's I18N working group still gives out advice about when to (not) use escapes.[1] Advice which the homepage of W3.org breaks - since every non-ASCII character of http://www.w3.org is escaped. What the I18N group says in that document is a bit moralistic (along the lines of 'please think about how difficult it is for non-English authors to read escapes for all their characters'). It seems to me that a mention of real effects on browser behavior could be a better form of advice. Especially when coupled with advice about avoiding the BOM.[2] [1] http://www.w3.org/International/techniques/authoring-html#escapes [2] http://www.w3.org/International/questions/qa-byte-order-mark#bomhow
Re: Unicode String Models
On 2012/07/21 7:01, David Starner wrote: I'm concerned about the statement/implication that one can optimize for ASCII and Latin-1. It's too easy for a lot of developers to test speed with the English/European documents they have around and test correctness only with Chinese. I see the argument in theory and practice, but it's a tough line to walk, especially if you're not familiar with i18n. I can see for i in range (1, 1000) do a := ; a +:= 龜; done being way slower than necessary (especially for non-trivially optimized away cases), for example. The main problem with the above loop isn't ASCII vs. Chinese or some such. It's that depending on the way the programming language handles Strings, it will result in a painter's algorithm phenomenon (see http://www.joelonsoftware.com/articles/fog000319.html). Regards, Martin.
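For readers unfamiliar with the painter's algorithm effect, here is a small Python sketch of the two approaches (the function names are purely illustrative). In implementations where strings are immutable and each += copies the accumulated string (CPython sometimes optimizes this particular pattern away), the first version is quadratic and the second linear.

def build_slow(n: int) -> str:
    # Worst case quadratic: each += may copy everything built so far.
    a = ''
    for _ in range(n):
        a += '龜'
    return a

def build_fast(n: int) -> str:
    # Linear: collect the pieces and copy them exactly once at the end.
    return ''.join('龜' for _ in range(n))

assert build_slow(1000) == build_fast(1000)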
Re: Is the Subject field of an e-mail an obvious example of plain text where no higher level protocol application is possible?
Hello Karl, On 2012/07/21 0:41, Karl Pentzlin wrote: Looking for an example of plain text which is obvious to anybody, it seems to me that the Subject field of e-mails is a good example. Common e-mail software lets you enter any text but never gives you access to any higher-level protocol. Possibly you can select the font in which the subject line is shown, but this is completely independent of the font in which your subject line is shown at the recipient's end. Thus, you transfer here plain text, and you can use exactly the characters which either Unicode provides to you, or which are PUA characters which you have agreed upon with the recipient before. In fact, the de-facto-standard regulating the e-mail content (RFC 2822, dated April 2001 http://www.ietf.org/rfc/rfc2822.txt , afaik) No. If you go to http://tools.ietf.org/html/rfc2822, you'll see Obsoleted by: 5322, Updated by: 5335, 5336. RFC 5322 is the new version, dated October 2008, but doesn't change much. RFC 5335 and 5336 are experimental for encoding the Subject (and a lot of other fields) as raw UTF-8 if the email infrastructure supports it. There are Standards Track updates for these two, RFC 6531 and 6532. But what's more important for your question, at least in theory, is http://tools.ietf.org/html/rfc2231, which defines a way to add language information to header fields such as Subject:. With such information, it would cease to be plain text. In practice, RFC 2231 is not well known, and even less used, so except for detailed technical discussion, your example should be good enough. Regards, Martin. defines the content of the Subject line as unstructured (p.25), which means that it has to consist of US-ASCII characters, which in turn can denote other (e.g. Unicode) characters by the application of MIME protocols. Thus, the result is an unstructured character sequence. There is e.g. no possibility to include superscripted/subscripted characters in a Subject of an e-mail, unless these characters are in fact included as superscript/subscript characters in Unicode directly. Thus, proving the necessity to include a character in the text of a Subject line of an e-mail is proving that the character has to be available as a plain text character. If, additionally, the character is used outside a closed group (which can be advised to use PUA characters), then there is a valid argument to include such a character in Unicode. Is my assumption correct? (I think of the SUBSCRIPT SOLIDUS proposed in WG2 N3980. It is in fact annoying that you cannot address DIN EN 13501 requirements in an e-mail subject line written correctly, as Unicode, although being an industry standard, has until now not listened to an industry request on this specific topic.) - Karl
Re: Character set cluelessness
Richard - Complex script usually refers to scripts where rendering isn't just simply putting glyphs side by side. That includes stuff with combining marks, ligatures, reordering, stacking, and the like. Regards, Martin. On 2012/10/03 7:09, Richard Wordingham wrote: On Tue, 02 Oct 2012 09:14:08 -0700 Doug Ewelld...@ewellic.org wrote: It's 2012. How does one get through to folks like this? Even people who should know better can get confused about character sets. Does anyone know what 'a complex script Unicode range' is? It's a term that occurs in the Office Open XML specification, but I can't find a definition for it. It's just possible that it means a range where hypothetically unassigned characters would not be left-to-right, but I've a feeling it ought to include Vietnamese characters for all that they're Latin script. Possibly the definitions have not been provided because the concept ought to involve the tricky task of breaking text runs into script runs. (Lots of people feel one should be able to add script-specific combining marks to U+25CC DOTTED CIRCLE, U+2013 EN DASH and U+00D7 MULTIPLICATION SIGN or perhaps even U+0078 LATIN SMALL LETTER X. U+0964 DEVANAGARI DANDA is used with the Latin, Devanagari and Tamil scripts, to name but a few.) Richard.
Re: Character set cluelessness
So in order to get something going here, why doesn't Doug draft a letter to these guys (possibly based on the one from a few years ago) and then Mark send it off in his position at Unicode, which hopefully will impress them more than just a personal contribution. Being upset on this list (which I am too, of course) doesn't change anything. Regards, Martin. On 2012/10/03 6:15, Doug Ewell wrote: Mark Davis mark at macchiato dot com wrote: I tend to agree. What would be useful is to have one column for the city in the local language (or more columns for multilingual cities), but it is extremely useful to have an ASCII version as well. They have two name fields, one (Name) for the name transliterated into Latin, and a second (NameWoDiacritics) which is an ASCII-smashed version of the first. Again, that's fine as long as I am free to ignore the ASCII version. They don't attempt to represent names in non-Latin scripts, which is not my beef here. There are many names in the Name (i.e. beyond ASCII) field that include characters beyond 8859-1, such as œ and z̆, and certainly many beyond CP437. This is a good thing (although there are some errors, not as many as in past years), but they need to fix their documentation to reflect what they actually do, and not make these irrelevant, misleading, and/or inaccurate references to 437 and to a 19-year-old version of 10646. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Missing geometric shapes
On 2012/11/08 19:15, Michael Everson wrote: On 8 Nov 2012, at 09:59, Simon Montagusmont...@smontagu.org wrote: Please take into account that the half-stars should be symmetric-swapped in RTL text. I attach an example from an advertisement for a movie published in Haaretz on 2 November 2012 I don't think Geometric Shapes have the mirror property. 2605;BLACK STAR;So;0;ON;N; 2606;WHITE STAR;So;0;ON;N; Well, those are usually symmetric, so adding a mirror property wouldn't change much. In a Hebrew context you'd just choose the star you wanted (black-white vs white-black) and use it. That works well if the text is written by hand. If it is produced as part of a script that should work the same for many languages, symmetric swapping would really be very helpful. Regards, Martin.
Re: Caret
On 2012/11/13 21:49, Eli Zaretskii wrote: I'd welcome that. Although the reality flies in the face of user requirements in this case: most bidi-aware editors, including my own work in Emacs, don't have 2 carets, for some reason. Maybe the developers didn't consider that important enough, or maybe it's just too darn hard... What's the specific reason in the case of your Emacs work (which I very much appreciate!)? Regards, Martin.
Re: latin1 decoder implementation
Just in case it helps, Ruby (since version 1.9) also uses 3). Regards, Martin. On 2012/11/17 6:48, Buck Golemon wrote: When decoding bytes to unicode using the latin1 scheme, there are three options for bytes not defined in the ISO-8859-1 standard. 1) Throw an error. 2) Insert the replacement glyph (fffd), indicating an unknown character. 3) Insert the unicode character with equal value. This means that completely random bytes will always decode successfully. The Python language currently implements option three. Is this correct? There is an option to produce errors or replacements for encodings which have undefined characters, but as implemented, latin1 currently defines characters for all 256 bytes, so the option does nothing. Restated, are the first 256 characters of unicode intended to be exactly compatible with a latin1 codec? This would imply that unicode has inserted character definitions into the ISO-8859-1 standard.
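For concreteness, a small Python sketch of option 3 (and, per the reply above, Ruby 1.9 and later behave the same way): under the 'latin-1' codec every byte decodes to the code point with the same numeric value, including the 0x80-0x9F range, so arbitrary bytes always decode and always round-trip.

# Option 3: every byte maps to the code point with the same value.
data = bytes(range(256))
text = data.decode('latin-1')

assert all(ord(ch) == i for i, ch in enumerate(text))
assert text.encode('latin-1') == data        # lossless round trip

print(hex(ord(b'\x85'.decode('latin-1'))))   # 0x85 -- a C1 control, not U+2026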
Re: latin1 decoder implementation
On 2012/11/17 9:45, Doug Ewell wrote: If he is targeting HTML5, then none of this matters, because HTML5 says that ISO 8859-1 is really Windows-1252. Yes. But unless Python wants to limit its use to HTML5, this should be handled on a separate level (mapping a iso-8859-1 label to the Windows-1252 decoder logic), not by trying to change ISO-8859-1 itself. Regards, Martin. For example, there is no C1 control called NL in Windows-1252. There is only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell From: Philippe Verdy Sent: Friday, November 16, 2012 17:35 To: Whistler, Ken Cc: Buck Golemon ; unicode@unicode.org Subject: Re: latin1 decoder implementation In fact not really, because Unicode DOES assign more precise semantics to a few of these controls, notably for those given whitespace and newline properties (notably TAB, LF, CR in C0 controls and NL in C1 controls, with a few additional constraints for the CR+LF sequence) as they are part of almost all plain text protocols ; NUL also has a specific behavior which is so common that it cannot be mapped to anything else than a terminator or separator of plain text sequences. So even if the ISO/IEC 8859 standard does not specify a charecter mapping in C0 and C1 controls, the registered MIME types are doing so (but nothing is well defined for the C0 and C1 controls except NUL, TAB, CR, LF, NL, for MIME usages purpose). And then yes, the ISO/IEC 8859 standard is different (more restrictive) from the MIME charsets defined by the IETF in some RFC's (and registered in the IANA registry), simply because the ISO/IEC standard (encoded charset) was developed to be compatible with various encoding schemes, some of them defined by ISO, some others defined by other standard European or East-Asian bodies (including 7-bit schemes, using escape sequences, or shift in/out controls). By itself, the ISO/IEC 8859 is not a complete encoding scheme, it is just defining several encoded character sets, independantly of the encoding schme used to store or transport it (it is not even sufficient to represent any plain-text content). On the opposite, The MIME charsets named ISO_8859-* registered by the IETF in the IANA registry are concrete encoding schemes, based on the ISO/IEC 8859 standard, and suitable for representing a plain-text content, because the MIME charsets are also adding a text presentation protocol. In practice, almost nobody today uses the ISO/IEC 8859 standard alone : there's always an additional concrete protocol added on top of it (which generally makes use of the C0 and C1 controls, but not necessarily, and not always the same way). So plain-text documents never use the ISO/IEC 8859 standard, but the MIME charsets (plus a few specific or proprietary charsets that have not been registered in the IANA registry as they are bound to a non-open protocol).
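A sketch, in Python, of the "separate level" suggested above: keep the real ISO-8859-1 codec untouched, and obtain the browser behavior by mapping the *label* to windows-1252 before picking a decoder. The label table here is a tiny illustrative subset, not the full WHATWG registry.

# Illustrative subset of a browser-style label-to-codec mapping;
# the authoritative table lives in the WHATWG Encoding specification.
HTML_LABEL_TO_CODEC = {
    'iso-8859-1': 'windows-1252',
    'latin1': 'windows-1252',
    'us-ascii': 'windows-1252',
    'windows-1252': 'windows-1252',
    'utf-8': 'utf-8',
}

def decode_as_browsers_do(data: bytes, label: str) -> str:
    return data.decode(HTML_LABEL_TO_CODEC[label.strip().lower()])

print(repr(b'caf\x92'.decode('iso-8859-1')))            # 'caf\x92' -- a C1 control
print(decode_as_browsers_do(b'caf\x92', 'ISO-8859-1'))  # caf’ -- 0x92 becomes U+2019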
Re: latin1 decoder implementation
On 2012/11/17 9:56, Philippe Verdy wrote: True. HTML5 makes its own reinterpretation of the IETF's MIME standard, definining it own protocol (which means that it is no longer fully compatible with MIME and its IANA datatabase, because the mapping of the value of a charset= pseudo-attribute is not directly to the IETF MIME standard, but to a newer range of W3C standards). There was a clear desire from the W3C to deprecate the use of the MIME standard and its IANA database in HTML, to simplify the implementations There is no need to deprecate the use of MIME in order to simplify implementations. No MIME-compatible implementation is required to accept and understand all charsets defined in the IANA registry. There are numerous Mime types that restrict the number of possible character encodings to a small set, or only require implementation of very few of them (XML would be a typical example). (also to avoid the many incompatibilities that have occured in the past with MIME charsets between the implementations). That's the main motivation. One browser started to accept data in a form that it shouldn't have accepted. Sloppy content producers started to rely on this. Because the browser in question was the dominant browser, other browsers had to try and re-engineer and follow that browser, or just be ignored. The Encoding Spec is an attempt, hopefully successful, to limit these incompatibilities to those that exist today, and not let them increase further. Note also that the W3C does not automatically endorses the Unicode and ISO/IEC 10646 standards as well (there's a delay before accepting newer releases of TUS and ISO/IEC 10646, and the W3C frequently adds now several restrictions). Can you give examples? As far as I'm aware, the W3C has always tried to make sure that e.g. new characters encoded in Unicode can be used as soon as possible. There are some cases where this has been missed in the past (e.g. XML naming rules), but where corrective action has been taken. Regards, Martin. 2012/11/17 Doug Ewelld...@ewellic.org If he is targeting HTML5, then none of this matters, because HTML5 says that ISO 8859-1 is really Windows-1252. For example, there is no C1 control called NL in Windows-1252. There is only 0x85, which maps to U+2026 HORIZONTAL ELLIPSIS. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell From: Philippe Verdy Sent: Friday, November 16, 2012 17:35 To: Whistler, Ken Cc: Buck Golemon ; unicode@unicode.org Subject: Re: latin1 decoder implementation In fact not really, because Unicode DOES assign more precise semantics to a few of these controls, notably for those given whitespace and newline properties (notably TAB, LF, CR in C0 controls and NL in C1 controls, with a few additional constraints for the CR+LF sequence) as they are part of almost all plain text protocols ; NUL also has a specific behavior which is so common that it cannot be mapped to anything else than a terminator or separator of plain text sequences. So even if the ISO/IEC 8859 standard does not specify a charecter mapping in C0 and C1 controls, the registered MIME types are doing so (but nothing is well defined for the C0 and C1 controls except NUL, TAB, CR, LF, NL, for MIME usages purpose). 
And then yes, the ISO/IEC 8859 standard is different (more restrictive) from the MIME charsets defined by the IETF in some RFC's (and registered in the IANA registry), simply because the ISO/IEC standard (encoded charset) was developed to be compatible with various encoding schemes, some of them defined by ISO, some others defined by other standard European or East-Asian bodies (including 7-bit schemes, using escape sequences, or shift in/out controls). By itself, the ISO/IEC 8859 is not a complete encoding scheme, it is just defining several encoded character sets, independantly of the encoding schme used to store or transport it (it is not even sufficient to represent any plain-text content). On the opposite, The MIME charsets named ISO_8859-* registered by the IETF in the IANA registry are concrete encoding schemes, based on the ISO/IEC 8859 standard, and suitable for representing a plain-text content, because the MIME charsets are also adding a text presentation protocol. In practice, almost nobody today uses the ISO/IEC 8859 standard alone : there's always an additional concrete protocol added on top of it (which generally makes use of the C0 and C1 controls, but not necessarily, and not always the same way). So plain-text documents never use the ISO/IEC 8859 standard, but the MIME charsets (plus a few specific or proprietary charsets that have not been registered in the IANA registry as they are bound to a non-open protocol).
Re: cp1252 decoder implementation
On 2012/11/21 16:23, Peter Krefting wrote: Doug Ewell d...@ewellic.org: Somewhat off-topic, I find it amusing that tolerance of poorly encoded input is considered justification for changing the underlying standards, The encoding work at W3C, at least as far as I see it, is not an attempt to redefine e.g. iso-8859-1 itself. To be blunt, it's just to make clear that lots of Web pages out there are lying, and help browsers detect this in a uniform way. This does not mean that all other software has to do the same. Real ISO-8859-1 will still be treated correctly by browsers. When you create a Web page, if it's really iso-8859-1, then label it as such, but when it's actually windows-1252, then label it as such. And make sure it doesn't contain any undefined (or C1) codepoints. That way, it will interoperate not only with browsers, but also with other software. Also, if you write any kind of tool, feel free to use the narrower (real) definition, and to throw up errors. There are very few tools that have to accept as wide a range of data, and not throw an error, as browsers do. when Internet Explorer has been flamed for years and years for tolerating bad input. It's called adapting to reality, unfortunately. There are *a lot* of documents on the web labelled as being iso-8859-1 and/or not labelled at all, which are using characters from the 1252 codepage. And since using the 1252 codepage to decode proper iso-8859-1 HTML documents does not hurt anyone (as HTML up to version 4 explicitly forbids the use of the control codes in the 0x80-0x9F range), that is what everyone does. One browser started to accept data in a form that it shouldn't have accepted. Sloppy content producers started to rely on this. Because the browser in question was the dominant browser, other browsers had to try and re-engineer and follow that browser, or just be ignored. Evidently it's OK if W3C or Python does it, but not if Microsoft does it. Don't blame Microsoft here, it was Netscape (on Windows) that started it, by just mapping the iso-8859-1 input data to a windows-1252 encoded font output. The same pages that would work fine on Windows would show garbage on Unix, until it was patched to also display it as codepage 1252. Internet Explorer wasn't even published when this happened, and I can't remember now whether the first versions of it actually did this, or if it was bolted on later. Thanks for this correction. Because it was windows-1252, I had assumed it was Microsoft. Regards, Martin.
Why 17 planes? (was: Re: Why 11 planes?)
Well, first, it is 17 planes (or have we switched to using hexadecimal numbers on the Unicode list already?). Second, of course this is in connection with UTF-16. I wasn't involved when UTF-16 was created, but it must have become clear that 2^16 (^ denotes exponentiation (to the power of)) codepoints (UCS-2) wasn't going to be sufficient. Assuming a surrogate-like extension mechanism, with high surrogates and low surrogates separated for easier synchronization, one needs 2 * 2^n surrogate-like codepoints to create 2^(2*n) new codepoints. For doubling the number of codepoints (i.e. a total of 2 planes), one would use n=8, and so one needs 512 surrogate-like codepoints. With n=9, one gets 4 more planes for a total of 5 planes, and needs 1024 surrogate-like codepoints. With n=10, one gets 16 more planes (for the current total of 17), but needs 2048 surrogate codepoints. With n=11, one would get 64 more planes for a total of 65 planes, but would need 4096 codepoints. And so on. My guess is that when this was considered, 1,048,576 codepoints was thought to be more than enough, and giving up 4096 codepoints in the BMP was no longer possible. As an additional benefit, the 17 planes fit nicely into 4 bytes in UTF-8. Regards, Martin. On 2012/11/26 19:47, Shriramana Sharma wrote: I'm sorry if this info is already in the Unicode website or book, but I searched and couldn't find it in a hurry. When extending beyond the BMP and the maximum range of 16-bit codepoints, why was it chosen to go up to 10 and not any more or less? Wouldn't F have been the next logical stop beyond , even if FF (or ) is considered too big? (I mean, I'm not sure how that extra 64Ki chars [10 minus F] could be important...) Thanks.
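The arithmetic above as a small Python script, using the same formula: 2 * 2^n surrogate-like code points buy 2^(2n) new code points, i.e. 2^(2n) / 65536 extra planes beyond the BMP.

# Surrogate budget versus number of planes, per the formula above.
for n in (8, 9, 10, 11):
    surrogates = 2 * 2 ** n            # one high range plus one low range
    new_codepoints = 2 ** (2 * n)      # every high paired with every low
    planes = 1 + new_codepoints // 0x10000   # BMP plus supplementary planes
    print(f'n={n:2}: {surrogates:4} surrogates -> {planes:3} planes in total')

# n= 8:  512 surrogates ->   2 planes in total
# n= 9: 1024 surrogates ->   5 planes in total
# n=10: 2048 surrogates ->  17 planes in total
# n=11: 4096 surrogates ->  65 planes in total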
Re: cp1252 decoder implementation
On 2012/11/17 12:54, Buck Golemon wrote: On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewelld...@ewellic.org wrote: Buck Golemon wrote: Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to map it to the equally-non-semantic U+81 ? U+0081 (there are always at least four digits in this notation) just by chance doesn't have any definition. But if we take the next of the holes in windows-1252, 0x8D, we get REVERSE LINE FEED. This isn't exactly non-semantic (although of course browsers and quite a bit of other software ignore that meaning). Why do you make this conditional on targeting html5? To me, replacement and error is out because it means the system loses data or completely fails where it used to succeed. There are cases where one wants to avoid as many failures as possible, at the cost of GIGO (garbage in, garbage out). Browsers are definitely in that category. There are other cases where one wants to catch garbage early, and not let it pollute the rest of the data. Currently there's no reasonable way for me to implement the U+0081 option other than inventing a new cp1252+latin1 codec, which seems undesirable. Well, the above two cases cannot be met with one and the same codec (unless, of course, there are additional options that allow switching between one and the other). I feel like you skipped a step. The byte is 0x81 full stop. I agree that it doesn't matter how it's defined in latin1 (also it's not defined in latin1). The section of the unicode standard that says control codes are equal to their unicode characters doesn't mention latin1. Should it? I was under the impression that it meant any single-byte encoding, since it goes out of its way to talk about 8-bit control codes. I'd say it intends to apply to any single-byte encoding with a full C1 range, or in other words, any single-byte encoding conforming to the ISO C0/G0/C1/G1 model (which is used, if not outright defined, in ISO 2022). So that would include any encoding of the ISO-8859-X family, but not windows- or macintosh encodings. In other words, the C1 range isn't just a dumping ground for cases where the conversion would fail otherwise. Regards, Martin.
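How the two readings differ in practice can be shown with a short Python sketch: Python's own cp1252 codec treats the five 'holes' as errors under strict decoding, whereas the browser behavior described in the WHATWG Encoding specification lets them fall through to the C1 code points with the same values (U+0081, U+008D REVERSE LINE FEED, and so on).

HOLES = b'\x81\x8d\x8f\x90\x9d'   # the five unassigned bytes of windows-1252

try:
    HOLES.decode('cp1252')        # strict decoding rejects them
except UnicodeDecodeError as e:
    print('strict cp1252:', e.reason)

# Browser-style fallback: the holes pass through to the C1 controls with
# the same value; 'latin-1' gives exactly that byte-value mapping.
print([hex(ord(c)) for c in HOLES.decode('latin-1')])
# ['0x81', '0x8d', '0x8f', '0x90', '0x9d']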
Re: Why 17 planes?
To this, my mother would say: Why keep it simple when we can make it complicated?. Regards,Martin. On 2012/11/27 21:01, Philippe Verdy wrote: That's a valid computation if the extension was limited to use only 2-surrogate encodings for supplementary planes. If we could use 3-surrogate encodings, you'd need 3*2ˆn surrogates to encode 2^(3*n) new codepoints. With n=10 (like today), this requires a total of 3072 surrogates, and you encode 2^30 new codepoints. This is still possible today, even if the BMP is almost full and won't allow a new range of 1024 surrogates: you can still use 2 existing surrogates to encode 2048 hyper-surrogates in the special plane 16 (or for private use in the private planes 14 and 15), which will combine with the existing low surrogates in the BMP.
Tool to convert characters to character names
I'm looking for a (preferably online) tool that converts Unicode characters to Unicode character names. Richard Ishida's tools (http://rishida.net/tools/conversion/) do a lot of conversions, but not names. Regards, Martin.
Re: Character name translations
On 2012/12/21 0:59, Asmus Freytag wrote: There have been efforts at a Japanese translation of the text of the standard, I have no idea whether that contains translated names for characters. JIS X 0221-1995, which is a translation of ISO 10646, contains some Japanese character names, but this is mostly limited to Japanese (i.e. those that appear in the original Japanese JIS X0208) symbols and punctuations, and sometimes there are two names for a single character. I don't know about newer translations. Regards, Martin.
Re: Why is endianness relevant when storing data on disks but not when in memory?
On 2013/01/06 7:21, Costello, Roger L. wrote: Does this mean that when exchanging Unicode data across the Internet the endianness is not relevant? Are these stated correctly: When Unicode data is in a file we would say, for example, The file contains UTF-32BE data. When Unicode data is in memory we would say, There is UTF-32 data in memory. When Unicode data is sent across the Internet we would say, The UTF-32 data was sent across the Internet. The first is correct. The second is correct. The third is wrong. The Internet deals with data as a series of bytes, and by its nature has to pass data between big-endian and little-endian machines. Therefore, endianness is very important on the Internet. So you would say: The UTF-32BE data was sent across the Internet. Actually, as far as I'm aware of, the labels UTF-16BE and UTF-16LE were first defined in the IETF, see http://tools.ietf.org/html/rfc2781#appendix-A.1. Because of this, Internet protocols mostly prefer UTF-8 over UTF-16 (or UTF-32), and actual data is also heavily UTF-8. So it would be better to say: When Unicode data is sent across the Internet we would say, The UTF-8 data was sent across the Internet. Regards, Martin.
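A byte-level illustration in Python of why the BE/LE distinction matters on the wire; the unqualified 'utf-32' codec prepends a BOM and uses the platform's byte order (little-endian in the example output shown).

s = 'A'   # U+0041

print(s.encode('utf-32-be'))   # b'\x00\x00\x00A'
print(s.encode('utf-32-le'))   # b'A\x00\x00\x00'
print(s.encode('utf-32'))      # b'\xff\xfe\x00\x00A\x00\x00\x00' on a little-endian machine
print(s.encode('utf-8'))       # b'A' -- no byte order issue at all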
Re: What does it mean to not be a valid string in Unicode?
On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. Regards, Martin.
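A sketch of that reading convention in Python: iterate over 16-bit code units, combine well-formed surrogate pairs, and pass any unpaired surrogate through as its own code point instead of raising. This mirrors the lenient (GIGO-tolerant) processing described above, not UTF-16 validation.

def code_points(units):
    # units: a sequence of 16-bit code units (integers 0x0000-0xFFFF)
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u        # includes lone surrogates, returned as themselves
            i += 1

print([hex(cp) for cp in code_points([0x0041, 0xD800, 0xDC00, 0xDC00])])
# ['0x41', '0x10000', '0xdc00'] -- the trailing low surrogate is unpaired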
Re: What does it mean to not be a valid string in Unicode?
On 2013/01/08 14:43, Stephan Stiller wrote: Wouldn't the clean way be to ensure valid strings (only) when they're built Of course, the earlier erroneous data gets caught, the better. The problem is that error checking is expensive, both in lines of code and in execution time (I think there is data showing that in any real-life programs, more than 50% or 80% or so is error checking, but I forgot the details). So indeed as Ken has explained with a very good example, it doesn't make sense to check at every corner. and then make sure that string algorithms (only) preserve well-formedness of input? Perhaps this is how the system grew, but it seems to be that it's yet another legacy of C pointer arithmetic and about convenience of implementation rather than a safety or performance issue. Convenience of implementation is an important aspect in programming. Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Sorry, but I have to disagree here. If a list of strings contains items with lone surrogates (garbage), then sorting them doesn't make the garbage go away, even if the items may be sorted in correct order according to some criterion. Regards, Martin.
Re: Normalization rate on the Web
On 2013/01/22 1:12, Denis Jacquerye wrote: Does anybody have any idea of how much of the Web is normalized in NFC or NFD? Or how much not normalized? I have never measured this. But at one time, there was only NFD (and NFKD). The Unicode Consortium, with input from W3C, then defined NFC (and NFKC) to be much closer to the actual encodings used on the Web. So in some sense, Web Content is (mostly) NFC *by design*. Regards,Martin. How would one find out or try to make a smart guess? I know a lot of library catalogue data is in NFD or somewhat decomposed. Is there any other field that heavily uses decomposition? -- Denis Moyogo Jacquerye African Network for Localisation http://www.africanlocalisation.net/ Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/ DejaVu fonts --- http://www.dejavu-fonts.org/
Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?
Hello Roger, The answer to your question below is a very clear NO. The reason is that most text is already in NFC. In fact, as I wrote a few days or weeks ago, NFC was defined to capture what's usually around on the Web (and in other places, too). Trying to recommend that everything be in NFD when more than 99% is already in NFC, and that won't change any time soon, just doesn't make sense. Also, most of the statements you have below need more qualifiers. For example, only a very, very small minority of people ever needs to input all possible composed characters (and on top of that, some clever software can do the normalization to NFC while the input is happening). Regards,Martin. On 2013/02/03 22:27, Costello, Roger L. wrote: Hi Folks, Thank you for your excellent responses. Based on your responses, I now wonder why the W3C recommends NFC be used for text exchanges over the Internet. Aside from the size advantage of NFC, there seem to be tremendous advantages to using NFD: - It’s easier to do searches and other text processing on NFD-encoded text. - NFD makes the regular expressions used to qualify its contents much, *much* simpler. - Things like fuzzy text matching are probably easier in NFD. - It’s easier to remember a handful of useful composing accents than the much larger number of combined forms. - It is easier to use a few keystrokes for combining accents than to set up compose key sequences for all the possible composed characters. - Some Unicode-defined processes, such as capitalization, are not guaranteed to preserve normalization forms. - Some operating systems store filenames in NFD encoding. The W3C is currently updating their recommendations [1]: This version of this document was published to indicate the Internationalization Core Working Group's intention to substantially alter or replace the recommendations found here with very different recommendations in the near future. Would you recommend that the W3C change their recommendation from: Use NFC when exchanging text over the Internet. to: Use NFD when exchanging text over the Internet. Would that be your recommendation to the W3C? /Roger [1] http://www.w3.org/TR/charmod-norm/
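For concreteness, a short Python sketch with the standard unicodedata module, showing the same word in the two forms; as argued above, the shorter NFC form is what most content on the Web already uses.

import unicodedata

decomposed = 'Du\u0308rst'                       # 'u' followed by COMBINING DIAERESIS
composed = unicodedata.normalize('NFC', decomposed)

print(composed, len(decomposed), len(composed))  # Dürst 6 5
print(composed == 'D\u00fcrst')                  # True: NFC uses the precomposed U+00FC
print(unicodedata.normalize('NFD', composed) == decomposed)   # True: NFD round-trips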
Re: Why wasn't it possible to encode a coeng-like joiner for Tibetan?
On 2013/04/11 16:30, Michael Everson wrote: On 11 Apr 2013, at 00:09, Shriramana Sharmasamj...@gmail.com wrote: Or was the Khmer model of an invisible joiner a *later* bright idea? Yes. Later, yes. Bright? Most Cambodian experts disagree. Regards, Martin.
Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)
On 2013/04/23 18:01, William_J_G Overington wrote: On Monday 22 April 2013, Asmus Freytagasm...@ix.netcom.com wrote: I'm always suspicious if someone wants to discuss scope of the standard before demonstrating a compelling case on the merits of wide-spread actual use. The reason that I want to discuss the scope is because there is uncertainty. If people are going to spend a lot of time and effort in the research and development of a system whether the effort would all be wasted if the system, no matter how good and no matter how useful were to come to nothing because it would be said that encoding such a system in Unicode would be out of scope. [I'm just hoping this discussion will go away soon.] You can develop such a system without using the private use area. Just make little pictures out of your characters, and everybody can include them in a Web page or an office document, print them, and so on. The fact that computers now handle text doesn't mean that text is the only thing computers can handle. Once you have shown that your little pictures are widely used as if they were characters, then you have a good case for encoding. This is how many symbols got encoded; you can check all the documentation that is now public. A ruling that such a system, if developed and shown to be useful, would be within scope for encoding in Unicode would allow people to research and develop the system with the knowledge that there will be a clear pathway of opportunity ahead if the research and development leads to good results. As far as I know, the Unicode consortium doesn't rule on eventualities. So, I feel that wanting to discuss the scope of Unicode so as to clear away uncertainty that may be blocking progress in research and development is a straightforward and reasonable thing to do. The main blocking factor is the (limited) usefulness of your ideas. In case that's ever solved, the rest will be comparatively easy. Regards, Martin.
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On 2013/06/22 0:32, Michael Everson wrote: On 21 Jun 2013, at 16:20, Khaled Hosnykhaledho...@eglug.org wrote: Yeah, I don't believe that you can language-tag individual file names for such display as that is markup. Why do you need to? You only need one language, it is not like file names are multilingual high quality text books where every fine typographic detail for each language have to be respected. I expect my Latvian filenames to appear as Latvian, and my Marshallese filenames to appear as Marshallese. The fact that the encoding was screwed up in the 1990s should not oblige compromise on that -- and that is not fine typographic detail. Quite a few people might expect their Japanese filenames to appear with a Japanese font/with Japanese glyph variants, and their Chinese filenames to appear with a Chinese font/Chinese glyph variants. But that's never how this was planned, and that's not how it works today. And it's a pretty easy guess that there are quite a few more users with Japanese and Chinese filenames in the same file system than users with Latvian and Marshallese filenames in the same file system, both because both Chinese and Japanese are used by many more people than Latvian or Marshallese and because China and Japan are much closer than Latvia and the Marshall Islands. Regards, Martin. Only the language that the user care about matters, and this can be easily inferred from the system locale, and passed down to the text rendering stack. For the monolingual user. Michael Everson * http://www.evertype.com/
Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)
On 2013/07/05 16:04, Denis Jacquerye wrote: On Thu, Jul 4, 2013 at 12:07 PM, Michael Eversonever...@evertype.com wrote: The problem is in pretending that a cedilla and a comma below are equivalent because in some script fonts in France or Turkey routinely write some sort of undifferentiated tick for ç. :-) Can we make sure we have covered this from the other side? Are there any languages where there is a letter where both the form with a cedilla and the form with a comma below are used, and are distinguished? In other words, are there any languages where a user seeing a wrong form would be confused as to what the letter is, rather than being potentially surprised or annoyed at the details of the shape? Regards, Martin.
Re: writing in an alphabet with fewer letters: letter replacements
On 2013/07/05 17:25, Stephan Stiller wrote: What I had in mind was more specific: Germans are supposed to convert [ä,ö,ü,ß] to [ae,oe,ue,ss], though I don't know what's considered best/legal wrt documents required for entering the US, for example. I have always used Duerst on plane tickets and the like. On the customs form that you have to fill in when entering the US (the green one), I always just write Dürst; paper is patient. I have added Durst as an additional alternate spelling on a long-term visa application form once, just in case. My impression is that US customs officials are either quite knowledgeable or quite tolerant on such issues (or a mixture of both). The same applies to customs officials in other countries I have traveled to, and other people at airports and such. I guess they get used to these cases quite quickly, seeing so many passports each day. Regards, Martin.
Re: COMBINING OVER MARK?
On 2013/10/02 9:52, Leo Broukhis wrote: Thanks! That comes out exactly right, although using math markup for linguistic purposes is, IMO, a stretch. Why? Surely, as in other fields (math to start with), there is a boundary somewhere between plain text and rich text. Of course it's not always easy to agree on the exact place of the boundary, but in general, most people would agree it's there. Regards, Martin. Leo
On Tue, Oct 1, 2013 at 5:24 PM, Mark E. Shoulson <m...@kli.org> wrote: With MathML, you could use:
anathemati<math><mmultiscripts><none/><mi mathvariant=roman>s</mi><mi mathvariant=roman>z</mi></mmultiscripts></math>
(drop that in an HTML document and take a look). This doesn't look like plain text to me. I don't think it argues in favor of any sort of combining Z or general combinator mark. This is just what markup is for. ~mark
On 10/01/2013 08:05 PM, Leo Broukhis wrote: If my understanding of interlinear annotations is correct, to achieve similarity with the attached sample some markup will be required as well: anathemati<sup>U+FFF9 z U+FFFA s U+FFFB</sup>e. Leo
On Tue, Oct 1, 2013 at 3:51 PM, Jean-François Colson <j...@colson.eu> wrote: On 01/10/13 15:39, Philippe Verdy wrote: In plain text, we would just use the [s|z] notation without caring about the presentation font sizes used in the rendered rich-text page. It correctly represents the intended alternation without giving more importance to one base letter. But if you wanted to allow plain text search with collators, you would need to choose one as the base letter and the other one as a combining diacritic with ignored higher-level differences, using either US English or British/International English to fix the base letter (the other letter would be an interlinear annotation for the second orthography, either above or below the base letter). Interlinear annotation… Yes, of course, you could write anathemati U+FFF9 z U+FFFA s U+FFFB e. Alas, the characters U+FFF9 INTERLINEAR ANNOTATION ANCHOR, U+FFFA INTERLINEAR ANNOTATION SEPARATOR and U+FFFB INTERLINEAR ANNOTATION TERMINATOR are not supported by any software I know.
2013/10/1 Steffen Daode <sdao...@gmail.com>: Khaled Hosny <khaledho...@eglug.org> wrote: Using TeX:
\def\s{${}^{\rm s}_{\rm z}$}
Using groff:
#!/bin/sh -
cat <<\! > t.tr
.de zs
. nr #1 \\w'z'
\\Z'\
\\v'-.25v's\
\\h'-\\n(#1u'\
\\v'.5v'z\
'\
\\h'\\n(#1u'
. rr #1
..
Fraterni
.zs
e.
!
groff t.tr > t.ps
ps2pdf t.ps
rm t.tr t.ps
exit 0
(Can surely be tweaked.) Regards, Khaled. Ciao, --steffen
-- Forwarded message -- From: Khaled Hosny <khaledho...@eglug.org> To: Leo Broukhis <l...@mailcom.com> Cc: Unicode Discussion <unicode@unicode.org> Date: Tue, 1 Oct 2013 11:09:31 +0200 Subject: Re: COMBINING OVER MARK? On Mon, Sep 30, 2013 at 05:51:09PM -0700, Leo Broukhis wrote: Hi All, Attached is a part of page 36 of Henry Alford's *The Queen's English: a manual of idiom and usage (1888)* [http://archive.org/details/queensenglishman00alfo]. Is the way to indicate alternative s/z spellings used there plain text (arguably, if it can be done with a typewriter, it is plain text) I see a typeset book, not an output of a typewriter. or rich text (ignoring the font size of letters s and z)? If it's the latter, what's the markup to achieve it? Using TeX:
\def\s{${}^{\rm s}_{\rm z}$}
49. How are we to decide between {\it s} and {\it z} in such words as anathemati\s{}e, cauteri\s{}e, criti\-ci\s{}e, deodori\s{}e, dogmati\s{}e, fraterni\s{}e, and the rest? Many of these are derived from Greek
\bye
Regards, Khaled
Re: ¥ instead of \
On 2013/10/23 4:22, Asmus Freytag wrote: On 10/22/2013 11:38 AM, Jean-François Colson wrote: Hello. I know that in some Japanese encodings (JIS, EUC), \ was replaced by a ¥. On my computer, there are some Japanese fonts where the characters seem to be coded following Unicode, except for the \, which remained a ¥. Yes. I'm using a Japanese Windows 7, and I can't distinguish the two glyphs in your message (and won't use either of them). Is that acceptable from a Unicode point of view? Are such fonts considered Unicode compliant? It's one of those things where there isn't a clean solution that's also backwards compatible. One idea that I have been floating for years already is that Microsoft, with each new release of Windows (and other vendors too, of course), tweak the Yen glyph in the respective fonts to lose more and more of its horizontal bars and the upper right part of the Y, and slant the lower part of the Y more and more. That would put pressure on applications (mostly financial) that still use U+005C with Yen semantics, and help Japanese programmers move away from seeing a Yen symbol where they should see a backslash. There are enough replacements for the Yen symbol. The usual (i.e. 'half-width') one is at U+00A5, which came into Unicode from ISO-8859-1 (interesting to note that the Yen sign appears in a rather constrained Western-European encoding). There's also a full-width variant, U+FFE5. One thing that I have never checked personally, but which I heard from a former colleague who knew a lot of character encoding trivia and oddities, is that (at least at some point a few years ago) Japanese MS Word would change U+00A5 to U+005C without asking the user. Possibly the idea was that this way, the data could be more easily converted back from Unicode to Shift_JIS. But in terms of moving away from using U+005C with a Yen glyph, it was definitely counterproductive. Regards, Martin.
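A small Python 3 sketch (standard library only, not taken from the thread above) of where the ambiguity comes from: Shift_JIS data uses the byte 0x5C, which decodes to U+005C REVERSE SOLIDUS even though Japanese fonts and legacy practice display that code point as a Yen sign, while the unambiguous Yen characters are U+00A5 and U+FFE5:

# In JIS X 0201 the byte 0x5C is the Yen sign, but the shift_jis codec
# (like Unicode itself) maps that byte to U+005C REVERSE SOLIDUS.
text = b"\x5c100".decode("shift_jis")    # legacy data meaning "¥100"
print(repr(text))                        # '\\100'

yen = "\u00a5"                           # YEN SIGN
fullwidth_yen = "\uffe5"                 # FULLWIDTH YEN SIGN
print(yen + "100", fullwidth_yen + "100")

# Recovering real Yen semantics means rewriting the code point itself:
print(text.replace("\u005c", "\u00a5"))  # '¥100'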
Re: Request for review: 3023bis (XML media types) makes significant changes
Hello Henry, Some comments on your specific questions, which may trigger some additional discussion. On 2013/12/12 1:43, Henry S. Thompson wrote: I'm one of the editors of a proposed replacement for RFC3023 [1], the media type registration for application/xml, text/xml and 3 others. The draft replacement [2] includes several significant changes in the handling of information about character encoding: * In cases where conflicting information is supplied (from charset param, BOM and/or XML encoding declaration) it gives a BOM, if present, authoritative status; I'm a bit uneasy about the fact that we now have BOM (internal) - charset (external) - encoding (internal), i.e. internal-external-internal, but I guess there is lots of experience in HTML 5 for giving the BOM precedence. Also, it will be extremely rare to have something that looks like a BOM but isn't, and this, combined with the fact that XML balks on encoding errors, should make things quite robust. * It recommends against the use of UTF-32. UTF-32 has some (limited) appeal for internal representation, but none really on the network, and media types are for network interchange, so this should be fine, too. Regards, Martin. The interoperability situation in this space is currently poor, with some tools treating a charset parameter as authoritative, but the HTML 5 spec and most browsers preferring the BOM. The goal of the draft is to specify an approach which will promote convergence, while minimising the risk of damage from backward incompatibilities. Since these changes overlap with a wide range of technologies, I'm seeking review outside the relevant IETF mailing list (apps-disc...@ietf.org) -- please take a look if you can, particularly at Section 3 [3] and Appendix C [4]. Thanks, ht [1] http://tools.ietf.org/html/rfc3023 [2] http://tools.ietf.org/html/draft-ietf-appsawg-xml-mediatypes-06 [3] http://tools.ietf.org/html/draft-ietf-appsawg-xml-mediatypes-06#section-3 [4] http://tools.ietf.org/html/draft-ietf-appsawg-xml-mediatypes-06#appendix-C
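A minimal sketch of the precedence being discussed (BOM first, then the charset parameter, then the XML encoding declaration). This is an illustration in Python with made-up helper names, not text from the draft:

import re

BOMS = {
    b"\xef\xbb\xbf": "utf-8",
    b"\xff\xfe": "utf-16-le",
    b"\xfe\xff": "utf-16-be",
}

def sniff_xml_encoding(body, charset_param=None):
    # BOM wins if present.
    for bom, enc in BOMS.items():
        if body.startswith(bom):
            return enc
    # Then the media type's charset parameter.
    if charset_param:
        return charset_param
    # Then the XML encoding declaration, if any.
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', body)
    if m:
        return m.group(1).decode("ascii")
    return "utf-8"  # XML's default in the absence of other information

print(sniff_xml_encoding(b'\xef\xbb\xbf<?xml version="1.0"?>', "iso-8859-1"))
# -> utf-8 (the BOM overrides the conflicting charset parameter)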
Fwd: Updated Japanese Legacy Standard? (was: Re: Romanized Singhala got great reception in Sri Lanka)
I got informed today by your IT Dept. that the mail below never went out. Resent herewith. Martin. Original Message Subject: Updated Japanese Legacy Standard? (was: Re: Romanized Singhala got great reception in Sri Lanka) Date: Mon, 17 Mar 2014 12:32:15 +0900 From: Martin J. Dürst due...@it.aoyama.ac.jp On 2014/03/16 14:36, Philippe Verdy wrote: You may still want to promote it at some government or education institution, in order to promote it as a national standard, except that there's little chance that it will ever happen when all countries in ISO have stopped working on standardization of new 8-bit encodings (only a few are still maintained, but these are the most complex ones, used in China and Japan). Well, in fact only Japan now seems to be actively updating its legacy JIS standard, but only with the focus of converging it to use the UCS and solving ambiguities or some technical problems (e.g. with emojis used by mobile phone operators). Even China stopped updating its national standard by publishing a final mapping table to/from the full UCS (including for characters still not encoded in the UCS): this simplified the work because only one standard needs to be maintained instead of 2. I'm not aware of any activity in Japan regarding the update of legacy character encodings. Can you tell me what you mean by "actively updating"? Regards, Martin.
Fwd: Re: Romanized Singhala got great reception in Sri Lanka
I got informed today by your IT Dept. that the mail below never went out. Resent herewith. Martin. Original Message Subject: Re: Romanized Singhala got great reception in Sri Lanka Date: Mon, 17 Mar 2014 14:37:00 +0900 From: Martin J. Dürst due...@it.aoyama.ac.jp On 2014/03/17 13:16, Jean-François Colson wrote: As for Japanese (and also for Indic), I have read the warnings in RFC 1815: http://tools.ietf.org/rfc/rfc1815.txt RFC 1815 Character Sets ISO-10646 and ISO-10646-J-1 July 1995 July 1995… Is that document up-to-date? No, it's not. Not at all. It was outdated when it was published, and expresses only the opinions of the author (who was well known for not liking, and not very well understanding, Unicode). It's labeled as Informational, which means it is not in any way part of an IETF Standard/specification. Even April 1st RFCs are classified as Informational. The charset label ISO-10646-J-1 it defines is listed at http://www.iana.org/assignments/character-sets/character-sets.xhtml, but I don't think that there is any major conversion library that supports it. The same goes for what RFC 1815 labels as ISO-10646, which appears as ISO-10646-Unicode-Latin1 in the IANA registry (because simply using ISO-10646 for this would be strongly misleading). Regards, Martin.
Re: FYI: More emoji from Chrome
Now that it's no longer April 1st (at least not here in Japan), I can add a (moderately) serious comment. On 2014/04/02 01:43, Ilya Zakharevich wrote: On Tue, Apr 01, 2014 at 09:01:39AM +0200, Mark Davis ☕️ wrote: More emoji from Chrome: http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y I do not know… The demos leave me completely unimpressed: emoji — by their nature — require higher resolution than text, so an emoji for “pie” does not save any space compared to the word itself. So the impact of this on everyday English-language communication would not be in any way beneficial. This is somewhat different for Japanese (and languages with similar writing systems) because they have a higher line height. Regards, Martin.
Re: FYI: More emoji from Chrome
On 2014/04/02 20:08, Christopher Fynn wrote: On 02/04/2014, Asmus Freytag <asm...@ix.netcom.com> wrote: On 4/2/2014 1:42 AM, Christopher Fynn wrote: Rather than Emoji, it might be better if people learnt Han ideographs, which are also compact (and a far more developed system of communication than emoji). One CJK character can also easily replace dozens of Latin characters - which is what is being claimed for emoji. One wonders why the Japanese, who already know Han ideographs, took to emoji as they did. Perhaps because emoji are a sort of playful version of a means of communication they are already used to. Yes. Already used to the concept that a character can represent (more or less) a concept. Already used to the concept that there are lots of characters, and a few more won't make such a difference. Already used to the concept that character entry means keying a word or phrase and then selecting what you actually want. But I think the main reason for their spread was that the mobile phone companies introduced them and young people found them cute. As a followup: Line (http://line.me/en/), the most popular Japanese mobile messaging app (similar to WhatsApp), got popular mostly because of its gorgeous collection of 'stickers' (over 10,000), fortunately after realizing that the technically correct way to deal with them was not squeezing them into the PUA, but treating them as inline images, avoiding headaches down the line for the Unicode Consortium :-). Regards, Martin.
Re: Emoji
On 2014/04/03 02:00, James Lin wrote: Emoji or 顔文字, literally means Face word or Face Characters, essentially, Emoji is 絵文字 (picture character), 顔文字 is kaomoji (face character). Regards, Martin. provides an emotional state in the context of words. Emoji is very popular in APJ, and specially in Japan where most of your text will contain at least half a dozen Emoji characters. Remember, people in Japan spend more than half of their commute in the train, and there is no talking on the cellphone in the train, so most people text instead. Everyone can guess what the following emoji, used frequently in Japan, mean: ヽ( ̄д ̄;)ノ - worried ヾ(@゜▽゜@)ノ - happy ヽ(#`Д´)ノ - angry 【・_・?】- confused there is a lot more...
Re: Corrigendum #9
On 2014/06/03 07:08, Asmus Freytag wrote: On 6/2/2014 2:53 PM, Markus Scherer wrote: On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfil...@gmail.com> wrote: I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious bugs in some lamebrained utility. I don't expect handling these in web browsers and lamebrained utilities. I expect them to be treated like unassigned code points. Expecting them to be treated like unassigned code points shows that their use is a bad idea: Since when does the Unicode Consortium use unassigned code points (and the like) in plain sight? I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else. I have to fully agree with Asmus, Richard, Shawn and others that the use of noncharacters in CLDR is a very bad and dangerous example. However convenient the misuse of some of these code points in CLDR may be, it sets a very bad example for everybody else. Unicode itself should not just be twice as careful with the use of its own code points, but 10 times as careful. I'd strongly suggest that, completely independent of when and how Corrigendum #9 gets tweaked or fixed, a quick and firm plan gets worked out for how to get rid of these code points in CLDR data. The sooner, the better. Regards, Martin.
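For reference, the noncharacters in question form a small, fixed set that is easy to test for. A minimal Python sketch (the function name is mine, not from any Unicode API):

# The 66 Unicode noncharacters: U+FDD0..U+FDEF plus the last two code
# points of every plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF).
def is_noncharacter(cp):
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

assert is_noncharacter(0xFFFE)
assert is_noncharacter(0x10FFFF)
assert not is_noncharacter(0x10000)   # an ordinary code point
print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66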
Re: Request for Information
On 2014/07/24 15:37, Richard Wordingham wrote: No. The text samples I could find quickly show scripta continua, but I suspect the line breaks are occurring at word or syllable boundaries. If I am right about the constraint on line break position, then this can be recovered by marking the optional line breaks with ZWSP. In addition, the consonants should be reclassified from AL to SA. However, such a change would be incompatible with a modern writing system in which words are separated by spaces (if such exists). I don't know what happens in Indonesian schools, so I can't report an error. Scripta continua and non-scripta continua in the same script are incompatible in plain text. Shouldn't that be scripta non-continua ? Regards, Martin.
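As an illustration of what marking optional line breaks with ZWSP means in practice, here is a tiny Python sketch; the word segmentation is given by hand, standing in for a real dictionary-based segmenter:

# U+200B ZERO WIDTH SPACE marks a line break opportunity without
# displaying anything; a renderer that honours it can then break
# scripta continua text at word boundaries.
ZWSP = "\u200b"

def mark_breaks(words):
    return ZWSP.join(words)

# "sawatdi khrap" written in Thai, pre-segmented by hand:
thai_words = ["\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35", "\u0e04\u0e23\u0e31\u0e1a"]
print(mark_breaks(thai_words).encode("unicode_escape").decode("ascii"))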
Code charts and code points (was: Re: fonts for U7.0 scripts)
On 2014/10/24 10:21, Asmus Freytag wrote: Peter is correct. The only fonts that should be released to the public are those that are Unicode encoded and have the correct shaping tables. Unlike the public, the code chart editors for Unicode have tools that can correctly handle not only ASCII-hacked and PUA-assigned fonts, but also fonts that use the wrong Unicode encoding (because they were designed for an earlier draft with different code point assignments). These tools ignore all shaping tables, so the lack of such tables isn't an issue. The documents created by the code chart editors are not editable in the normal sense, so they can be published without causing problems, like establishing a de-facto encoding. They don't contain running text in these fonts, so there isn't an issue with search - the searchable contents are all character names, annotations, etc. in Latin letters and digits. Releasing such fonts to the public would establish a de-facto non-sanctioned encoding, because people could create (and interchange) running text using them. Hello Asmus, The code charts are published as PDFs. In general, text in PDFs can be copy-pasted elsewhere. Is there something in place that makes sure that wrong Unicode encodings for glyphs published in the code charts don't leak elsewhere? Regards, Martin.
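To make the concern concrete, here is a small Python sketch (operating on text already copied out of a PDF by whatever means; the function name and sample string are invented for the example) that flags code points one would not expect to leak from chart annotations, such as PUA or unassigned code points:

import unicodedata

def suspicious_code_points(text):
    # Flag PUA and unassigned code points that might have leaked out of
    # a chart PDF via copy and paste.
    for ch in text:
        cp = ord(ch)
        in_pua = (0xE000 <= cp <= 0xF8FF or 0xF0000 <= cp <= 0xFFFFD
                  or 0x100000 <= cp <= 0x10FFFD)
        unassigned = unicodedata.category(ch) == "Cn"
        if in_pua or unassigned:
            yield cp, "PUA" if in_pua else "unassigned"

pasted = "LATIN SMALL LETTER A \ue123"   # pretend this came from a chart PDF
print([(hex(cp), why) for cp, why in suspicious_code_points(pasted)])
# [('0xe123', 'PUA')]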
Re: emoji are clearly the current meme fad
On 2014/12/18 06:49, Michael Everson wrote: Clearly the plural of emoji is emojis. Not in Japanese, where there are no plural forms. The question of what it is/will be in English will be decided by usage, not by grammar. I'd use 'emoji', but then I'm too biased towards Japanese for my predictions to be relevant. Regards, Martin. On 16 Dec 2014, at 12:36, Asmus Freytag <asm...@ix.netcom.com> wrote: Everybody wants in on the act: http://mashable.com/2014/12/12/bill-nye-evolution-emoji/ A./ Michael Everson * http://www.evertype.com/
Re: Unicode encoding policy
On 2014/12/24 09:50, Tex Texin wrote: True, however as William points out, apparently the rules have changed, I hope the rules get clarified to clearly state that these are exceptions. so it isn’t unreasonable to ask again whether the rules now allow it, or if people that dismissed the idea in the past would now consider it. Personally, I think this is the wrong place for it, and as has been suggested numerous times, it makes sense to host the discussion elsewhere among interested parties. Although I am not interested in the general case, there is a need for specialized cases. Just as some road sign symbols are near universal, Actually not. I have been driving (and taking driver's licence tests) in Switzerland, Japan, and the US. There are lots of similarities, but it'd be difficult for me to come up with an example where they are all identical (up to glyph/design differences). Please see for yourself e.g. at: https://en.wikipedia.org/wiki/Road_signs_in_Switzerland http://www.japandriverslicense.com/japanese-road-signs.asp https://en.wikipedia.org/wiki/Road_signs_in_the_United_States In the US, there are also differences by state. there is a need for symbols for quick and universal communications in emergencies. Identifying places of safety or danger on a map, or for the injured to describe symptoms, pains, and the nature of their injury (or first aid workers to discuss victims’ issues), or to describe the nature of a calamity (fire, landslide, bomb, attack, etc.), etc. Such symbols mostly already exist. For a quick and easy introduction, see e.g. http://www.iso.org/iso/graphical-symbols_booklet.pdf. If use of such symbols is found in running text, or if there is a strong need to use them in running text, some of these might be added to Unicode in the future. But they wouldn't be things invented out of the blue for marketing purposes; they would be well established already. William, You might consider identifying where there are needs for such universal text, and working with groups that would benefit, to get support for universal text symbols. So the first order of business for William (or others) should be to investigate what's already around. Regards, Martin.
Re: Compatibility decomposition for Hebrew and Greek final letters
On 2015/02/20 05:17, Eli Zaretskii wrote: From: Philippe Verdy verd...@wanadoo.fr Date: Thu, 19 Feb 2015 20:31:07 +0100 Cc: Julian Bradfield jcb+unic...@inf.ed.ac.uk, unicode Unicode Discussion unicode@unicode.org The decompositions are not needed for plain text searches, which can use the collation data (with the collation data, you can unify at the primary level differences such as capitalisation and ignore diacritics, or transform some base groups of letters into a single entry, or make some significant primary difference when there are diacritics (for example in German, equating 'ae' and 'ä' at the primary level). Sorry, I disagree. First, collation data is overkill for search, since the order information is not required, so the weights are simply wasting storage. Second, people do want to find, e.g., ² when they search for 2, etc. I'm not saying that they _always_ want that, but sometimes they do. There's no reason a sophisticated text editor shouldn't support such a feature, under user control. Well, for cased scripts, search is usually case-insensitive, but case conversions aren't given by compatibility decompositions. If the question isn't "Why are there equivalences useful for search that are not covered by compatibility decompositions?", but "Why doesn't Unicode provide some data for final/non-final Hebrew letter correspondence?", maybe the answer is that it hasn't been seen as a need up to now because it's so easy to figure out. Regards, Martin.
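A quick Python illustration of the point that search equivalences such as case folding are separate from compatibility decompositions:

import unicodedata

# Final sigma has no compatibility decomposition, but it case-folds to sigma:
print(unicodedata.decomposition("\u03c2"))   # '' (none)
print("\u03c2".casefold() == "\u03c3")       # True

# Superscript two *does* have one, which is why searching "2" can find "²":
print(unicodedata.decomposition("\u00b2"))   # '<super> 0032'

# German sharp s is handled by case folding, not by decomposition:
print("stra\u00dfe".casefold())              # 'strasse'
print(unicodedata.decomposition("\u00df"))   # '' (none)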
Re: Compatibility decomposition for Hebrew and Greek final letters
On 2015/02/19 20:47, Julian Bradfield wrote: On 2015-02-19, Eli Zaretskii e...@gnu.org wrote: Does anyone know why the UCD defines compatibility decompositions for Arabic initial, medial, and final forms, but doesn't do the same for Hebrew final letters, like U+05DD HEBREW LETTER FINAL MEM? Or for that matter, for U+03C2 GREEK SMALL LETTER FINAL SIGMA? As far as I understand it: In Arabic, the variant of a letter is determined entirely by its position, so there is no compelling need to represent the forms separately (as characters rather than glyphs) save for the existence of legacy standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the forms would not have been encoded but for the legacy standards. Whereas in Hebrew, non-final forms appear finally in certain contexts in normal text; and in Greek, while Greek text may have a determinate choice between σ and ς, there are many contexts where the two symbols are distinguished (not least maths). Digging a bit deeper, the phenomenon of a letter changing shape depending on position is pervasive in Arabic, and involves complicated interdependencies across multiple characters in good-quality typography. But in Hebrew, this phenomenon is minor, and marginal in Greek, and typographic interactions are also very limited. That led to (after some initial tries with alternatives) different encoding models. In Arabic, shaping is the job of the rendering engine, whereas in Hebrew and Greek, it's part of the encoding. As for determinate choice between σ and ς, John Cowan once gave an example of a Greek word (composed of two original words) with a final sigma in the middle. Regards, Martin.
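The asymmetry described above is directly visible in the character properties, for example with Python's unicodedata module:

import unicodedata

# Arabic presentation forms carry compatibility decompositions back to the
# base letter; they were encoded only for round-tripping legacy standards:
print(unicodedata.name("\ufe8e"))           # ARABIC LETTER ALEF FINAL FORM
print(unicodedata.decomposition("\ufe8e"))  # '<final> 0627'

# Hebrew final mem is an ordinary letter with no decomposition at all:
print(unicodedata.name("\u05dd"))           # HEBREW LETTER FINAL MEM
print(unicodedata.decomposition("\u05dd"))  # ''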
Re: The NEW Keyboard Layout—IEAOU
What's better on this keyboard when compared to the Dvorak layout? At first sight, it looks heavily right-handed: all the letters that the Dvorak keyboard has on the home row are on the right hand. Regards, Martin. P.S.: I'm a happy Dvorak user. On 2015/01/26 06:54, Robert Wheelock wrote: Hello! I came up with a BRAND-NEW keyboard layout designed to make typing easier——named the IEAOU (ee-eh-ah-oh-oo) System—based on letter frequencies. The letters in the new IEAOU layout are arranged as follows:
(TOP): Digits / Punctuation / Accents
(MEDIAL): Q Y :|; W |' L N D T S H +|= \|!
(HOME): X K G F ´|` P I E A O U
(BOTTOM): C J Z V B M R |, |. ?|/
Please respond to air what you think of it. Thank You!
Re: Tag characters and in-line graphics (from Tag characters)
On 2015/06/04 17:03, Chris wrote: I wish Steve Jobs was here to give this lecture. Well, if Steve Jobs were still around, he could think about whether (and how many) users really want their private characters, and whether it was worth the time to have his engineers working on the solution. I'm not sure he would come to the same conclusion as you. This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. Now if the unicode consortium were to decide on standardising a technological process whereby rendering engines could seamlessly download representations of custom characters without user intervention, no doubt all the vendors would support it, and all the technical mumbo jumbo of installing privately agreed character sets would be something users could leave for the technology to sort out. You are right that it would be strictly technically possible. Not only that, it has been so for 10 or 20 years. As an example, in 1996 at the WWW Conference in Paris I was participating in a workshop on internationalization for the Web, and by chance I was sitting between the participant from Adobe and the participant from Microsoft. These were the main companies working on font technology at that time, and I asked them how small it would be possible to make a font for a single character using their technologies (the purpose of such a font, as people on this thread should be able to guess, would be as part of a solution to exchange single, user-defined characters). I don't even remember their answers. The important thing here is that the idea, and the technology, have been around for a long time. So why didn't it take off? Maybe the demand is just not as big as some contributors on this list claim. Also, maybe while the technology itself isn't rocket science, the responsible people at the relevant companies have enough experience with technology deployment to hold back. To give an example of why the deployment aspect is important, there were various Web-like hypertext technologies around when the Web took off in the 1990s. One of them was called Hyper-G. It was technologically 'better' than the Web, in that it avoided broken links. But it was much more difficult to deploy, and so it is forgotten, whereas the Web took off. Regards, Martin.
Re: Tag characters and in-line graphics (from Tag characters)
On 2015/06/03 07:55, Chris wrote: As you point out, “The UCS will not encode characters without a demonstrated usage.” But there are use cases for characters that don’t meet UCS’s criteria for a worldwide standard, but are necessary for more specific use cases, like specialised regional, business, or domain-specific situations. Unicode contains *a lot* of characters for specialized regional, business, or domain-specific situations. My question is, given that unicode can’t realistically (and doesn’t aim to) encode every possible symbol in the world, why shouldn’t there be an EXTENSIBLE method for encoding, so that people don’t have to totally rearchitect their computing universe because they want ONE non-standard character in their documents? As has been explained, there are technologies that allow you to do (more or less) that. Information technology, like many other technologies, works best when finding common cases used by many people. Let's look at some examples: Character encodings work best when they are used widely and uniformly. I don't know anybody who actually uses all the characters in Unicode (except the guys that work on the standard itself). So for each individual, a smaller set would be okay. And there were (and are) smaller sets, not for individuals, but for countries, regions, scripts, and so on. Originally (when memory was very limited), these legacy encodings were more efficient overall, but that's no longer the case. So everything is moving towards Unicode. Most Website creators don't use all the features in HTML5. So having different subsets for different use cases may seem to be convenient. But overall, it's much more efficient to have one Hypertext Markup Language, so that's where everybody is converging. From your viewpoint, it looks like having something in between character encodings and HTML is what you want. It would only contain the features you need, and nothing more, and would work in all the places you wanted it to work. Asmus's "inline text" may be something similar. The problem is that such an intermediate technology only makes sense if it covers the needs of lots and lots of people. It would add a third technology level (between plain text and marked-up text), which would divert energy from the current two levels and make things more complicated. Up to now, such a third level hasn't emerged, among other things because both existing technologies were good at absorbing the most important use cases from the middle. Unicode continues to encode whatever symbols gain reasonable popularity, so every time somebody has a really good use case for the middle layer with a symbol that isn't yet in Unicode, that use case gets taken away. HTML (or Web technology in general) also worked to improve the situation, with technologies such as SVG and Web Fonts. No technology is perfect, and so there are still some gaps between character encoding and markup, some of which may in due time eventually be filled up, but I don't think a third layer in the middle will emerge soon. Regards, Martin.
Re: International Register of Coded Character Sets
On 2015/06/22 05:37, Frédéric Grosshans wrote: I don't know if it's what you're looking for, but Google brought me to the following URL: https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf I managed to download the PDF without problems. I also successfully downloaded a standard ( http://www.itscj.ipsj.or.jp/iso-ir/169.pdf ) to check the URLs from the register. I was able to access https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/, but that just says "page not found" in Japanese. Same for https://www.itscj.ipsj.or.jp/ISO-IR/, http://www.itscj.ipsj.or.jp/ISO-IR/, and http://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ (the http versions redirect to the https versions). I left a note on their contact page (https://www.itscj.ipsj.or.jp/contact/index.html), in Japanese. I'll tell you when I hear back from them. If I don't, I'll call them; I remember having done that a few years ago. Regards, Martin. On Sun, 21 June 2015 at 19:41, Doug Ewell <d...@ewellic.org> wrote: Does anyone know what happened to the International Register of Coded Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is the repository for character sets registered for use with ISO 2022. The page was redirected to a general "we've reorganized our site" page a few weeks ago, and now the entire site seems to be down. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
On 2015/05/29 11:37, John wrote: If I had a large document that reused a particular character thousands of times, Then it would be either a very boring document (containing almost only that same character) or a very large document. would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way? If you want space efficiency, the best thing to do is to use generic compression. Many generic compression methods are available, many of them are widely supported, and all of them will deal with your case in a very efficient way. Given that it's been agreed that private use ranges are a good thing, That's not agreed upon. I'd say that the general agreement is that the private use ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). and given that we can agree that exchanging data is a good thing, Yes, but there are many other ways to do that besides Unicode. And for many purposes, these other ways are better suited. maybe something should bring those two things together. Just a thought. Just a 'non sequitur'. Regards, Martin.
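A minimal illustration of the generic compression point, using Python's zlib: a document that reuses the same longish sequence thousands of times compresses to a small fraction of its raw size, so no reference-back mechanism is needed at the character encoding level. The markup string is invented for the example:

import zlib

# A hypothetical "character" represented by some longish markup/escape,
# repeated thousands of times inside a document.
custom_char = "<span class='my-private-char-1234'>\ufffd</span>"
document = ("Some surrounding text. " + custom_char) * 5000

raw = document.encode("utf-8")
packed = zlib.compress(raw)
print(len(raw), len(packed), round(len(packed) / len(raw) * 100, 2), "%")
# The repeated "character" costs almost nothing after compression.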
Re: Emoji characters for food allergens
On 2015/07/29 23:27, Andrew West wrote: On 29 July 2015 at 14:42, William_J_G Overington wrote: My diet can include soya There already is: you can write "My diet can include soya". If you are likely to swell up and die if you eat a peanut (for example), you will not want to trust your life to an emoji picture of a peanut which could be mistaken for something else Yes, in the worst case for something like "I like peanuts". or rendered as a square box for the recipient. There may be a case to be made for encoding symbols for food allergens for labelling purposes, but there is no case for encoding such symbols as a form of symbolic language for communication of dietary requirements. Andrew
Re: Mark-up to Indicate Words
Hello Richard, On 2015/07/15 16:49, Richard Wordingham wrote: What mark-up schemes exist to show that a sequence of letters and combining marks constitutes a single word? Such mark-up would be useful when using spell checkers. At present, I use U+2060 WORD JOINER (WJ) to indicate the absence of a word boundary. (Systematic marking of boundaries using ZWSP is not popular with users, and is normally not used in Thai - it's not supported in their national or Windows 8-bit encodings.) However, it seems likely that when Unicode 8.00 is defined in August, WJ will suppress line breaks but not word breaks. There would still be the limitation that mark-up is not available in plain text. It appears that, for example, Open Document Format has no mark-up to indicate word boundaries, relying instead on the overrides of the word boundary detection algorithms being stored at character level. I'd suggest looking at higher-end formats such as DITA or TEI (Text Encoding Initiative). Regards, Martin. Richard. .
Re: A Bulldog moves on
Hello Doug, Thanks for making us aware of this very sad event. Michael did a lot for Unicode, and fought bravely with his illness. I hope we can all remember him this week at the Unicode Conference, where he gave so many amazing talks. I also hope that somebody somehow will be able to conserve all his tremendously instructive and funny blogs. Regards, Martin. On 2015/10/25 07:57, Doug Ewell wrote: I wish this day had never come. http://obits.dignitymemorial.com/dignity-memorial/obituary.aspx?n=Michael-Kaplan=4246=176192738=ec6b8cda-c4b1-4b5f-9422-925b1e09a03a -- Doug Ewell | http://ewellic.org | Thornton, CO .