RE: Emoji boom?

2019-05-01 Thread Phillips, Addison via Unicode
Why is this surprising? Encoding a script is many, many orders of magnitude more complex than encoding emoji. This is especially true given that the scripts that remain unencoded are largely used by small populations (or, in the case of historic scripts, by *no* population at all). It is a

RE: Unicode in the Curriculum?

2015-12-30 Thread Phillips, Addison
> A few months ago I asked a class of 140+ first year Computer Science programme and Joint programme students -
>
> Who has heard of Unicode?
I do a similar survey whenever I teach the remedial I18N and Unicode classes at Amazon. When I ask if software developers *ever* received any formal

RE: Concise term for non-ASCII Unicode characters

2015-09-20 Thread Phillips, Addison
I agree, although I note that sometimes the additional (redundant) specificity of "non-7-bit-ASCII characters" is needed when talking to people unclear on what "ASCII" means. Addison

RE: Usage stats?

2015-03-27 Thread Phillips, Addison
What you might be looking for would be the CLDR project’s “exemplar sets” (see for example [1]), which describes which characters are customarily used for a given language and which are sometimes used. However, this is not the same thing as statistical distribution. One of the points of Unicode

RE: Unicode Sets in 'Unicode Regular Expressions'

2014-05-27 Thread Phillips, Addison
A Unicode set in this context means a set of code points. This is discussed in section 1.2:
--
This is done by providing syntax for sets of characters based on the Unicode character properties, and allowing them to be mixed with lists and ranges of individual code points.
--
More generally,
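
As a rough illustration of the idea in section 1.2, a set of code points can be built from a character property and then mixed with explicit ranges and individual code points. The sketch below uses Python's unicodedata module. The helper names property_set and code_point_range are hypothetical, and the property data is whatever Unicode version the Python build ships with, so membership may differ between versions.

```python
import sys
import unicodedata


def property_set(category_prefix):
    """Hypothetical helper: all code points whose General_Category
    starts with the given prefix (for example "Lu" or "L")."""
    return {
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(category_prefix)
    }


def code_point_range(first, last):
    """Hypothetical helper: the inclusive range of code points first..last."""
    return set(range(ord(first), ord(last) + 1))


# Roughly what a set expression such as [\p{Lu}a-c_] denotes:
# a property-based set, a range, and a single listed code point.
uppercase = property_set("Lu")
mixed = uppercase | code_point_range("a", "c") | {ord("_")}

print(ord("A") in mixed)  # True: uppercase letter
print(ord("b") in mixed)  # True: inside the range a-c
print(ord("d") in mixed)  # False: lowercase and outside the range
```

The result is an ordinary set of integers, which matches the description of a Unicode set as a set of code points rather than a set of strings or glyphs.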

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Phillips, Addison
Not if the limit is counted in characters and not in bytes. Twitter, for example, counts code points in the NFC representation of a tweet.
Doug Ewell d...@ewellic.org wrote:
> Andre Schappo wrote:
>> U+2026 is useful for microblogs when one is looking to save characters
> Not if the microblog is in

Re: Origin of Ellipsis (was: RE: Empty set)

2013-09-15 Thread Phillips, Addison
Actually, that's my bad: I meant to type scalar value.
Stephan Stiller stephan.stil...@gmail.com wrote:
> On 9/15/2013 3:07 PM, Phillips, Addison wrote:
>> Not if the limit is counted in characters and not in bytes. Twitter, for example, counts code points in the NFC representation of a tweet
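
The difference between counting bytes and counting characters after NFC normalization can be sketched in Python. This is only an illustration of the two ways of counting, not Twitter's actual implementation. The function names nfc_length and utf8_length are hypothetical. Python's len() counts code points, which for well-formed text (no lone surrogates) is the same as the number of scalar values.

```python
import unicodedata


def nfc_length(text):
    """Hypothetical helper: number of code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


def utf8_length(text):
    """Hypothetical helper: number of bytes in the UTF-8 encoding."""
    return len(text.encode("utf-8"))


ellipsis = "\u2026"    # HORIZONTAL ELLIPSIS
three_dots = "..."     # three FULL STOP characters
e_acute = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

print(nfc_length(ellipsis), utf8_length(ellipsis))      # 1 3
print(nfc_length(three_dots), utf8_length(three_dots))  # 3 3
print(nfc_length(e_acute), utf8_length(e_acute))        # 1 3
```

Under a character-based limit U+2026 costs one unit where three full stops cost three. Under a byte-based limit in UTF-8 both cost three bytes, so nothing is saved.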

RE: Can a single text document use multiple character encodings?

2013-08-28 Thread Phillips, Addison
What kind of document do you mean? For Web formats (HTML, etc.), the answer is no. Addison Addison Phillips Globalization Architect (Amazon Lab126) Chair (W3C I18N WG) Internationalization is not a feature. It is an architecture. -Original Message- From:

RE: What to backup after corruption of code units?

2013-08-27 Thread Phillips, Addison
Back up here refers to decrementing the pointer in the string. If you have a string consisting of the following UTF-16 code units, for example:

00C0 0020 20AC D800 DC00 00C5
0    1    2    3    4    5

If you set the pointer to code unit number 4 (counting from 0), you'll be
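
In the example string, code unit 4 is DC00, a trailing (low) surrogate, and code unit 3 is D800, a leading (high) surrogate. A minimal sketch of backing up in that situation follows. It assumes the code units are held in a Python list of integers, and the function names are hypothetical.

```python
def is_high_surrogate(unit):
    """True for a leading surrogate code unit (D800..DBFF)."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    """True for a trailing surrogate code unit (DC00..DFFF)."""
    return 0xDC00 <= unit <= 0xDFFF


def back_up_to_boundary(units, index):
    """Hypothetical helper: if index points at the trailing half of a
    surrogate pair, decrement it so that it points at the leading half.
    Otherwise return index unchanged."""
    if (
        index > 0
        and is_low_surrogate(units[index])
        and is_high_surrogate(units[index - 1])
    ):
        return index - 1
    return index


units = [0x00C0, 0x0020, 0x20AC, 0xD800, 0xDC00, 0x00C5]

print(back_up_to_boundary(units, 4))  # 3: moved back to the leading surrogate
print(back_up_to_boundary(units, 3))  # 3: already at the start of the pair
print(back_up_to_boundary(units, 5))  # 5: an ordinary BMP code unit
```

A lone trailing surrogate with no leading surrogate before it is left where it is, since there is no pair to back up into.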

RE: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-03 Thread Phillips, Addison
Martin wrote: Quite a few people might expect their Japanese filenames to appear with a Japanese font/with Japanese glyph variants, and their Chinese filenames to appear with a Chinese font/Chinese glyph variants. But that's never how this was planned, and that's not how it works today.

RE: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-01 Thread Phillips, Addison
Hi Roger, (This is a personal response, with chair hat off) It is very useful to read the big yellow box at the start of that document, which says: -- This version of this document was published to indicate the Internationalization Core Working Group's intention to substantially alter or

RE: RLI and bdi, and how to get an update of changes

2013-01-15 Thread Phillips, Addison
Code points 2066, 2067, and 2068 are unassigned. I presume you mean U+202B RIGHT-TO-LEFT EMBEDDING (RLE) and U+202C POP DIRECTIONAL FORMATTING. As Roozbeh pointed out, he means the characters added that provide bidi isolation. The W3C Internationalization WG recommends that you use markup in

RE: Are there Unicode processors?

2013-01-07 Thread Phillips, Addison
Unicode processor?? If what you're looking for is code that breaks text into grapheme clusters/words/lines/etc., that's called text segmentation and is described in: http://www.unicode.org/reports/tr29/ But you go on to talk about characters and their properties... if you're looking

RE: latin1 decoder implementation

2012-11-16 Thread Phillips, Addison
The block is actually named for it. See:
http://www.unicode.org/charts/PDF/U0080.pdf
This FAQ talks about it:
http://www.unicode.org/faq/blocks_ranges.html
Finally, p. 217 of the standard actually says so explicitly:
http://www.unicode.org/versions/Unicode6.2.0/ch07.pdf
Addison

RE: Searching data: map countries to scripts

2012-08-21 Thread Phillips, Addison
Doug opined: I can state that for Israel the scripts in common use are Hebrew, Latin (mainly for English but also for several other languages), Arabic and Cyrillic. I do believe that Israel and Palestine (the Gaza Strip and West Bank areas) also use the Greek alphabet, because there

RE: Klingon on Unicode site?

2012-04-03 Thread Phillips, Addison
Asmus opined: I think Yucca has a point. When the document is in English, it doesn't make sense to display the footer date in the system locale. The locale used for this function should either be that of the site, or that of the page. AP And hence the work to internationalize JavaScript and

Re: Japanese font on Non-Japanese Android phones

2011-10-08 Thread Phillips, Addison
That's for search analysis, not rendering. Sent from my iPhone On Oct 8, 2011, at 7:45 AM, Andreas Prilop prilop4...@trashmail.net wrote: On Fri, 7 Oct 2011, Gerrit wrote: So if somebody from Google reads this, [...] Additionally, if the standard Android web browser could then use the

RE: UNICODE version of _T(x) macro

2010-11-22 Thread Phillips, Addison
sowmya satyanarayana sowmya underscore satyanarayana at yahoo dot com wrote: Taking this, what is the best way to define _T(x) macro of UNICODE version, so that my strings will always be 2 byte wide character? Unicode characters aren't always 2 bytes wide. Characters with values
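
To see that a character is not always a single 2-byte unit, one can count the UTF-16 code units a character needs. The sketch below is in Python rather than the C/C++ of the original question, and the function name utf16_code_units is hypothetical. It relies on the fact that code points above U+FFFF are represented in UTF-16 as a surrogate pair, that is, two 16-bit code units.

```python
def utf16_code_units(text):
    """Hypothetical helper: number of 16-bit code units needed to
    represent text in UTF-16 (each code unit is 2 bytes)."""
    return len(text.encode("utf-16-le")) // 2


print(utf16_code_units("A"))           # 1: U+0041
print(utf16_code_units("\u20ac"))      # 1: U+20AC EURO SIGN
print(utf16_code_units("\U00010000"))  # 2: U+10000 needs a surrogate pair
```

Any string type built from 16-bit units therefore has to treat some characters as two units, whatever macro is used to declare the literals.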

RE: composite graphemes

2010-10-13 Thread Phillips, Addison
Hello, UAX #29 (Unicode Text Segmentation) discusses this at length. See especially the section on grapheme cluster boundaries: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries Certainly a function that returns the first code point of a string is different from one that finds
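
A small Python sketch shows why the first code point of a string may not be a whole user-perceived character. The second function is a deliberately simplified approximation, NOT the UAX #29 algorithm: it only attaches following characters with a non-zero canonical combining class to the base, and ignores Hangul, emoji sequences, and the other rules in the report. Both function names are hypothetical.

```python
import unicodedata


def first_code_point(text):
    """Hypothetical helper: the first code point of text, or ""."""
    return text[:1]


def first_cluster_approx(text):
    """Hypothetical helper: the first code point plus any combining
    marks that follow it. A rough stand-in for a grapheme cluster,
    not an implementation of UAX #29."""
    if not text:
        return ""
    end = 1
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return text[:end]


sample = "e\u0301x"  # "e" + COMBINING ACUTE ACCENT + "x"

print(len(first_code_point(sample)))      # 1: just "e", the accent is lost
print(len(first_cluster_approx(sample)))  # 2: "e" together with its accent
```

For real text a library that implements the UAX #29 rules should be used instead of this approximation.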