Re: Unicode for Windows CE
Thanks for the link. It is good to know that MSKLC can be used for creating a keyboard driver for WinCE. But is it true that only TrueType fonts can be used? No OTF?

Thanks and regards,
Mustafa Jabbar

Quoting Christopher John Fynn [EMAIL PROTECTED]:

> Suggest you check the Global Development pages at Microsoft, http://www.microsoft.com/globaldev/default.mspx (links on the right of the page) and http://www.microsoft.com/globaldev/getwr/wincei18n.mspx, to find out about Unicode support in Windows CE, Windows CE fonts, and creating keyboard layouts (IMEs) for Win CE. You could have found this out in an instant by searching for "Windows CE Unicode" on Microsoft's web site.
>
> -- Christopher J. Fynn
>
> - Original Message -
> From: [EMAIL PROTECTED]
> To: Patrick Andries [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Sent: Saturday, November 29, 2003 4:51 AM
> Subject: Unicode for Windows CE
>
> Dear all, can anyone tell me how I can have Unicode support in Windows CE? What are the tools for creating OTF fonts and keyboard drivers? Thanks and regards, Mustafa Jabbar

- This mail sent through bangla.net, The First Online Internet Service Provider In Bangladesh
Re: numeric properties of Nl characters in the UCD
Arcane Jill wrote:

> PLEASE don't quote me out of context, Doug. You can't quote "This being so" without also quoting what the "This" predicate was upon which the conclusions were based. As it happens, it was subsequently pointed out to me that the "This" predicate was, in fact, NOT so, therefore it is perfectly obvious that the conclusion will no longer follow from the predicate. What's more, the post from which you were quoting was my ASKING for the Unicode definition of "decimal digit", not ascribing one. The fact that I said IF it is defined in such-and-such a way in Unicode THEN xyz follows does NOT imply that xyz follows regardless of the if condition.
>
> I don't like being misquoted, quoted out of context, or being accused of taking positions which I do not take, and I really don't like it when someone actually argues against a position which I do not take, as though I had said something I hadn't. (That's usually considered a straw-man argument.) I humbly request that in future people respond to what I have actually said in full, instead of to part of it taken completely out of context; I'd feel a lot happier.
>
> Of course I know what "decimal" means in everyday language. Do you think I'm an idiot? Please stop treating me as one.

At no point did I mean to imply this, nor did I make any personal attack on anyone. That should be obvious to either the casual reader or to anyone familiar with the way I've conducted business on this list for the past six years. I'd appreciate, just as I'm sure Jill would, having my intentions interpreted fairly and reasonably.

I am probably guilty of misunderstanding Jill's post and jumping to a conclusion based on a single sentence. Here is the full context, from Jill's post dated 2003-11-26T23:57:

> Note especially the number fields for the hex digits: they are numeric, they are even digits, but they're not *decimal* digits. ...which brings me back to my question (which no-one's answered yet).
> What do the properties "digit" versus "decimal digit" actually MEAN? Is it possible for someone to give a PRECISE definition? I mean, it seems pretty clear that "decimal digit" does NOT mean "radix-ten digit" (otherwise CIRCLED DIGIT TWO would be a decimal digit, and it isn't). I can only assume that the INTENDED meaning of what is (erroneously?) called "decimal digit" is a character which is permitted to play a part in a positional number system - thus 2 is a decimal digit because it can form part of the legal number 123, but CIRCLED DIGIT TWO is not, because a number containing it would not be legal. Am I even close?
>
> This being so, it is possible that the (misnamed) property "decimal digit" should also apply to Ewellic hex digits. They're not radix ten, but that's not what "decimal digit" means anyway. They ARE capable of being used in a positional number system.

The most precise definition available is probably the one in Section 4.6 of the Unicode Standard, titled "Numeric Value -- Normative" (TUS 4.0 p. 100, original emphasis retained):

"*Decimal digits* form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, not characters such as Roman numerals (1, 5 = 15 = fifteen, but I, V = IV = four), subscripts, or superscripts. Numbers other than decimal digits can be used in numerical expressions, but it is up to the users to determine the specialized uses."

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
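The three-way distinction being discussed here (decimal vs. digit vs. numeric) can be inspected directly. A minimal sketch using Python's unicodedata module, which tracks the UCD's numeric fields:

```python
import unicodedata

# UCD numeric fields, from narrowest to broadest:
#   decimal -> only digits that form decimal-radix numbers
#   digit   -> also circled, superscript, etc. digits
#   numeric -> anything with a numeric value (Roman numerals, fractions...)
for ch in "2\u2461\u00b2\u2163":  # '2', CIRCLED DIGIT TWO, SUPERSCRIPT TWO, ROMAN NUMERAL FOUR
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}:",
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )
```

DIGIT TWO carries all three values; CIRCLED DIGIT TWO and SUPERSCRIPT TWO have digit and numeric values but no decimal value; ROMAN NUMERAL FOUR has only a numeric value — matching the TUS definition quoted above.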
Re: Unicode for Windows CE
> Thanks for the link. It is good to know that MSKLC can be used for creating a keyboard driver for WinCE. But is it true that only TrueType fonts can be used? No OTF? Thanks and regards, Mustafa Jabbar

I doubt that PostScript-flavour OpenType fonts can be used, since that would require some form of Adobe Type Manager in Windows CE. Simple TrueType-flavour OpenType fonts that don't require Uniscribe probably work, but for complex-script layout for scripts such as Bangla / Bengali there would have to be the equivalent of USP10.DLL running in Windows CE - and I've never heard of anything like that. You'd have to try asking on the MS VOLT list or someone in Microsoft Typography.

Most of what I see listed on the MS web site is about support for East Asian (CJK) scripts in Win CE - nothing so far about any complex Indic or Arabic scripts. Personally I wouldn't expect support for complex scripts like Bengali to appear in Windows CE until some time after all the main complex scripts are fully supported in Windows XP. Uniscribe (USP10.DLL) is constantly being updated with support for new scripts, and it would seem to make sense to make a version for Win CE only once Uniscribe already has support for more or less all the scripts they plan to support. That is, unless there is a huge commercial demand for complex script support in Win CE and it is both practical and commercially worthwhile for them to implement it.

OpenType fonts for complex scripts on Windows CE would need very good hinting and ClearType to be usable, since text is rendered at a small size. There is probably also the issue of getting handwriting recognition for scripts like Bengali to work well, since that is the main input method for many CE devices.

- Chris
RE: Oriya: nndda / nnta?
-Original Message-
From: Michael Everson [mailto:[EMAIL PROTECTED]

> "Pronounced as you mean it" here refers to the reading rules, not the structure of the script.

That seems to me to be saying we should be encoding the structure of the script (a statement I'd agree with in general).

> It can't be a NNTA since that would assimilate to NNTTA.

Wouldn't it be more likely for a nasal to assimilate to an obstruent rather than the other way? (We say "impossible", not "intossible".) But that statement is following phonology, not the structure of the script. Your statements seem inconsistent to me.

The question is, do we encode something based on its shape, or based on the phonemes it represents? Following clear cases, the shape is that of TA. NN.TA is phonologically unlikely, though, whereas NN.TTA or NN.DDA is phonologically plausible; so, on the other hand, we could say it makes little sense to encode NN.TA, and so should encode this as NN.DDA. I guess I'd be inclined to go with that reasoning, though I have encountered an NN.DDA conjunct that uses a subjoined small DDA in a font (see attached); haven't encountered that in texts so far, though.

> Besides my book gives NNDDA explicitly as being made of NNA and DDA and has the same glyph.

OK, that's two sources that indicate this. I'll go with that.

> The book is Learn Oriya in 30 Days, a 150-page introductory grammar in the National Integration Language Series.

Thanks for the reference. I've tracked down a copy and it's on its way.

Peter Constable

attachment: sandnya_or_411.png
RE: MS Windows and Unicode 4.0 ?
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Patrick Andries

> Is there any plan for Microsoft to support Unicode 4.0, distribute with its operating system the corresponding fonts and update the corresponding Character Map tools/charts (Office and OS)?

Of course, there is a certain vagueness to the question surrounding the issue of what it means to say "product X supports Unicode 4.0". However you understand it, we are making steady progress in that direction, in that we are continuing to broaden support in all kinds of services the OS provides. Note that this might mean that we'll provide underlying support for a particular script if fonts and input methods are supplied from somewhere else.

I don't know when Character Map will be updated. What happens in Office - e.g. Insert|Symbol - someone else would have to answer, though I think that the Insert|Symbol dialog follows what's in the selected font for the characters it shows and for the scripts it lists as subsets (following the Unicode bitfield in the OS/2 table -- I'd need to do some testing to be sure).

Peter Constable
RE: Oriya: mba / mwa ?
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Everson

> I think the TDIL chart is wrong.

It seems reasonable that one should need extra persuasion to take the word of an American living in Ireland over Indians. (Sorry.)

> Traditionally (as in Learn Oriya in 30 Days) subjoined BA is used in this context although the reading rules say to pronounce it [w].

So, you're saying that all of these should be encoded as C + virama + BA?

> Now an original ligature of O and BA has been pressed into service

I've seen elsewhere that you've described this as a ligature involving O, but are you sure it's that? Note that the same shape is used for NYA and NNA (e.g. conjuncts for NN.NNA and SS.NNA).

> The traditional BA should be used for that unless we have better evidence than the TDIL newsletter that such should be the practice.

I could be convinced of that; but if people in India aren't convinced of that, the boat may not float.

Peter Constable
RE: Oriya: nndda / nnta?
At 12:32 -0800 2003-11-29, Peter Constable wrote:

>> "Pronounced as you mean it" here refers to the reading rules, not the structure of the script.
> That seems to me to be saying we should be encoding the structure of the script (a statement I'd agree with in general).

Sure.

>> It can't be a NNTA since that would assimilate to NNTTA.
> Wouldn't it be more likely for a nasal to assimilate to an obstruent rather than the other way? (We say "impossible", not "intossible".)

The dental t assimilates to the retroflex n.

> But that statement is following phonology, not the structure of the script. Your statements seem inconsistent to me.

I'm saying that the syllable NNTA isn't a probable syllable, because it would assimilate to NNTTA, while NNDDA is a phonetically normal syllable, which is the answer to your question.

> The question is, do we encode something based on its shape, or based on the phonemes it represents?

It's Brahmic. We encode according to the characters used to write the phonemes. The glyph shape is secondary.

> Following clear cases, the shape is that of TA.

The shape in my source shows the same shape for subjoined TA and DDA.

> NN.TA is phonologically unlikely, though, whereas NN.TTA or NN.DDA is phonologically plausible; so, on the other hand, we could say it makes little sense to encode NN.TA, and so should encode this as NN.DDA.

That's correct.

> I guess I'd be inclined to go with that reasoning, though I have encountered an NN.DDA conjunct that uses a subjoined small DDA in a font (see attached); haven't encountered that in texts so far, though.

Well. Where did you encounter it?

>> Besides my book gives NNDDA explicitly as being made of NNA and DDA and has the same glyph.
> OK, that's two sources that indicate this. I'll go with that.

Good.

>> The book is Learn Oriya in 30 Days, a 150-page introductory grammar in the National Integration Language Series.
> Thanks for the reference. I've tracked down a copy and it's on its way.

I'm sure it's in http://www.evertype.com/scriptbib.html

-- Michael Everson * * Everson Typography * * http://www.evertype.com
Need to update Technical Work page
I noticed the following on the Technical Work page on the Unicode Web site, at http://www.unicode.org/techwork.html:

"The Unicode Standard was the basis for the Universal Character Set, two-octet form (UCS-2) of ISO/IEC 10646. The Unicode Standard's 65,536 code values are the first 65,536 code values of ISO 10646."

I wonder if this passage is very old, predating the full acceptance of the surrogate mechanism in the Unicode Standard. I suggest this text be revised to avoid perpetuating the common misconception that Unicode is a 16-bit-only standard, or that Unicode and ISO 10646 have different repertoires.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
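The surrogate mechanism mentioned here is what makes the "16-bit-only" characterization obsolete: a supplementary-plane character occupies one code point but two 16-bit code units in UTF-16. A minimal sketch in Python:

```python
# U+10000 (LINEAR B SYLLABLE B008 A) is the first character beyond the BMP.
ch = "\U00010000"

# In UTF-16 it is represented by the surrogate pair D800 DC00:
# two 16-bit code units encoding a single code point.
utf16 = ch.encode("utf-16-be")
print(utf16.hex())  # -> 'd800dc00'
print(len(ch))      # -> 1  (one code point)
```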
Re: Compression through normalization
Someone, I forgot who, questioned whether converting Unicode text to NFC would actually improve its compressibility, and asked if any actual data was available.

Certainly there is no guarantee that normalization would *always* result in a smaller file. A compressor that took advantage of normalization would have to determine whether there would be any benefit. One extremely simple example would be text that consisted mostly of Latin-1, but contained U+212B ANGSTROM SIGN and no other characters from that block. By converting this character to its canonical equivalent U+00C5:

* UTF-8 would use 2 bytes instead of 3.
* SCSU would use 1 byte instead of 2.
* BOCU-1 would use 1 or 2 bytes instead of always using 2.

A longer and more realistic case can be seen in the sample Korean file at:

http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt

This file is in EUC-KR, but can easily be converted to Unicode using recode, SC UniPad, or another converter. It consists of 3,317,215 Unicode characters, over 96% Hangul syllables and Basic Latin spaces, full stops, and CRLFs. When broken down into jamos (i.e. converting from NFC to NFD), the character count increases to 6,468,728.

The entropy of the syllables file is 6.729, yielding a Huffman bit count of 22.3 million bits. That's the theoretical minimum number of bits that could be used to encode this file, character by character, assuming a Huffman or arithmetic coding scheme designed to handle 16- or 32-bit Unicode characters. (Many general-purpose compression algorithms can do better.) The entropy of the jamos file is 4.925, yielding a Huffman bit count of 31.8 million bits, almost 43% larger.

When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller than the jamos file by 55%, 17%, and 32% respectively. General-purpose algorithms tend to reduce the difference, but PKZip (using deflate) compresses the syllables file to an output 9% smaller than that of the jamos file.
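The entropy figures quoted above are straightforward to reproduce. A sketch (not the exact tool used for the measurements) of per-character Shannon entropy and the resulting Huffman-style lower bound:

```python
import math
from collections import Counter

def entropy_bits_per_char(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def min_bits(text: str) -> float:
    """Lower bound on output size for a character-by-character coder."""
    return len(text) * entropy_bits_per_char(text)

# e.g. 3,317,215 characters at 6.729 bits/char ~= 22.3 million bits
```

The jamos file has *lower* entropy per character (fewer distinct symbols), but so many more characters that the total bit count is larger — which is the whole point of the comparison.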
Using bzip2, the compressed syllables file is 2% smaller.

So we can at least say that Korean, which can be normalized from NFD to NFC algorithmically and without the use of long tables of equivalents or exclusions, can consistently be compressed to a smaller size after such normalization than before.

Whether a silent normalization to NFC can be a legitimate part of Unicode compression remains in question. I notice the list is still split as to whether this process changes the text (because checksums will differ) or not (because C10 says processes must consider the text to be equivalent).

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
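Both effects described in this message are easy to observe with Python's unicodedata module; a small sketch (byte counts are for UTF-8):

```python
import unicodedata

# U+212B ANGSTROM SIGN normalizes (NFC) to U+00C5: 3 UTF-8 bytes become 2.
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"
assert len("\u212b".encode("utf-8")) == 3
assert len("\u00c5".encode("utf-8")) == 2

# A precomposed Hangul syllable decomposes (NFD) into conjoining jamos:
# the character count triples, and so does the UTF-8 byte count.
syllable = "\ud55c"                      # HANGUL SYLLABLE HAN
jamos = unicodedata.normalize("NFD", syllable)
assert len(jamos) == 3                   # choseong + jungseong + jongseong
assert len(syllable.encode("utf-8")) == 3
assert len(jamos.encode("utf-8")) == 9
```

The Hangul case needs no lookup tables because syllable composition and decomposition are fully algorithmic, which is why the message singles out Korean as a safe case.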
RE: Oriya: mba / mwa ?
At 13:17 -0800 2003-11-29, Peter Constable wrote:

>> I think the TDIL chart is wrong.
> It seems reasonable that one should need extra persuasion to take the word of an American living in Ireland over Indians. (Sorry.)

Peter, I would take those TDIL publications with a very large grain of salt. Textual evidence is not given, and there's all sorts of stuff which really doesn't fit in well with the way we do things in Unicode. Like their *U+0B3A ORIYA INVISIBLE LETTER. Just because it comes from India doesn't mean it's not revisionist.

>> Traditionally (as in Learn Oriya in 30 Days) subjoined BA is used in this context although the reading rules say to pronounce it [w].
> So, you're saying that all of these should be encoded as C + virama + BA?

Yes, I am. KA + BA = KBA, pronounced [kwa]. That's what Learn Oriya in 30 Days shows explicitly.

>> Now an original ligature of O and BA has been pressed into service
> I've seen elsewhere that you've described this as a ligature involving O, but are you sure it's that?

Yes, I am.

> Note that the same shape is used for NYA and NNA (e.g. conjuncts for NN.NNA and SS.NNA).

Be thou not deceived by the glyph shapes. The etymology is O + BA = WA, not NYA + BA.

>> The traditional BA should be used for that unless we have better evidence than the TDIL newsletter that such should be the practice.
> I could be convinced of that; but if people in India aren't convinced of that, the boat may not float.

WA is an innovation, unattested in earlier Oriya. You won't find it in Learn Oriya in 30 Days, for instance. Yet syllables in -[wa] have been written in Oriya for a long time, with BA. Note that a historical VA exists and predates the WA, and the TDIL does not take this into account. We did encode it, however. I have just ordered two large Oriya dictionaries which should arrive in a fortnight.

-- Michael Everson * * Everson Typography * * http://www.evertype.com
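For concreteness, the "C + virama + BA" encoding advocated here is a plain code-point sequence from the Oriya block. A sketch using Python's unicodedata to show the names (the [kwa] reading is the claim from the message, not a property of the code points):

```python
import unicodedata

# KA + VIRAMA + BA: the sequence recommended above for the KBA conjunct.
kba = "\u0b15\u0b4d\u0b2c"
for ch in kba:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0B15 ORIYA LETTER KA
# U+0B4D ORIYA SIGN VIRAMA
# U+0B2C ORIYA LETTER BA
```

Whether a renderer shows this as a subjoined-BA conjunct is a font/shaping-engine matter; the encoded sequence is the same either way, which is the point of encoding by character rather than by glyph.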
Re: Unicode for Windows CE
I also sincerely doubt that MSKLC will create keyboards that will work on a CE device, to tell you the truth. Maybe they do, but they have never been tested there and I would be surprised if they had no problems (never forget the First Tester's Axiom!).

MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

- Original Message -
From: Christopher John Fynn [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Saturday, November 29, 2003 10:35 AM
Subject: Re: Unicode for Windows CE
Brahmic list ? (was: Oriya: mba / mwa ?)
Michael Everson writes:

> Peter Constable wrote:
>> I think the TDIL chart is wrong.
> It seems reasonable that one should need extra persuasion to take the word of an American living in Ireland over Indians. (Sorry.)

Isn't there a specific list for Brahmic scripts? ([EMAIL PROTECTED]?) The number of issues with these scripts is about to explode if Indian sources start publishing new, undated references for their encoding and conversion to Unicode, including proposed changes of orthographic rules to better match the phonology, the tradition, or the inclusion of foreign terms.

SIL.org is also working quite actively in this area, in relation to a proposed extended UTR 22 reference for transcoding. But I'd like to see discussions about proposed UTR 22 changes on the main Unicode list.

There are not many issues with Thai, as it has long been standardized in TIS-620, which was the basis of the Unicode encoding (though regrettably before UTR 22 was produced, which would have allowed a better logical encoding without needing lexical dictionaries to parse Thai text). Semantic analysis of Thai text is an interesting issue in itself, but not for the correct way to encode Thai words (the TIS-620 rules are clear, as it mostly encodes glyphs, expecting that readers will interpret the written text using their knowledge of the language). So Thai discussions can remain on the main list.

I also think that Tibetan issues should be discussed on that list, although its composition model is very different from the Brahmic scripts of India, unless there's a specific rapporteur group for it. But not Han issues, which should be discussed in their own list in relation to the IRG working group (which already works on its own technical reports as well as the standardization of the extended repertoire).
The recent issues I have read seem to multiply the number of Brahmic conjuncts we have to deal with, possibly in relation to new normalization forms (beyond NFC and NFD); as with Hebrew, there's probably a need for work on these scripts in a separate discussion list, with the aim of producing a technical report in accordance with Indian sources. Other related South Asian scripts should be there too: Lao, Khmer...

My recent work with the UCA and collation, as well as UTR 22 and phonological analysis of many texts, tends to promote the idea of new normalization forms in all areas where NFC/NFD or even NFKC/NFKD are failing (we can't change those, due to the stability pact). The UCA and collation in general seem to create a new coded character set: one made of ordered collation weights belonging to separate ranges for each collation level, these ranges being sorted in the reverse order of the collation level.

I have experimented with a collation algorithm that implements the UCA using the same system as the UCD decompositions, but with added (and sometimes modified) decompositions. This system creates new code points needed to represent only font compatibility differences, ligatures, or alternate forms, as a decomposition of the existing compatibility character into more basic characters exposed with primary differences in the UCA, plus these new characters given variable collation weights, which may be ignorable in applications that ignore extra levels. This encoding uses a 31-bit code space, which is still highly compressible, and is still representable with the UTF-8 TES (though the values are not Unicode code points) or a similar ad hoc representation. I am currently trying to adapt this system to work with UTR 22 transcodings, and I am testing it against Brahmic scripts, Hebrew, and Latin. This is very promising, and my next step will be to handle decomposition of Han characters into their component radicals and strokes.
I do think that it is possible to handle almost all UCA and UTR 22 rules by using UTR 22 itself and decomposition rules in a simple table matching nearly the format of the UCD. But all these discussions and encoding ambiguities of Brahmic scripts are polluting my work. I am quite close to setting aside my current work on them until some agreement is found, notably within a revision of ISCII, if there is one in preparation, that will give more precise rules. For now it is impossible for me to adapt my model to the (sometimes contradictory) encoding solutions proposed by distinct people.
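The "weights in separate ranges per level, levels in sequence" idea described above can be illustrated with a toy sort key built the way UCA sort keys are: all primary weights, a level separator, then the secondary weights. The weight table here is invented purely for illustration; it is not real UCA data.

```python
# Toy two-level collation table: (primary, secondary) weight per character.
# Invented weights, for illustration only: 'a' and 'á' share a primary
# weight and differ only at the secondary (accent) level.
WEIGHTS = {"a": (1, 1), "\u00e1": (1, 2), "b": (2, 1)}

def sort_key(s: str) -> tuple:
    primaries = tuple(WEIGHTS[c][0] for c in s)
    secondaries = tuple(WEIGHTS[c][1] for c in s)
    # 0 acts as the level separator: secondary differences can only
    # break ties after ALL primary weights have compared equal.
    return primaries + (0,) + secondaries

words = ["b", "\u00e1b", "ab"]
print(sorted(words, key=sort_key))  # -> ['ab', 'áb', 'b']
```

Because the separator sorts below every real weight, a shorter string wins ties at each level, and ignoring a trailing level is as simple as truncating the key — the property that makes extra levels "ignorable" as described above.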
Re: Brahmic list ? (was: Oriya: mba / mwa ?)
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

> I've tried to experiment a collation algorithm to implement UCA by the same system as used in UCD decompositions, but with added (and sometimes modified) decompositions. This system creates new code points needed to represent only font compatibility differences, ligatures, or alternate forms, as a decomposition of the existing compatibility character, into more basic characters exposed with primary differences in UCA, plus these new characters given variable collation weights, which may be ignorable in applications which ignore extra levels. This encoding uses a 31 bit code space, which is still highly compressible, but still representable with the UTF-8 TES (but they are not containing Unicode code points) or similar ad-hoc representation.

Please don't use UTF-8 to encode anything other than Unicode code points.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Re: Brahmic list ? (was: Oriya: mba / mwa ?)
Philippe Verdy [EMAIL PROTECTED] wrote:

> I also think that Tibetan issues should be discussed in that list, despite its composition model is very different from Brahmic scripts of India, unless there's a specific rapporteur group for it.

There already is a specific list for Tibetan script issues: [EMAIL PROTECTED]