Re: Unicode for Windows CE

2003-11-29 Thread mjabbar

Thanks for the link. It is good to know that MSKLC can be used for creating 
a keyboard driver for WinCE. But is it true that only TrueType fonts can be 
used? No OTF?
Thanks and regards
Mustafa Jabbar


Quoting Christopher John Fynn [EMAIL PROTECTED]:

 Suggest you check the Global Development pages at Microsoft:
 http://www.microsoft.com/globaldev/default.mspx   (links on the right of the
 page) and
 http://www.microsoft.com/globaldev/getwr/wincei18n.mspx
 to find out about Unicode support in Windows CE, Windows CE fonts, and
 creating keyboard layouts (IMEs) for Win CE.
 
 You could have found this out in an instant by searching for "Windows CE
 Unicode" on Microsoft's web site.
 
 --
 Christopher J. Fynn
 
 
 
 - Original Message - 
 From: [EMAIL PROTECTED]
 To: Patrick Andries [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Sent: Saturday, November 29, 2003 4:51 AM
 Subject: Unicode for Windows CE
 
 
  Dear all,
  Can anyone tell me how I can have Unicode support in Windows CE? What are the
  tools for creating OTF fonts and keyboard drivers?
  Thanks and regards
  Mustafa Jabbar
 
 
 
 
 
 







Re: numeric properties of Nl characters in the UCD

2003-11-29 Thread Doug Ewell
Arcane Jill wrote:

 PLEASE don't quote me out of context, Doug. You can't quote "This
 being so" without also quoting what the "This" predicate was upon
 which the conclusions were based. As it happens, it was subsequently
 pointed out to me that the "This" predicate was, in fact, NOT so,
 therefore it is perfectly obvious that the conclusion will no longer
 follow from the predicate. What's more, the post from which you were
 quoting was my ASKING for the Unicode definition of "decimal digit",
 not ascribing one. The fact that I said IF it is defined in such-and-
 such a way in Unicode THEN xyz follows does NOT imply that xyz
 follows regardless of the IF condition. I don't like being
 misquoted, quoted out of context, or being accused of taking positions
 which I do not take, and I really don't like it when someone actually
 argues against a position which I do not take, as though I had said
 something I hadn't. (That's usually considered a "straw man"
 argument.) I humbly request that in future people respond to what I
 have actually said in full, instead of to part of it taken completely
 out of context; then I'd feel a lot happier.

 Of course I know what decimal means in everyday language. Do you
 think I'm an idiot? Please stop treating me as one.

At no point did I mean to imply this, nor did I make any personal attack
on anyone.  That should be obvious both to the casual reader and to
anyone familiar with the way I've conducted business on this list for
the past six years.  I'd appreciate, just as I'm sure Jill would, having
my intentions interpreted fairly and reasonably.

I am probably guilty of misunderstanding Jill's post and jumping to a
conclusion based on a single sentence.  Here is the full context, from
Jill's post dated 2003-11-26T23:57:

 Note especially the number fields for the hex digits: they are
 numeric, they are even digits, but they're not *decimal* digits.

 ...which brings me back to my question (which no one's answered yet).
 What do the properties "digit" versus "decimal digit" actually MEAN?
 Is it possible for someone to give a PRECISE definition? I mean, it
 seems pretty clear that "decimal digit" does NOT mean "radix ten
 digit" (otherwise "circled digit 2" would be a decimal digit, and it
 isn't). I can only assume that the INTENDED meaning of what is
 (erroneously?) called "decimal digit" is "a character which is
 permitted to play a part in a positional number system" - thus 2 is
 a decimal digit because it can form part of the legal number 123,
 but "circled digit 2" is not because 1②3 is not a legal number. Am I
 even close?

 This being so, it is possible that the (misnamed) property "decimal
 digit" should also apply to Ewellic hex digits. They're not radix ten,
 but that's not what "decimal digit" means anyway. They ARE capable of
 being used in a positional number system.

The most precise definition available is probably the one in Section 4.6
of the Unicode Standard, titled "Numeric Value -- Normative" (TUS 4.0, p.
100; original emphasis retained):

 *Decimal digits* form a large subcategory of numbers consisting of
 those digits that can be used to form decimal-radix numbers.  They
 include script-specific digits, not characters such as Roman
 numerals (1, 5 = 15 = fifteen, but I, V = IV = four), subscripts,
 or superscripts.  Numbers other than decimal digits can be used in
 numerical expressions, but it is up to the users to determine the
 specialized uses.
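
In concrete terms (a minimal sketch using Python's unicodedata module,
which exposes these UCD distinctions; this simply demonstrates the
standard library's behavior):

    import unicodedata

    # U+0032 DIGIT TWO has Numeric_Type=Decimal: a true decimal digit.
    print(unicodedata.decimal("2"))       # 2
    # U+2461 CIRCLED DIGIT TWO has Numeric_Type=Digit: it has a digit
    # value but no decimal value, since 1②3 is not a valid number.
    print(unicodedata.digit("\u2461"))    # 2
    try:
        unicodedata.decimal("\u2461")
    except ValueError:
        print("not a decimal digit")
    # U+2162 ROMAN NUMERAL THREE has Numeric_Type=Numeric only.
    print(unicodedata.numeric("\u2162"))  # 3.0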

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Unicode for Windows CE

2003-11-29 Thread Christopher John Fynn

 Thanks for the link. It is good to know that MSKLC can be used for creating
 a keyboard driver for WinCE. But is it true that only TrueType fonts can be
 used? No OTF?
 Thanks and regards
 Mustafa Jabbar

I doubt that PostScript-flavour OpenType fonts can be used, since that would
require some form of Adobe Type Manager in Windows CE. Simple TrueType-flavour
OpenType fonts that don't require Uniscribe probably work, but for complex
script layout for scripts such as Bangla / Bengali there would have to be the
equivalent of USP10.DLL running in Windows CE - and I've never heard of
anything like that.

You'd have to try asking on the MS VOLT list or asking someone in Microsoft
Typography.

Most of what I see listed on the MS web site is about support for East Asian
(CJK) scripts in Win CE - nothing so far about any complex Indic or Arabic
scripts.

Personally, I wouldn't expect support for complex scripts like Bengali to appear
in Windows CE until some time after all the main complex scripts are fully
supported in Windows XP.  Uniscribe (USP10.DLL) is constantly being updated
with support for new scripts, and it would seem to make sense to make a version
for Win CE only once Uniscribe already has support for more or less all the
scripts they plan to support - unless, that is, there is a huge commercial
demand for complex script support in Win CE and it is both practical and
commercially worthwhile for them to implement it.

OpenType fonts for complex scripts on Windows CE would need very good hinting
and ClearType to be usable, since text is rendered at a small size. There is
probably also the issue of getting handwriting recognition for scripts like
Bengali to work well, since that is the main input method for many CE devices.

 - Chris




RE: Oriya: nndda / nnta?

2003-11-29 Thread Peter Constable
 -Original Message-
 From: Michael Everson [mailto:[EMAIL PROTECTED]


 "Pronounced as you mean it" here refers to the
 reading rules, not the structure of the script.

That seems to me to be saying we should be encoding the structure of the
script (a statement I'd agree with in general).


 It can't be a NNTA
 since that would assimilate to NNTTA.

Wouldn't it be more likely for a nasal to assimilate to an obstruent
rather than the other way? (We say 'impossible', not 'intossible'.)

But that statement is following phonology, not the structure of the
script. Your statements seem inconsistent to me.

The question is, do we encode something based on its shape, or based on
the phonemes it represents? Following clear cases, the shape is that of
TA. NN.TA is phonologically unlikely, though, whereas NN.TTA or NN.DDA
is phonologically plausible; so, on the other hand, we could say it
makes little sense to encode NN.TA, and we should encode this as NN.DDA.

I guess I'd be inclined to go with that reasoning, though I have
encountered an NN.DDA conjunct that uses a subjoined small DDA in a font
(see attached); haven't encountered that in texts so far, though.


 Besides my book gives NNDDA
 explicitly as being made of NNA and DDA and has the same glyph.

OK, that's two sources that indicate this. I'll go with that. 

 
 The book is "Learn Oriya in 30 Days", a 150-page introductory grammar in
 the National Integration Language Series.

Thanks for the reference. I've tracked down a copy and it's on its way.



Peter Constable




attachment: sandnya_or_411.png

RE: MS Windows and Unicode 4.0 ?

2003-11-29 Thread Peter Constable
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
 Behalf Of Patrick Andries


 Is there any plan for Microsoft to support Unicode 4.0, distribute with
 its operating system the corresponding fonts, and update the corresponding
 Character Map tools/charts (Office and OS)?

Of course, there is a certain vagueness to the question surrounding the
issue of what it means to say "product X supports Unicode 4.0".

However you understand it, we are making steady progress in that
direction, in that we are continuing to broaden support in all kinds of
services the OS provides. Note that this might mean we'll provide
underlying support for a particular script while fonts and input methods
are supplied from somewhere else.

I don't know when Character Map will be updated. What happens in Office
- e.g. Insert|Symbol - someone else would have to answer, though I think
that the Insert|Symbol dialog follows what's in the selected font for
the characters it shows and for the scripts it lists as subsets
(following the Unicode bitfield in the OS/2 table -- I'd need to do some
testing to be sure).
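
As a rough sketch of how one can inspect that bitfield (assuming the
fontTools library and a hypothetical font path; the actual dialog logic is
internal to Office, so this is only an approximation of what it consults):

    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")  # hypothetical path
    os2 = font["OS/2"]
    # ulUnicodeRange1..4 together form a 128-bit bitfield of Unicode
    # subsets; per the OpenType spec, bit 0 = Basic Latin, etc.
    bits = (os2.ulUnicodeRange1
            | os2.ulUnicodeRange2 << 32
            | os2.ulUnicodeRange3 << 64
            | os2.ulUnicodeRange4 << 96)
    print("Basic Latin flagged:", bool(bits & 1))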



Peter Constable




RE: Oriya: mba / mwa ?

2003-11-29 Thread Peter Constable
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
 Behalf Of Michael Everson


 I think the TDIL chart is wrong.

It seems reasonable that one should need extra persuasion to take the
word of an American living in Ireland over Indians. (Sorry.)

 
 Traditionally (as in "Learn Oriya in 30 Days") subjoined BA is used in
 this context, although the reading rules say to pronounce it [w].

So, you're saying that all of these should be encoded as C + virama +
BA?


 Now an original ligature of O and BA has been pressed into service

I've seen elsewhere that you've described this as a ligature involving
O, but are you sure it's that? Note that the same shape is used for NYA
and NNA (e.g. conjuncts for NN.NNA and SS.NNA).


 The traditional BA should be used for that unless we have better
 evidence than the TDIL newsletter that such should be the practice.

I could be convinced of that; but if people in India aren't convinced of
that, the boat may not float.

 

Peter Constable





RE: Oriya: nndda / nnta?

2003-11-29 Thread Michael Everson
At 12:32 -0800 2003-11-29, Peter Constable wrote:

  "Pronounced as you mean it" here refers to the
  reading rules, not the structure of the script.
 That seems to me to be saying we should be encoding the structure of the
 script (a statement I'd agree with in general).
Sure.

  It can't be a NNTA
  since that would assimilate to NNTTA.
 Wouldn't it be more likely for a nasal to assimilate to an obstruent
 rather than the other way? (We say 'impossible', not 'intossible'.)
The dental t assimilates to the retroflex n.

 But that statement is following phonology, not the structure of the
 script. Your statements seem inconsistent to me.
I'm saying that the syllable NNTA isn't a probable syllable, because 
it would assimilate to NNTTA, while NNDDA is a phonetically normal 
syllable, which is the answer to your question.

 The question is, do we encode something based on its shape, or
 based on the phonemes it represents?
It's Brahmic. We encode according to the characters used to write the 
phonemes. The glyph shape is secondary.

 Following clear cases, the shape is that of TA.
The shape in my source shows the same shape for subjoined TA and DDA.

 NN.TA is phonologically unlikely, though, whereas NN.TTA or NN.DDA
 is phonologically plausible; so, on the other hand, we could say it
 makes little sense to encode NN.TA, and we should encode this as NN.DDA.
That's correct.

 I guess I'd be inclined to go with that reasoning, though I have
 encountered an NN.DDA conjunct that uses a subjoined small DDA in a font
 (see attached); haven't encountered that in texts so far, though.
Well. Where did you encounter it?

  Besides my book gives NNDDA
  explicitly as being made of NNA and DDA and has the same glyph.
 OK, that's two sources that indicate this. I'll go with that.
Good.

  The book is "Learn Oriya in 30 Days", a 150-page introductory grammar in
  the National Integration Language Series.
 Thanks for the reference. I've tracked down a copy and it's on its way.
I'm sure it's in http://www.evertype.com/scriptbib.html
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Need to update Technical Work page

2003-11-29 Thread Doug Ewell
I noticed the following on the Technical Work page on the Unicode Web
site, at http://www.unicode.org/techwork.html:

"The Unicode Standard was the basis for the Universal Character Set,
two-octet form (UCS-2) of ISO/IEC 10646. The Unicode Standard's 65,536
code values are the first 65,536 code values of ISO 10646."

I wonder if this passage is very old, predating the full acceptance of
the surrogate mechanism in the Unicode Standard.  I suggest this text be
revised to avoid perpetuating the common misconception that Unicode is a
16-bit-only standard, or that Unicode and ISO 10646 have different
repertoires.
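
(For reference, the surrogate mechanism in question maps each
supplementary code point to a pair of 16-bit code units; a minimal
sketch of the arithmetic:)

    def to_surrogates(cp):
        # Valid only for supplementary code points U+10000..U+10FFFF.
        assert 0x10000 <= cp <= 0x10FFFF
        cp -= 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    print([hex(u) for u in to_surrogates(0x10000)])  # ['0xd800', '0xdc00']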

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Compression through normalization

2003-11-29 Thread Doug Ewell
Someone, I forgot who, questioned whether converting Unicode text to NFC
would actually improve its compressibility, and asked if any actual data
was available.

Certainly there is no guarantee that normalization would *always* result
in a smaller file.  A compressor that took advantage of normalization
would have to determine whether there would be any benefit.

One extremely simple example would be text that consisted mostly of
Latin-1, but contained U+212B ANGSTROM SIGN and no other characters from
that block.  By converting this character to its canonical equivalent
U+00C5:

* UTF-8 would use 2 bytes instead of 3.
* SCSU would use 1 byte instead of 2.
* BOCU-1 would use 1 or 2 bytes instead of always using 2.
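
This is easy to verify for UTF-8 (a minimal sketch using Python's
unicodedata module; checking the SCSU and BOCU-1 sizes would need
dedicated coders):

    import unicodedata

    angstrom = "\u212b"                   # ANGSTROM SIGN
    nfc = unicodedata.normalize("NFC", angstrom)
    print(hex(ord(nfc)))                  # 0xc5, LATIN CAPITAL LETTER A WITH RING ABOVE
    print(len(angstrom.encode("utf-8")))  # 3
    print(len(nfc.encode("utf-8")))       # 2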

A longer and more realistic case can be seen in the sample Korean file
at:

http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt

This file is in EUC-KR, but can easily be converted to Unicode using
recode, SC UniPad, or another converter.  It consists of 3,317,215
Unicode characters, over 96% Hangul syllables and Basic Latin spaces,
full stops, and CRLFs.  When broken down into jamos (i.e. converting
from NFC to NFD), the character count increases to 6,468,728.

The entropy of the syllables file is 6.729, yielding a Huffman bit
count of 22.3 million bits.  That's the theoretical minimum number of
bits that could be used to encode this file, character by character,
assuming a Huffman or arithmetic coding scheme designed to handle 16- or
32-bit Unicode characters.  (Many general-purpose compression algorithms
can do better.)  The entropy of the jamos file is 4.925, yielding a
Huffman bit count of 31.8 million bits, almost 43% larger.
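
(The entropy figures come straight from the character frequency
distribution; a sketch, assuming the file has already been decoded to a
Python string:)

    import math
    from collections import Counter

    def entropy_bits_per_char(text):
        # Shannon entropy: H = -sum(p * log2(p)) over character frequencies.
        counts = Counter(text)
        n = len(text)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    # The Huffman-style bit count quoted above is then roughly
    # entropy_bits_per_char(text) * len(text).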

When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
than the jamos file by 55%, 17%, and 32% respectively.

General-purpose algorithms tend to reduce the difference, but PKZip
(using deflate) compresses the syllables file to an output 9% smaller
than that of the jamos file.  Using bzip2, the compressed syllables file
is 2% smaller.

So we can at least say that Korean, which can be normalized from NFD to
NFC algorithmically and without the use of long tables of equivalents or
exclusions, can consistently be compressed to a smaller size after such
normalization than before.
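
The algorithm is the standard Hangul composition arithmetic from the
Unicode Standard; a condensed sketch of the L+V(+T) case:

    # Constants from the Unicode Standard's Hangul syllable composition.
    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    V_COUNT, T_COUNT = 21, 28

    def compose(l, v, t=0):
        # l, v, t: code points of leading jamo, vowel jamo, and
        # (optionally) trailing jamo; t == 0 means no trailing consonant.
        l_index, v_index = l - L_BASE, v - V_BASE
        t_index = (t - T_BASE) if t else 0
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

    print(hex(ord(compose(0x1112, 0x1161, 0x11AB))))  # 0xd55c, HAN syllable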

Whether a silent normalization to NFC can be a legitimate part of
Unicode compression remains in question.  I notice the list is still
split as to whether this process changes the text (because checksums
will differ) or not (because C10 says processes must consider the text
to be equivalent).

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




RE: Oriya: mba / mwa ?

2003-11-29 Thread Michael Everson
At 13:17 -0800 2003-11-29, Peter Constable wrote:

  I think the TDIL chart is wrong.

 It seems reasonable that one should need extra persuasion to take
 the word of an American living in Ireland over Indians. (Sorry.)
Peter, I would take those TDIL publications with a very large grain 
of salt. Textual evidence is not given and there's all sorts of 
stuff which really doesn't fit in well with the way we do things in 
Unicode. Like their *U+0B3A ORIYA INVISIBLE LETTER.

Just because it comes from India doesn't mean it's not revisionist.

  Traditionally (as in "Learn Oriya in 30 Days") subjoined BA is used in
  this context, although the reading rules say to pronounce it [w].
 So, you're saying that all of these should be encoded as C + virama + BA?
Yes, I am. KA + BA = KBA, pronounced [kwa]. That's what "Learn Oriya in
30 Days" shows explicitly.
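
In encoded terms (a minimal illustration; the code points are from the
Oriya block of the UCD):

    import unicodedata

    kba = "\u0b15\u0b4d\u0b2c"  # KA + VIRAMA + BA, read [kwa]
    for ch in kba:
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))
    # U+0B15 ORIYA LETTER KA
    # U+0B4D ORIYA SIGN VIRAMA
    # U+0B2C ORIYA LETTER BA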

  Now an original ligature of O and BA has been pressed into service

 I've seen elsewhere that you've described this as a ligature involving
 O, but are you sure it's that?
Yes, I am.

 Note that the same shape is used for NYA
 and NNA (e.g. conjuncts for NN.NNA and SS.NNA).
Be thou not deceived by the glyph shapes. The etymology is O + BA = 
WA, not NYA + BA.

  The traditional BA should be used for that unless we have better
  evidence than the TDIL newsletter that such should be the practice.
 I could be convinced of that; but if people in India aren't convinced of
 that, the boat may not float.
WA is an innovation, unattested in earlier Oriya. You won't find it 
in "Learn Oriya in 30 Days", for instance. Yet syllables in -[wa] have 
been written in Oriya for a long time, with BA.

Note that a historical VA exists and predates the WA, and the TDIL 
does not take this into account. We did encode it, however.

I have just ordered two large Oriya dictionaries which should arrive 
in a fortnight.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Unicode for Windows CE

2003-11-29 Thread Michael (michka) Kaplan
I also sincerely doubt that MSKLC will create keyboards that will work on a
CE device, to tell you the truth. Maybe they do work, but they have never been
tested there, and I would be surprised if they had no problems (never forget
the First Tester's Axiom!).

MichKa [MS]
NLS Collation/Locale/Keyboard Development
Globalization Infrastructure and Font Technologies

- Original Message - 
From: Christopher John Fynn [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Saturday, November 29, 2003 10:35 AM
Subject: Re: Unicode for Windows CE





Brahmic list ? (was: Oriya: mba / mwa ?)

2003-11-29 Thread Philippe Verdy
Michael Everson writes:
 Peter Constable wrote:
 
I think the TDIL chart is wrong.
 
 It seems reasonable that one should need extra persuasion to take 
 the word of an American living in Ireland over Indians. (Sorry.)

Isn't there a specific list for Brahmic scripts? ([EMAIL PROTECTED] ???).

The number of issues with these scripts will explode if Indian sources
start publishing new, undated references for their encoding and conversion
to Unicode, including proposed changes to orthographic rules to better
match the phonology, the tradition, or the inclusion of foreign terms.

SIL.org is also working quite actively in this area, in connection with a
proposed extension of the UTR 22 transcoding reference. But I'd like to see
discussions about proposed UTR 22 changes stay on the main Unicode list.

There are not many issues with Thai, as it has long been standardized in
TIS-620, which was the basis of the Unicode encoding (though regrettably
before UTR 22 was produced, which would have allowed a better logical
encoding without needing lexical dictionaries to parse Thai text). Semantic
analysis of Thai text is an interesting issue in itself, but not for the
correct way to encode Thai words (the TIS-620 rules are clear, as the
standard mostly encodes glyphs, expecting that readers will interpret the
written text using their knowledge of the language). So Thai discussions
can remain on the main list.

I also think that Tibetan issues should be discussed on that list, even
though its composition model is very different from that of the Brahmic
scripts of India, unless there's a specific rapporteur group for it.

But not Han issues, which should be discussed in their own list, in
connection with the IRG working group (which already works on its own
technical reports as well as the standardization of the extended
repertoire).

The recent issues I have read seem to multiply the number of Brahmic
conjuncts we have to deal with, possibly in relation to new normalization
forms (beyond NFC and NFD); as with Hebrew, there's probably a need for
work on these scripts on a separate discussion list, with the aim of
producing a technical report in accordance with Indian sources. Other
related South Asian scripts should be there too: Lao, Khmer...

My recent work with UCA and collation, as well as with UTR 22 and
phonological analysis of many texts, tends to promote the idea of new
normalization forms in all areas where NFC/NFD or even NFKC/NFKD fail. (We
can't change those forms due to the stability pact, but UCA and collation
in general seem to create a new coded character set, made of ordered
collation weights belonging to separate ranges for each collation level,
these ranges being sorted in the reverse order of the collation levels.)
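
(Roughly, a UCA-style sort key concatenates the weights level by level,
so later levels only break ties; a toy sketch with made-up weights, not
the real DUCET:)

    # Hypothetical 3-level collation elements: char -> (primary, secondary, tertiary).
    TABLE = {"a": (0x15EF, 0x20, 0x02),
             "A": (0x15EF, 0x20, 0x08),   # differs from "a" only at level 3
             "b": (0x1605, 0x20, 0x02)}

    def sort_key(s):
        elems = [TABLE[c] for c in s]
        key = []
        for level in range(3):
            key.extend(e[level] for e in elems)
            key.append(0)                 # level separator
        return tuple(key)

    assert sort_key("ab") < sort_key("Ab")  # case decided only at the last level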

I've experimented with a collation algorithm that implements UCA using the
same system as the UCD decompositions, but with added (and sometimes
modified) decompositions. This system creates new code points needed to
represent only font compatibility differences, ligatures, or alternate
forms; each existing compatibility character decomposes into more basic
characters exposed with primary differences in UCA, plus these new
characters, which are given variable collation weights and may be ignorable
in applications that ignore the extra levels. This encoding uses a 31-bit
code space, which is still highly compressible yet representable with the
UTF-8 transformation scheme (though the values are not Unicode code points)
or a similar ad hoc representation.

I am currently trying to adapt this system to work with UTR 22
transcodings, and I am testing it against Brahmic scripts, Hebrew, and
Latin. This is very promising, and my next step will be to handle
decomposition of Han characters into their component radicals and strokes. I
do think it is possible to handle almost all UCA and UTR 22 rules by
using UTR 22 itself and decomposition rules in a simple table closely
matching the format of the UCD.

But all these discussions and encoding ambiguities around the Brahmic
scripts are polluting my work. I am quite close to setting aside my current
work on them until some agreement is found, notably in a revision of ISCII,
if one is in preparation, that will give more precise rules. For now it is
impossible for me to adapt my model to the (sometimes contradictory)
encoding solutions proposed by different people.



Re: Brahmic list ? (was: Oriya: mba / mwa ?)

2003-11-29 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote:

 I've experimented with a collation algorithm that implements UCA using
 the same system as the UCD decompositions, but with added (and
 sometimes modified) decompositions. This system creates new code
 points needed to represent only font compatibility differences,
 ligatures, or alternate forms; each existing compatibility character
 decomposes into more basic characters exposed with primary differences
 in UCA, plus these new characters, which are given variable collation
 weights and may be ignorable in applications that ignore the extra
 levels. This encoding uses a 31-bit code space, which is still highly
 compressible yet representable with the UTF-8 transformation scheme
 (though the values are not Unicode code points) or a similar ad hoc
 representation.

Please don't use UTF-8 to encode anything other than Unicode code
points.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Brahmic list ? (was: Oriya: mba / mwa ?)

2003-11-29 Thread Christopher John Fynn

 Philippe Verdy [EMAIL PROTECTED] wrote:

 I also think that Tibetan issues should be discussed on that list, even though
 its composition model is very different from that of the Brahmic scripts of
 India, unless there's a specific rapporteur group for it.

There already is a specific list for Tibetan script issues:
 [EMAIL PROTECTED]