Mono.Globalization.Unicode

Atsushi Enomoto ([EMAIL PROTECTED]) Tue, 26 Apr 2005 09:55:58 -0700

Author: atsushi
Date: 2005-04-26 12:21:11 -0400 (Tue, 26 Apr 2005)
New Revision: 43600


Modified:
   branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/ChangeLog
   branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt
Log:
2005-04-26  Atsushi Enomoto  <[EMAIL PROTECTED]>

        * Collation-notes.txt : more updates.



Modified: branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/ChangeLog
===================================================================
--- branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/ChangeLog      2005-04-26 15:30:47 UTC (rev 43599)
+++ branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/ChangeLog      2005-04-26 16:21:11 UTC (rev 43600)
@@ -1,5 +1,9 @@
 2005-04-26  Atsushi Enomoto  <[EMAIL PROTECTED]>
 
+       * Collation-notes.txt : more updates.
+
+2005-04-26  Atsushi Enomoto  <[EMAIL PROTECTED]>
+
        * Collation-notes.txt : some updates.
        * create-mapping-char-source.cs : superscripts and subscripts are also
          ignored in IgnoreWidth comparison.

Modified: branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt
===================================================================
--- branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt    2005-04-26 15:30:47 UTC (rev 43599)
+++ branches/atsushi/mcs/class/corlib/Mono.Globalization.Unicode/Collation-notes.txt    2005-04-26 16:21:11 UTC (rev 43600)
@@ -1,11 +1,11 @@
-String collation
+* String collation
 
-* Summary
+** Summary
 
        We are going to implement Windows-like collation, apart from ICU which
        is conformant to Unicode specifications.
 
-* CompareInfo members
+** How to implement CompareInfo members
 
        GetSortKey()
                Compute sort key for every characters into byte[].
@@ -25,7 +25,7 @@
                Find first match and process comparison to the end of the string to find
 
-* CompareOptions support
+** How to support CompareOptions
 
        There are two kind of "ignorance" : ignorance which acts as stripper,
        and ignorance acts as normalizer.
@@ -44,7 +44,7 @@
        For LCID 101/1125(div), '\ufdf2' is completely ignorable.
        This rule even applies to CompareOptions.None.
 
-** Normalizers
+*** Normalizers
 
        IgnoreCase
                Maybe culture-dependent TextInfo.ToLower() could be used.
@@ -59,7 +59,7 @@
                ToWidthInsensitive(), which is likely to be culture
                independent. See also "Notes".
 
-** Strippers
+*** Strippers
 
        I already wrote all the required strippers which should be MS
        compatible (at least with .NET 1.1 invariant culture).
@@ -79,29 +79,22 @@
                        LCID 17/1041(ja) : 2015
                        LCID 90/1114(syr) : 64b, 652
 
-** StringSort
+*** StringSort
 
        Maybe use additional tailoring rule which says that non-alphabetic
        characters does not take precedence.
 
-* CharacterIterator
+** ICU and UCA
 
-       The match evaluation could not be done only one character - the longest
-       possible sequence of characters in the tailored table (e.g. "ch" 
-       in Spanish) should be examined.
+       First to note: we won't use collation element table from unicode.org.
 
-* Collation element table tailoring
+*** Collation element table tailoring
 
-       Deprecated; We won't use collation element table from unicode.org.
+       To understand why we don't use collation element table from UCA, you
+       can try to compare "A" and "a" in the invariant culture.)
 
-       We will contain only the default element table and Chinese table.
-       (Japanese might be added too, since CLDR contains a large table for it)
+** Notes
 
-       Other rules are always "evaluated"; no physical expansion is done to
-       the table loaded in memory (it's too wasting).
-
-* Notes
-
        Since UCA Level 3 handles both casing and width, it is impossible to
        use UCA variables for IgnoreWidth, at least with the default element
        table. And IgnoreKanaType cannot be handled without case and width
@@ -117,13 +110,15 @@
        Myanmar, Mongolian, Cherokee, Etiopic, Tagalog, Khmer, are regarded as
        "completely ignorable".
 
-* MS collation design inference
+** MS collation design inference
 
 ** sort key format
 
        00 means the end of sort key.
        01 means the end of the level.
        02-FF means the value.
+       If less than or equal to 2 in followings in a level, then the sequence
+       of the level is terminated (1). 2 is the default.
 
        There are 5 levels.
 
@@ -134,11 +129,61 @@
        - level 4: kana type (mostly at primary category 22)
        - level 5: control characters etc.
 
-** default
+** sort key table
 
-       So the problem is, how to detect diacritic. Maybe they are combined
-       similarly to what is specified in UCA.
+       Here is the simple sortkey dumper:
 
+       public static void Main (string [] args)
+       {
+               CultureInfo culture = args.Length > 0 ?
+                       new CultureInfo (args [0]) :
+                       CultureInfo.InvariantCulture;
+               CompareInfo ci = culture.CompareInfo;
+               for (int i = 0; i < char.MaxValue; i++) {
+                       string s = new string ((char) i, 1);
+                       if (ci.Compare (s, "") == 0)
+                               continue; // ignored
+                       byte [] data = ci.GetSortKey (s).KeyData;
+                       foreach (byte b in data) {
+                               Console.Write ("{0:X02}", b);
+                               Console.Write (' ');
+                       }
+                       Console.WriteLine (" : {0:X}, {1} {2}",
+                               i,
+                               Char.GetUnicodeCategory ((char) i),
+                               data [2] != 1 ? '!' : ' ');
+               }
+       }
+
+*** Combined characters
+
+       Some latin+diaeresis sequences are regarded as a single character for
+       each.
+
+       Maybe they are combined similarly to what is specified in UCA.
+
+*** Expanded characters
+
+       Some characters are expanded to two or more characters:
+
+       C6 (AE), E6 (ae), 1F1-1F3 (dz), 1C4-1C6 (Dz), FB00-FB06 (ff, fi),
+       132-133 (IJ), 1C7-1C9 (LJ), 1CA-1CC (NJ), 152-153 (OE),
+       DF (ss), FB06 (st), FB05 (\u017Ft), FE, DE, 5F0-5F2,
+       1113-115F (hangul)
+       (CJK extension is not really expanded)
+
+       They don't match with any of Unicode normalization.
+
+       Some alphabetic cultures have different mappings, but mostly small
+       (at least da-DK, lt-LT, fr-FR, es-ES have tiny differences).
+
+       Invariant culture also puts Czech unique character \u0161 between s
+       and t, unlike described here:
+       http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp
+
+       zh-CHS, ko-KR and ja-JP have very different CJK mapping for each
+       (but might be just a matter of computation formula differences).
+
 *** sort order categories
 
        1 (0) specially ignored ones (Japanese, Tamil, Thai)
@@ -160,58 +205,73 @@
        2.4 Arabic spacing and equivalents (64B-651, FE70-FE7F)
          They are part of nonspacing mark, but not equal.
 
-       2.5 Nonspacing marks mixed
+       3 (1) Nonspacing marks mixed
          30D, 591-5C2, Mn:981-A3C, A4D, A70, A71, ABC, ABD ...
 
-       3 (7) space separators and some kind of marks
+       4 (7) space separators and some kind of marks
 
-       3.1 whitespaces, paragraph separator etc.
+       4.1 whitespaces, paragraph separator etc.
+         (White_Space in PropList.txt)
 
-       3.2 other marks ('!', '^', ...)
+       4.2 other marks ('!', '^', ...)
 
-       4 (8) mathmatical symbols
+       5 (8) mathmatical symbols
 
-       5 (9) some other symbols
+       6 (9) some other symbols
 
-       6 (A) punctuations
+       7 (A) punctuations
 
-       7 (C) numbers
+       8 (C) numbers
 
-       8 (E) latin letters (alphabets)
+       9 (E) latin letters (alphabets)
+         upper is 18, lower is 2 (default), diacritics are 19 or more.
 
-       9 (F) greek letters
+       10 (F) greek letters
 
        ...
 
           (21) georgian letters
 
-       13 (22) japanese kana letters and symbols
+       11 (22) japanese kana letters and symbols
 
-       14 (23) bopomofo letters
+       12 (23) bopomofo letters
 
-       15 (24) syriac letters
+       13 (24) syriac/thaana letters
 
-       16 (41-45) surrogate Pt.1
+       14 (41-45) surrogate Pt.1
 
-       17 (52-7E) hangul
+       15 (52-7E) hangul, mixing combined ones
+          52 02 .. 7E C8
 
-       18 (9E-FE) CJK (kangxi etc.), PrivateUse mixed, surrogate Pt.2
+       16 (9E-FE) CJK (kangxi etc.), PrivateUse mixed, surrogate Pt.2
+          9E 02 .. FE C1
 
-       19 (FE) CJK extensions (3400-)
+       17 (FE) CJK extensions (3400-)
+          FE FF 10 02 .. FE FF 29 E9
 
-       20 (FF) Some supplemental Japanese/Arabic marks
+       18 (FF) Some supplemental Japanese/Arabic marks
 
-** Traditional Spanish
 
-       It has some combined characters as a unique character (like 'll').
+** Mono implementation plans
 
-** Czech
+*** sort key element table
 
-       Invariant culture also puts Czech unique character \u0161 between s
-       and t, unlike described here:
-       http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24C0.asp
+       We will contain our own collation element table which will be closer
+       to the one from Windows.
 
-** Other locales
+       Culture-dependent rules are always "evaluated"; no physical expansion
+       is done to the table loaded in memory (it's waste of memory).
 
-       There are some character reorderings.
+*** CharacterIterator
 
+       The match evaluation could not be done char by char - the longest
+       possible sequence of characters in the tailored table (e.g. "ch" 
+       in Spanish) should be examined. It will be like non-NFD detection.
+
+
+*** Reference materials
+
+       Developing International Software for Windows 95 and Windows NT
+       Appendix D Sort Order for Selected Languages
+       http://www.microsoft.com/globaldev/dis_v1/disv1.asp?DID=dis33d&File=S24BF.asp
+

