Mark *— Il meglio è l’inimico del bene —*
On Thu, Jul 29, 2010 at 05:57, Philippe Verdy <verd...@wanadoo.fr> wrote:

> "Martin J. Dürst" <due...@it.aoyama.ac.jp> wrote:
> >
> > On 2010/07/29 13:33, karl williamson wrote:
> > > Asmus Freytag wrote:
> > >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
> >
> > >>> Well, there actually is such a script, namely Han. The digits (一、
> > >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
> > >>> decimal place-value digits, and they are scattered widely, and of
> > >>> course there is a lot of modern living practice.
> >
> > >> The situation is worse than you indicate, because the same characters
> > >> are also used as elements in a system that doesn't use place-value,
> > >> but uses special characters to show powers of 10.
> >
> > No. Sequences of numeric Kanji are also used in names and word-plays,
> > and as sequences of individual small numbers.
>
> (1) Existing exception:
>
> There's one example of a digit which has a numeric type = decimal, AND
> is encoded in a "scattered" way:
>
> 19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N
>
> The other nine decimal digits for the Tham variant of the New Tai Lue
> digits are borrowed from another sequence of decimal digits, starting
> at U+19D0 (for digit zero) with the exception of U+19D1 which is
> replaced (for digit one). Both sets are assigned in the same
> "New_Tai_Lue" script property value.
>
> So the additional stability proposal will not be enforceable.

On the contrary. Were we to want such a policy, the implication would be
either to:

(a) change the type of 19DA from Nd to No (which I think would be the
    right thing to do)
(b) grandfather in the character.
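Whether any other Nd character would run afoul of such a policy can be
checked mechanically. What follows is only a rough sketch, assuming
Python's unicodedata module, whose tables follow whatever Unicode version
the interpreter ships with (not necessarily 5.2, so U+19DA may or may not
be reported). The helper names nd_runs and is_well_formed are made up for
illustration.

```python
import sys
import unicodedata


def nd_runs(code_points):
    """Group the Nd code points of an ascending iterable into contiguous runs."""
    runs = []
    for cp in code_points:
        if unicodedata.category(chr(cp)) != "Nd":
            continue
        if runs and runs[-1][-1] == cp - 1:
            runs[-1].append(cp)   # extends the current run
        else:
            runs.append([cp])     # starts a new run
    return runs


def is_well_formed(run):
    """True if the run consists only of complete, ascending 0..9 sequences."""
    # -1 is returned for any code point without a decimal value.
    values = [unicodedata.decimal(chr(cp), -1) for cp in run]
    return len(run) % 10 == 0 and all(v == i % 10 for i, v in enumerate(values))


if __name__ == "__main__":
    # Scan every code point and list the runs that break the pattern.
    irregular = [r for r in nd_runs(range(sys.maxunicode + 1))
                 if not is_well_formed(r)]
    for run in irregular:
        print(" ".join(f"U+{cp:04X}" for cp in run))
    print(f"{len(irregular)} irregular run(s) in Unicode "
          f"{unicodedata.unidata_version}")
```

A run whose length is a multiple of 10 is accepted so that blocks holding
several digit sets back to back are not flagged.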
> (2) Arabic digits:
>
> Such a case was avoided for the Eastern/Extended variant of Arabo-Indic
> digits in U+06F0..U+06F9, without borrowing the common forms for the
> Standard variant in U+0660..U+0669: they were reencoded separately to
> create a complete sequence of 10 digits, even if most of them (all
> except 4 to 6) are exactly similar and belong to the same unified
> "script".
>
> But what is even more "strange" is that the Standard Arabic digits are
> assigned to the "Common" script, when the Eastern/Extended variant is
> assigned to the "Arabic" script (look at the Unicode script property
> value, from the file "Scripts-5.2.0.txt" in the UCD).
>
> If you just look at this property, you may think that the
> Extended/Eastern digits are the standard ones for the Arabic script:
> this is a side-effect of unification of Western and Eastern variants
> of the Arabic script.

It is not so strange. Read
http://www.unicode.org/reports/tr24/proposed.html#Multiple_Script_Values,
and other parts of #24 describing Common.

> (3) Unification of the Arabic script:
>
> Ideally, there should be two additional separate ISO 15924 script
> codes for the Western and Eastern variants of the Arabic script (possibly
> [Arbs] for Standard/Western, and [Arbx] for Extended/Eastern), and the
> Unicode "script" property value alias for the Western and Eastern
> digits or letters should be segregated, using a separate Script
> property value (splitting the Arabic script, where it is significant,
> just like it occurred for Georgian and Greek/Coptic alphabets).

There is no likelihood of that happening, simply for the sake of these
digits. The original characters were just font variants; they were really
split to a large extent because of the UBA (which I think in retrospect
was a mistake, but c'est la vie, n'est-ce pas?).
> Nothing will be changed for the existing Arabic script, but the
> "Extended/Eastern Arabic" script (assigned with a new ISO 15924 code
> and mapped with a new property alias in Unicode), will still borrow
> most of its letters from the standard script without reencoding them.
>
> No character or block will be renamed (and I DO NOT propose
> disunifying existing common Arabic letters, or assigning them in the
> "Common" script), it should just be a better sub-classification, where
> the characters are clearly distinguished between the two variants.
>
> Most Arabic characters should remain in the common "Arabic" script,
> and those that are differentiated should be assigned in a
> "Standard_Arabic" or "Extended_Arabic" script. But this may cause some
> complication for the script inheritance in spans of texts (because the
> "Arabic" script property value would behave a bit like what the
> "Common" does for alphabetic scripts, i.e. like a group of scripts).
>
> Such a change for the assigned script property value (if it's not
> already stabilized) would require documentation, and changes in a few
> other core or derived datafiles:
>
> - PropertyValueAliases.txt (adding two new property values for "sc"):
>     sc ; Arab ; Arabic # (all forms; includes "sc=Arbc", "sc=Arbs" and
>     "sc=Arbx" in regexps)
>     sc ; Arbc ; Common_Arabic
>     sc ; Arbs ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
>     sc ; Arbx ; Extended_Arabic # (also includes "sc=Arbc" in regexps)
>
> - Script.txt (assigning the two new property values to remap existing
>   "Arabic")
> - Arabic-Shaping.txt (possibly adding comments at end of lines where
>   this is not the Common Arabic)
> - Joining-Groups.txt (same remark)
> - Bidi-Mirroring.txt (same remark)
>
> And in the description of some standard script identification and
> segmentation algorithms. I don't know if IDNA should continue to use
> "Arab" (all forms) or if it should segregate "Arbs" and "Arbx" (to
> avoid mixing digits that are visually confusable), as it uses such
> segmentation (note that these characters are canonically different,
> for normalization purposes).
>
> Philippe.
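
The digit-mixing concern, at least, does not depend on any script split:
it can be tested by code point range alone. A minimal sketch under that
assumption follows. The function name is hypothetical, and this is not
the IDNA algorithm itself, only the one check on the two digit ranges
named above.

```python
ARABIC_INDIC = range(0x0660, 0x066A)            # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)   # U+06F0..U+06F9


def mixes_arabic_digit_sets(label: str) -> bool:
    """True if the label contains digits from both ranges at once."""
    has_standard = any(ord(ch) in ARABIC_INDIC for ch in label)
    has_extended = any(ord(ch) in EXTENDED_ARABIC_INDIC for ch in label)
    return has_standard and has_extended


if __name__ == "__main__":
    # One digit from each set in the same label.
    print(mixes_arabic_digit_sets("\u0661\u06F2"))
    # Two digits from the standard set only.
    print(mixes_arabic_digit_sets("\u0661\u0662"))
```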