In recent discussions between Markus Scherer, Nebojša Ćirić, Mark Davis (Google), Eric Albright (Microsoft) and myself, a few issues around the sensitivity option of the Collator constructor in the ECMAScript Internationalization API [1, section 11.3.2] have come up. It would be good to get input from a wider audience.
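
For readers who haven't used the option, here is a minimal sketch of what the four sensitivity values distinguish. It is written against the Intl.Collator constructor as exposed by current engines, which is an assumption on my part; the draft in [1] may spell things differently. The helper name distinguishes is made up for illustration.

```javascript
// Sketch only: assumes an engine that exposes Intl.Collator with the
// "sensitivity" option values "base", "accent", "case" and "variant".
// `distinguishes` is a hypothetical helper, not part of any API.
function distinguishes(sensitivity, x, y) {
  const collator = new Intl.Collator("en", { sensitivity });
  // compare() returns 0 when the collator treats the strings as equal.
  return collator.compare(x, y) !== 0;
}

// For each sensitivity, record whether an accent difference (a vs á)
// and a case difference (a vs A) make the strings compare as unequal.
const rows = ["base", "accent", "case", "variant"].map((s) => ({
  sensitivity: s,
  accent: distinguishes(s, "a", "á"),
  case: distinguishes(s, "a", "A"),
}));

console.table(rows);
```

With "variant", both columns come out true; whether anything beyond accent and case is distinguished is exactly what the current name fails to convey.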

1) The "variant" sensitivity: This name isn't very descriptive. When "variant" is selected, a collator has to take into account all differences between input strings that it considers at the "case" and "accent" levels; it may consider additional differences.

New names have been proposed:
- "accent+case": mnemonic, but doesn't indicate that additional differences may be considered.
- "common"
- "normal"
- "default": doesn't work because it's not actually the default in all cases.
- "full": doesn't work because implementations aren't required to take all differences into consideration.
- "distinct", "dissimilar", "varied", "inherent", "intrinsic", "essential": not really descriptive.

An alternative would be to use the terminology of the Unicode Collation Algorithm [2] even though implementations do not have to follow that spec, so there would be sensitivity values "primary", "primary+caseLevel", "secondary", "tertiary", "quaternary", "identical". The problem here is that implementations may not actually have all these levels. The current "variant" can fall anywhere between "tertiary" and "identical".

I'm leaning towards renaming "variant" to "accent+case", with a note "Other differences, such as those between hiragana and katakana, may compare as unequal as well.".

2) The description of the sensitivity values seems to use the term "width" for the difference between the hiragana characters あ and ぁ. In the usage of the Unicode standard, these two characters are normal and small, while "width" refers to the difference between normal and full-width Latin characters such as A and Ａ, or normal and half-width katakana characters such as ア and ｱ (katakana characters also have small variants such as ァ).

Implementations don't agree on their interpretation of these differences:
- あ vs ぁ is interpreted as either a difference in case or a difference in accent.
- あ vs ア vs ｱ is locale dependent in ICU.
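
The disagreement described in 2) can be observed directly. The following is a rough probe, assuming an engine that exposes Intl.Collator; the function name probe, the choice of pairs and the locales "en" and "ja" are mine, for illustration only. I'm deliberately not listing expected output, since the results depend on the implementation and its locale data.

```javascript
// Sketch only: assumes Intl.Collator is available. `probe` and `pairs`
// are hypothetical names introduced for this illustration.
const pairs = [
  ["あ", "ぁ"], // normal vs small hiragana
  ["あ", "ア"], // hiragana vs katakana
  ["ア", "ｱ"], // normal vs half-width katakana
  ["A", "Ａ"], // normal vs full-width Latin
];

// For one locale, report per sensitivity value whether each pair
// compares as unequal (true) or equal (false).
function probe(locale) {
  const result = {};
  for (const sensitivity of ["base", "accent", "case", "variant"]) {
    const collator = new Intl.Collator(locale, { sensitivity });
    result[sensitivity] = pairs.map(([x, y]) => collator.compare(x, y) !== 0);
  }
  return result;
}

for (const locale of ["en", "ja"]) {
  console.log(locale, JSON.stringify(probe(locale)));
}
```

Running this on different engines, or with different locales, should show where あ vs ぁ ends up (case-like or accent-like) in each of them.
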
My proposed resolution: Remove references to width and the comparison of あ and ぁ from the spec.

3) The term "accent" is too narrow: differences in other diacritics should be considered along with accents. When mentioning diacritics, however, it becomes necessary to clarify that some languages treat some characters with diacritics as base letters.

My proposed resolution: Keep "accent" as the value of the sensitivity option, but add "or other diacritics" to "accent" in the descriptions. Add a note: "In some languages, some characters with diacritics sort as separate base letters. For example, Swedish treats 'å', 'ä' and 'ö' as base letters separate from 'a' and 'o'."

Comments?

Norbert

[1] http://wiki.ecmascript.org/doku.php?id=globalization:specification_drafts
[2] http://unicode.org/reports/tr10/

