Internationalization API: Collator sensitivity

Norbert Lindenberg Wed, 16 May 2012 14:35:08 -0700

In recent discussions between Markus Scherer, Nebojša Ćirić, Mark Davis 
(Google), Eric Albright (Microsoft) and myself, a few issues around the 
sensitivity option of the Collator constructor in the ECMAScript 
Internationalization API [1, section 11.3.2] have come up. It would be good to 
get input from a wider audience.


1) The "variant" sensitivity: This name isn't very descriptive. When "variant" 
is selected, a collator has to take into account every difference between 
input strings that it considers at the "case" and "accent" levels; it may also 
consider additional differences. The following new names have been proposed:
- "accent+case": mnemonic, but doesn't indicate that additional differences may 
be considered.
- "common"
- "normal"
- "default": doesn't work because it's not actually the default in all cases.
- "full": doesn't work because implementations aren't required to take all 
differences into consideration.
- "distinct", "dissimilar", "varied", "inherent", "intrinsic", "essential": not 
really descriptive.

An alternative would be to use the terminology of the Unicode Collation 
Algorithm [2] even though implementations do not have to follow that spec, so 
there would be sensitivity values "primary", "primary+caseLevel", "secondary", 
"tertiary", "quaternary", "identical". The problem here is that implementations 
may not actually have all these levels. The current "variant" can fall anywhere 
between "tertiary" and "identical".

I'm leaning towards renaming "variant" to "accent+case", with a note "Other 
differences, such as those between hiragana and katakana, may compare as 
unequal as well.".
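
To make the four current sensitivity values concrete, here is a minimal 
sketch. It assumes the constructor is exposed as Intl.Collator and accepts the 
draft's values "base", "accent", "case" and "variant"; the helper name 
distinguishes and the sample letters are only illustrative.

```javascript
// Sketch: which differences does each sensitivity value take into account?
// Assumes Intl.Collator with the option values "base", "accent", "case", "variant".
const sensitivities = ["base", "accent", "case", "variant"];

// Hypothetical helper: true if the collator treats x and y as unequal.
function distinguishes(sensitivity, x, y) {
  const collator = new Intl.Collator("en", { sensitivity });
  return collator.compare(x, y) !== 0;
}

const table = {};
for (const s of sensitivities) {
  table[s] = {
    caseDiffers: distinguishes(s, "a", "A"),   // case difference only
    accentDiffers: distinguishes(s, "a", "á"), // accent difference only
  };
}

console.log(table);
```

With an English collator I would expect "base" to ignore both differences, 
"accent" and "case" to each honor only their own, and "variant" to honor both. 
Whatever else "variant" distinguishes is left to the implementation, which is 
the point of the naming question.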

2) The description of the sensitivity values seems to use the term "width" for 
the difference between the hiragana characters あ and ぁ. In the usage of the 
Unicode standard, these two characters differ in size (normal versus small), 
while "width" refers to the difference between normal and full-width Latin 
characters such as A and Ａ, or between normal and half-width katakana 
characters such as ア and ｱ (katakana characters also have small variants such 
as ァ).

Implementations don't agree on their interpretation of these differences:
- あ vs ぁ is interpreted as either a difference in case or a difference in 
accent.
- あ vs ア vs ｱ is locale dependent in ICU.

My proposed resolution: Remove references to width and the comparison of あ and 
ぁ from the spec.
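
Anyone who wants to check how a given implementation treats these pairs could 
run a probe along the following lines. This is only a sketch: it assumes 
Intl.Collator is available, the function name probe and the choice of the "ja" 
locale are mine, and no particular result is implied since the answers are 
exactly what implementations disagree on.

```javascript
// Sketch: report which kana and width pairs compare as equal at each
// sensitivity. Results are implementation and locale dependent.
const pairs = [
  ["あ", "ぁ"], // normal vs small hiragana
  ["A", "Ａ"],  // normal vs full-width Latin
  ["ア", "ｱ"],  // normal vs half-width katakana
  ["あ", "ア"], // hiragana vs katakana
];

// Hypothetical helper: one row per (sensitivity, pair) combination.
function probe(locale) {
  const rows = [];
  for (const sensitivity of ["base", "accent", "case", "variant"]) {
    const collator = new Intl.Collator(locale, { sensitivity });
    for (const [x, y] of pairs) {
      rows.push({
        sensitivity,
        pair: `${x} vs ${y}`,
        equal: collator.compare(x, y) === 0,
      });
    }
  }
  return rows;
}

const report = probe("ja");
console.log(report);
```

Running the same probe with different locales and in different engines should 
show where the interpretations diverge.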

3) The term "accent" is too narrow - differences in other diacritics should be 
considered along with accents. When mentioning diacritics, however, it becomes 
necessary to clarify that some languages treat some characters with diacritics 
as base letters.

My proposed resolution: Keep "accent" as the value of the sensitivity option, 
but add "or other diacritics" to "accent" in the descriptions. Add a note: "In 
some languages, some characters with diacritics sort as separate base letters. 
For example, Swedish treats 'å', 'ä' and 'ö' as base letters separate from 'a' 
and 'o'."
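
The Swedish case in the proposed note can be illustrated with a short sketch. 
It assumes Intl.Collator is available and that the implementation carries 
Swedish and German collation data; the variable names are illustrative.

```javascript
// Sketch: the same letter pair under Swedish and German collation rules.
// Assumes the implementation supports the "sv" and "de" locales; if it does
// not, it falls back to a default locale and the results may differ.
const swedish = new Intl.Collator("sv", { sensitivity: "base" });
const german = new Intl.Collator("de", { sensitivity: "base" });

// At "base" sensitivity only base-letter differences count.
const svSeparate = swedish.compare("a", "ä") !== 0; // separate base letters in Swedish
const deSeparate = german.compare("a", "ä") !== 0;  // same base letter in German

// Sorting shows where each language places the letter in its alphabet.
const svOrder = ["z", "ä", "a"].sort(new Intl.Collator("sv").compare);
const deOrder = ["z", "ä", "a"].sort(new Intl.Collator("de").compare);

console.log({ svSeparate, deSeparate, svOrder, deOrder });
```

In ICU-based engines I would expect Swedish to sort "ä" after "z" and German 
to sort it between "a" and "z", which is why the note about diacritics needs 
the qualification about base letters.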

Comments?

Norbert

[1] http://wiki.ecmascript.org/doku.php?id=globalization:specification_drafts
[2] http://unicode.org/reports/tr10/
