Re: [classlib][text] regression in text module, a non-bug difference?

Tony Wu Wed, 20 Feb 2008 22:01:24 -0800

A little further study.

The collation is defined in CLDR. Please refer to the data in locale
"es" [1]. There is a block describing the traditional collation. I
quote a part of it below[2]. Let me try to explain a little bit about
this definition.


First, the term "traditional" is explicitly defined. You can also find
the definition in UTS#35[3] which says "For a traditional-style sort
(as in Spanish) ".

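Second, the data[2] indicates that the rule in the traditional Spanish
locale should be ... C<ch<<<Ch<<<CH. The tag <p> means "primary", which
is to say that "ch" is a base character of its own, sorted after "C".
The <t> tags make "Ch" and "CH" tertiary (case) variants of it.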
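
To make the effect of such a primary contraction concrete, here is a
minimal sketch using only java.text.RuleBasedCollator. The two rule
strings are toy alphabets I made up for illustration (they are NOT the
CLDR data), and the class and method names are hypothetical. It assumes
the runtime's RuleBasedCollator handles contractions as described in its
Javadoc.

```java
import java.text.CollationElementIterator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class ChContractionSketch {
    // Toy alphabets (assumption, not CLDR): one with "ch" as its own
    // primary key right after "c", one without it.
    static final String WITH_CH = "< a < c < ch < d < h < z";
    static final String WITHOUT_CH = "< a < c < d < h < z";

    // Build a collator from a rule string, hiding the checked exception.
    static RuleBasedCollator collator(String rules) {
        try {
            return new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad rules: " + rules, e);
        }
    }

    // Count the collation elements the iterator yields for the text.
    static int countElements(String rules, String text) {
        CollationElementIterator it =
                collator(rules).getCollationElementIterator(text);
        int n = 0;
        while (it.next() != CollationElementIterator.NULLORDER) {
            n++;
        }
        return n;
    }

    // Compare two strings under the given rules.
    static int compare(String rules, String a, String b) {
        return collator(rules).compare(a, b);
    }

    public static void main(String[] args) {
        System.out.println("with ch:    " + countElements(WITH_CH, "cha")
                + " keys, compare(cha, cz) = " + compare(WITH_CH, "cha", "cz"));
        System.out.println("without ch: " + countElements(WITHOUT_CH, "cha")
                + " keys, compare(cha, cz) = " + compare(WITHOUT_CH, "cha", "cz"));
    }
}
```

With the contraction in the rules, "cha" should yield one key for "ch"
plus one for "a", and "cha" should sort after "cz". Without it, "c" and
"h" are separate keys and "cha" sorts before "cz".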

The conclusion is that there IS a traditional Spanish collation rule
which has a key "ch". The question is: is it necessary for Harmony to
support it, or should Harmony just behave the same as the RI?

[1]
http://www.unicode.org/repository/*checkout*/cldr/common/collation/es.xml?rev=1.21

[2]
<collation type="traditional">
  <rules>
...
  <reset>C</reset>
  <p>ch</p>
  <t>Ch</t>
  <t>CH</t>
...
  </rules>
</collation>

[3]
http://www.unicode.org/reports/tr35/


On 2/20/08, Alexei Zakharov <[EMAIL PROTECTED]> wrote:
> ¡Buenos días!
>
> :) No, I'm not an expert in Spanish. But after reading your post I got
> the impression that we support an additional variant of the Spanish
> language compared to RI. However, I tried to find something about the
> traditional Spanish variant in the ICU locale browser and found nothing.
> I believe we should learn more about this problem before making any
> decision.
>
> Regards,
> Alexei
>
> 2008/2/19, Tony Wu <[EMAIL PROTECTED]>:
> > Hi, all
> >
> > I'm investigating the regression[1] in the text module. Actually these
> > 5 failures come down to one cause: the support of the traditional
> > Spanish character "ch". Following is my understanding.
> >
> > My fix for HARMONY-5465 makes Locale.toString() compatible with RI.
> > Before my commit, the toString() of a Locale with an empty "country"
> > field had only one underscore in the output, but RI has two. For
> > instance, new Locale("es","","TRADITIONAL").toString() returned
> > "es_TRADITIONAL" in Harmony whereas it returns "es__TRADITIONAL" in RI.
> > Interestingly, ICU uses the output of toString() as the keyword to
> > look up its own Locale instance. That is to say, the 5 testcases passed
> > before because they were not actually run in the real traditional
> > Spanish locale, so the character "ch" was interpreted as two separate
> > characters "c" and "h". That is why we could set the offset to 1 in our
> > testcases. After my commit, ICU finds the right Spanish locale, so
> > its behavior is compatible with the spec[2].
> >
> > One strange thing is that I cannot get the traditional Spanish locale
> > in RI. RI behaves the same no matter whether there is a variant
> > "TRADITIONAL" or not. The spec does not say anything about
> > "traditional", but I learned from a web search that since 1998 the
> > character "ch" has been dropped from Spanish. I suppose that RI changed
> > the behavior of the Spanish locale but forgot to modify the spec
> > accordingly.
> >
> > BTW, for the normal Spanish Locale (new Locale("es","ES")), we have the
> > same behavior as RI. It seems ICU supports traditional Spanish in
> > the form of new Locale("es","","TRADITIONAL") but RI does not. Run the
> > testcase below[3] on RI to show the differences.
> >
> > Is there any expert familiar with Spanish here? Need your advice.
> >
> > [1]
> > http://people.apache.org/~smishura/r628209/Windows_x86/classlib-test/
> >
> > [2]
> > spec says,
> > For example, consider the following in Spanish:
> >
> >  "ca" -> the first key is key('c') and second key is key('a').
> >  "cha" -> the first key is key('ch') and second key is key('a').
> >
> >
> > [3]
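> >         // Needs java.text.Collator, java.text.RuleBasedCollator,
> >         // java.text.CollationElementIterator, java.util.Locale,
> >         // and ICU4J on the classpath for the second half.
> >         RuleBasedCollator rbColl = (RuleBasedCollator) Collator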
> >                 .getInstance(new Locale("es", "", "TRADITIONAL"));
> >         String text = "cha";
> >         CollationElementIterator iterator = rbColl
> >                 .getCollationElementIterator(text);
> >         int keyNum = 0;
> >         while (iterator.next() != -1) {
> >             keyNum++;
> >         }
> >         System.out.println("RI has " + keyNum + " keys");
> >
> >         com.ibm.icu.text.RuleBasedCollator r =
> >                 (com.ibm.icu.text.RuleBasedCollator) com.ibm.icu.text.Collator
> >                         .getInstance(new Locale("es", "", "TRADITIONAL"));
> >         com.ibm.icu.text.CollationElementIterator it = r
> >                 .getCollationElementIterator(text);
> >         keyNum = 0;
> >         while (it.next() != -1) {
> >             keyNum++;
> >         }
> >         System.out.println("ICU has " + keyNum + " keys");
> >
> >
> >
> > The output is:
> > RI has 3 keys
> > ICU has 2 keys
> >
> >
> > --
> > Tony Wu
> > China Software Development Lab, IBM
> >
>


-- 
Tony Wu
China Software Development Lab, IBM
