Re: Unicode::Collate question

SADAHIRO Tomoyuki Sat, 06 Dec 2003 01:24:51 -0800

> Has anyone had a look at the OpenI18N/ICU locale data?
> 
> The locales there are all UTF-8 and have java rule based collation data, so
> they *might* be useful for creating a more comprehensive (and accurate) set
> of sort modules? The downside is this data is pretty rough ATM but does
> seem to be improving slowly.
> 
> I guess p6 is going to use ICU as the basis for I18N - sure hope the APIs
> are easier though :)


The syntax of collation customization (tailoring) in ICU
 ( http://oss.software.ibm.com/icu/userguide/Collate_Customization.html )
is character-based and may be more intuitive:

   for French:
       "[backwards 2]&A << \u00e6/e <<< \u00c6/E"

   for Spanish:
       "&N < n\u0303 <<< N\u0303"
       "&C < ch <<< Ch <<< CH"
       "&l < ll <<< Ll <<< LL"

However, Unicode::Collate also allows linguistic tailoring.
Admittedly, its interface requires hard-coding the weights and
may be less user-friendly.

#!perl
use strict;
use warnings;
use Unicode::Collate;

our (@listEs, @listFr);

my $objEs = Unicode::Collate->new(
    entry => <<'ENTRY', # for allkeys-4.0.0.txt
0063 0068 ; [.0E6A.0020.0002.0063] # ch
0043 0068 ; [.0E6A.0020.0007.0043] # Ch
0043 0048 ; [.0E6A.0020.0008.0043] # CH
006C 006C ; [.0F4C.0020.0002.006C] # ll
004C 006C ; [.0F4C.0020.0007.004C] # Ll
004C 004C ; [.0F4C.0020.0008.004C] # LL
006E 0303 ; [.0F69.0020.0002.006E] # ñ
004E 0303 ; [.0F69.0020.0008.004E] # Ñ
ENTRY

#    entry => <<'ENTRY', # for allkeys-3.1.1.txt
#0063 0068 ; [.0A46.0020.0002.0063] # ch
#0043 0068 ; [.0A46.0020.0007.0043] # Ch
#0043 0048 ; [.0A46.0020.0008.0043] # CH
#006C 006C ; [.0B1C.0020.0002.006C] # ll
#004C 006C ; [.0B1C.0020.0007.004C] # Ll
#004C 004C ; [.0B1C.0020.0008.004C] # LL
#006E 0303 ; [.0B38.0020.0002.006E] # ñ
#004E 0303 ; [.0B38.0020.0008.004E] # Ñ
#ENTRY
);


my $objFr = Unicode::Collate->new(
    backwards => 2,

    entry => <<'ENTRY', # for allkeys-4.0.0.txt
00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae
00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE
ENTRY

#    entry => <<'ENTRY', # for allkeys-3.1.1.txt
#00E6 ; [.0A15.0020.0002.00E6][.0A65.0020.0002.00E6] # ae
#00C6 ; [.0A15.0020.0008.00C6][.0A65.0020.0008.00C6] # AE
#ENTRY
);

BEGIN {

@listEs = qw(
    cambio camelo camella camello Camerún cielo curso
    chico chile Chile CHILE chocolate
    espacio espanto español esperanza lama líquido luz
    llama Llama LLAMA llamar nos nueve ñu ojo
);

@listFr = (
  qw(
    cadurcien cæcum cÆCUM CæCUM CÆCUM caennais cæsium cafard
    coercitif cote côte Côte coté Coté côté Côté coter
    élève élevé gène gêne MÂCON maçon
    pèche PÈCHE pêche PÊCHE péché PÉCHÉ pécher pêcher
    relève relevé révèle révélé
    surélévation sûrement suréminent sûreté
    vice-consul vicennal vice-président vice-roi vicésimal),
  "vice versa", "vice-versa",
);

use Test;
plan tests => $#listEs + 2 + $#listFr + 2;

}

sub randomize { my %hash; @hash{@_} = (); keys %hash; } # ?!

# each adjacent pair must compare as strictly less-than
# under the tailored collation
for (my $i = 0; $i < $#listEs; $i++) {
    ok($objEs->lt($listEs[$i], $listEs[$i+1]));
}

for (my $i = 0; $i < $#listFr; $i++) {
    ok($objFr->lt($listFr[$i], $listFr[$i+1]));
}

our @randEs = randomize(@listEs);
our @sortEs = $objEs->sort(@randEs);

ok("@randEs" ne "@listEs");
ok("@sortEs" eq "@listEs");

our @randFr = randomize(@listFr);
our @sortFr = $objFr->sort(@randFr);

ok("@randFr" ne "@listFr");
ok("@sortFr" eq "@listFr");

__END__

Regards,
SADAHIRO Tomoyuki
