> Has anyone had a look at the OpenI18N/ICU locale data? > > The locales there are all UTF-8 and have java rule based collation data, so > they *might* be useful for creating a more comprehensive (and accurate) set > of sort modules? The downside is this data is pretty rough ATM but does > seem to be improving slowly. > > I guess p6 is going to use ICU as the basis for I18N - sure hope the APIs > are easier though :)
The syntax of collation customization (tailoring) in ICU ( http://oss.software.ibm.com/icu/userguide/Collate_Customization.html ) is character-based and may be more intuitive: for French: "[backwards 2]&A << \u00e6/e <<< \u00c6/E" for Spanish: "&N < n\u0303 <<< N\u0303" "&C < ch <<< Ch <<< CH" "&l < ll <<< Ll <<< LL" However Unicode::Collate also allows linguistic tailoring. Certainly its interface requires hard code of weights and may be less user-friendly. #!perl use strict; use warnings; use Unicode::Collate; our (@listEs, @listFr); my $objEs = Unicode::Collate->new( entry => <<'ENTRY', # for allkeys-4.0.0.txt 0063 0068 ; [.0E6A.0020.0002.0063] # ch 0043 0068 ; [.0E6A.0020.0007.0043] # Ch 0043 0048 ; [.0E6A.0020.0008.0043] # Ch 006C 006C ; [.0F4C.0020.0002.006C] # ll 004C 006C ; [.0F4C.0020.0007.004C] # Ll 004C 004C ; [.0F4C.0020.0008.004C] # LL 006E 0303 ; [.0F69.0020.0002.006E] # ñ 004E 0303 ; [.0F69.0020.0008.004E] # Ñ ENTRY # entry => <<'ENTRY', # for allkeys-3.1.1.txt #0063 0068 ; [.0A46.0020.0002.0063] # ch #0043 0068 ; [.0A46.0020.0007.0043] # Ch #0043 0048 ; [.0A46.0020.0008.0043] # Ch #006C 006C ; [.0B1C.0020.0002.006C] # ll #004C 006C ; [.0B1C.0020.0007.004C] # Ll #004C 004C ; [.0B1C.0020.0008.004C] # LL #006E 0303 ; [.0B38.0020.0002.006E] # ñ #004E 0303 ; [.0B38.0020.0008.004E] # Ñ #ENTRY ); my $objFr = Unicode::Collate->new( backwards => 2, entry => <<'ENTRY', # for allkeys-4.0.0.txt 00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae 00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ENTRY # entry => <<'ENTRY', # for allkeys-3.1.1.txt #00E6 ; [.0A15.0020.0002.00E6][.0A65.0020.0002.00E6] # ae #00C6 ; [.0A15.0020.0008.00C6][.0A65.0020.0008.00C6] # AE #ENTRY ); BEGIN { @listEs = qw( cambio camelo camella camello Camerún cielo curso chico chile Chile CHILE chocolate espacio espanto español esperanza lama líquido luz llama Llama LLAMA llamar nos nueve ñu ojo ); @listFr = ( qw( cadurcien cæcum cÆCUM CæCUM CÆCUM caennais cæsium cafard coercitif cote côte Côte coté Coté côté Côté coter élève élevé gène gêne MÂCON maçon pèche PÈCHE pêche PÊCHE péché PÉCHÉ pécher pêcher relève relevé révèle révélé surélévation sûrement suréminent sûreté vice-consul vicennal vice-président vice-roi vicésimal), "vice versa", "vice-versa", ); use Test; plan tests => $#listEs + 2 + $#listFr + 2; } sub randomize { my %hash; @[EMAIL PROTECTED] = (); keys %hash; } # ?! for (my $i = 0; $i < $#listEs; $i++) { ok($objEs->lt($listEs[$i], $listEs[$i+1])); } for (my $i = 0; $i < $#listFr; $i++) { ok($objFr->lt($listFr[$i], $listFr[$i+1])); } our @randEs = randomize(@listEs); our @sortEs = $objEs->sort(@randEs); ok("@randEs" ne "@listEs"); ok("@sortEs" eq "@listEs"); our @randFr = randomize(@listFr); our @sortFr = $objFr->sort(@randFr); ok("@randFr" ne "@listFr"); ok("@sortFr" eq "@listFr"); __END__ Regards, SADAHIRO Tomoyuki