Re: Unicode::Collate question

Jarkko Hietaniemi Mon, 01 Dec 2003 10:43:28 -0800

Ok, this is in line with how I understood this paragraph in perluniintro:

The short answer is that by default, Perl compares strings ("lt", "le", "cmp", "ge", "gt") based only on the code points of the characters. In the above case, the answer is "after", since 0x00C1 > 0x00C0.

So is it just by chance that these French words are accurately sorted?

I think a "qualified yes" here is in order...

% perl -Mutf8 -e 'binmode(STDOUT, ":utf8"); print join " ", sort qw(côte côté cote coté)'
cote coté côte côté


Is this the famous French "backwards accents" rule in action?
(http://www-clips.imag.fr/geta/gilles.serasset/tri-du-francais.html)
(no, I don't speak French)

But in this case, with those particular words, I think ISO Latin 1 (none
of the characters are beyond ISO Latin 1) just "happens" to work right.
o < ô, and e < é.
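
One way to check whether the backwards-accents rule is involved is to put the plain sort next to Unicode::Collate. The following is only a minimal sketch, assuming the Unicode::Collate that ships with Perl 5.8 or later and its documented "backwards" option; the variable names are mine, not from the thread.

```perl
use strict;
use warnings;
use utf8;                       # the word list below is written as UTF-8 literals
use Unicode::Collate;           # in the core distribution since Perl 5.8

binmode(STDOUT, ":encoding(UTF-8)");

my @words = qw(côte côté cote coté);

# 1. Built-in sort: compares code points only, no linguistic knowledge.
my @by_codepoint = sort @words;

# 2. Default Unicode Collation Algorithm: accents are a secondary
#    difference, compared from the start of the word.
my @by_uca = Unicode::Collate->new->sort(@words);

# 3. Same, but with level 2 (accents) compared from the end of the
#    word, which is how UTS #10 describes French accent ordering.
my @by_french = Unicode::Collate->new(backwards => 2)->sort(@words);

print "code points:    @by_codepoint\n";
print "UCA default:    @by_uca\n";
print "backwards => 2: @by_french\n";
```

If I read UTS #10 correctly, only the third line should put côte before coté, so the plain sort agreeing with the default collation for these four words says nothing about the French rule being applied.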

Some more links (database related since they have had to think about these things for years already) that hopefully explain some of the problems related to "linguistic sorting":

http://www.engin.umich.edu/caen/wls/software/oracle/server.901/a90236/ch4.htm
http://developer.mimer.com/documentation/html_92/Mimer_SQL_Engine_DocSet/Mimer_Concepts14.html


Thanks,
--
Eric Cholet

--
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
