Your 'nightmare' does seem specific to Mac OS. Your example is collated
correctly in all the es_ES locales on my Linux box, and also in
es_ES.UTF-8 on Solaris 10.
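For anyone who wants to check what their own C library does, here is a minimal sketch. The function name collate_in_locale is made up for illustration and is not part of R. It switches LC_COLLATE to a named locale, compares two strings with strcoll, and restores the previous setting. Whether a locale such as es_ES.UTF-8 is installed, and what order it gives, depends on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from R): compare a and b with strcoll() under
 * the named collation locale.  Returns 1 and stores -1, 0 or 1 in *sign
 * if the locale could be selected, and 0 otherwise. */
int collate_in_locale(const char *locale_name, const char *a,
                      const char *b, int *sign)
{
    const char *current;
    char *saved = NULL;
    int r;

    assert(locale_name != NULL && a != NULL && b != NULL && sign != NULL);

    /* Copy the current setting: a later setlocale() call may overwrite
     * the string that this call returns. */
    current = setlocale(LC_COLLATE, NULL);
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved == NULL)
            return 0;
        strcpy(saved, current);
    }

    if (setlocale(LC_COLLATE, locale_name) == NULL) {
        free(saved);
        return 0;               /* locale not available on this system */
    }

    r = strcoll(a, b);
    *sign = (r > 0) - (r < 0);

    /* Put the previous collation locale back. */
    if (saved != NULL)
        setlocale(LC_COLLATE, saved);
    free(saved);
    return 1;
}
```

Calling it with "es_ES.UTF-8" and the UTF-8 bytes of two words from the example shows whether that system puts accented words among the unaccented ones. I make no claim here about what any particular platform returns.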
We have no idea what data you collected to assert 'whatever platform we
use'. UTF-8 locales on Mac OS X are the only instance in C where I am
aware of the use of Unicode code point order (quite a few scripting
languages do it, though). If the problem were widespread I would expect it to be
reported more than it is (and 'ls' output is locale-specific in recent
versions of Linux, and my IT team did get several help requests about
that).
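To make 'code point order' concrete, here is a small sketch (the function names are mine, for illustration only). It sorts UTF-8 strings with strcmp, which compares bytes as unsigned char. For valid UTF-8, byte order and code point order agree, so every word beginning with an accented letter sorts after every word beginning with an ASCII letter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator: plain byte-wise comparison of two C strings. */
static int cmp_bytes(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcmp(a, b);
}

/* Illustrative helper: sort an array of UTF-8 strings in byte order,
 * which for valid UTF-8 is the same as Unicode code point order. */
void sort_code_point_order(const char **words, size_t n)
{
    assert(words != NULL || n == 0);
    if (n > 0)
        qsort(words, n, sizeof *words, cmp_bytes);
}
```

Applied to the UTF-8 encodings of avión, barco, ángulo, tren and aleta, this gives aleta, avión, barco, tren, ángulo: the lead byte of á is 0xC3, which is larger than any ASCII letter.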
Collation is a tricky area, but that does not mean that OS designers are
in general shy of it. There was a concerted project, the Unicode
Collation Algorithm, and several OSes have implementations including
national 'tailorings'.
What can be done about it? The obvious answer is to use a reliable OS.
Alternatively, R is making use of the system's C collation functions and
those could be replaced. In current R (>= 2.7.0) this is centralized in
src/main/utils.c, in the code (not Windows)
# ifdef HAVE_STRCOLL
# define STRCOLL strcoll
# else
# define STRCOLL strcmp
# endif
int Scollate(SEXP a, SEXP b)
{
    return STRCOLL(translateChar(a), translateChar(b));
}
Mac OS X has strcoll (it is a C99 function, so that test is historical),
and what would be needed would be to replace it by a more functional
version. My suspicion is that Mac OS X does have proper collation
functionality (http://en.wikipedia.org/wiki/Common_Locale_Data_Repository
appears to claim it uses CLDR data), but that it is not used in the ISO
C99 part of the OS. For example, Cocoa seems to have a function
'localizedCompare'.
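As a rough idea of what a replacement comparison might look like, here is a toy sketch. It is an assumption for illustration only: it is not code from R, it is not Cocoa's 'localizedCompare', and it is not the Unicode Collation Algorithm. The names next_weight and toy_spanish_collate are made up. It handles only UTF-8 input and only the characters mentioned in the quoted message: accented vowels are folded to their base letters, case is ignored, ñ is placed between n and o, and ch and ll are treated as ordinary letter pairs. Ties are broken by byte order.

```c
#include <assert.h>
#include <string.h>

/* Weight of the next character of a UTF-8 string; advances *p past it.
 * Returns -1 at the end of the string.  Letters get weights above 255 so
 * that ñ can be slotted in between n and o. */
static int next_weight(const unsigned char **p)
{
    /* Second bytes (after 0xC3) of á é í ó ú Á É Í Ó Ú, and their bases. */
    static const unsigned char accented[] = {
        0xA1, 0xA9, 0xAD, 0xB3, 0xBA, 0x81, 0x89, 0x8D, 0x93, 0x9A
    };
    static const char base[] = "aeiouaeiou";
    const unsigned char *s = *p;
    unsigned char c = s[0];
    size_t i;

    if (c == '\0')
        return -1;
    if (c == 0xC3 && s[1] != '\0') {
        if (s[1] == 0xB1 || s[1] == 0x91) {      /* ñ or Ñ */
            *p = s + 2;
            return 256 + 2 * ('n' - 'a') + 1;
        }
        for (i = 0; i < sizeof accented; i++) {
            if (s[1] == accented[i]) {           /* accented vowel */
                *p = s + 2;
                return 256 + 2 * (base[i] - 'a');
            }
        }
    }
    *p = s + 1;
    if (c >= 'A' && c <= 'Z')
        c = (unsigned char)(c - 'A' + 'a');
    if (c >= 'a' && c <= 'z')
        return 256 + 2 * (c - 'a');
    return c;                                    /* anything else: raw byte */
}

/* Toy comparison with the same calling convention as strcoll(). */
int toy_spanish_collate(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;

    assert(a != NULL && b != NULL);
    for (;;) {
        int wa = next_weight(&pa);
        int wb = next_weight(&pb);
        if (wa != wb)
            return wa < wb ? -1 : 1;
        if (wa == -1)
            return strcmp(a, b);   /* same letters: fall back to bytes */
    }
}
```

A function of this shape could in principle stand in where STRCOLL is used, but a real replacement would need proper locale data rather than a hard-wired table like this one.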
BTW, Ei-ji Nakama has already replaced the broken wctype and wcwidth
functions in Mac OS: see file src/main/rlocale.c
On Wed, 16 Apr 2008, [Ricardo Rodriguez] Your XEN ICT Team wrote:
Hi,
This issue comes from a thread of the same title, "a question of
alphabetical order", initiated yesterday in [EMAIL PROTECTED] list.
As it affects now only Mac environment, I follow Brian Ripley's advice
and move it to this list.
It is now clear that ordering lists/variable values is a kind of
nightmare whatever platform we use. As I (and possibly many others!)
need to get a right order, or an "as right as possible" order, for lists
of strings using non-ASCII characters, namely áéíóú, ÁÉÍÓÚ and ñ, Ñ, we
have been considering a number of options.
Hans-Joerg Bibiko proposed a customized function to do the trick. Brian
Ripley spoke about es_ES.ISO8859-15 doing almost the right thing for
these characters.
Here is what I get working on a MacBook, whose environment I describe at
the bottom of the message:
http://mire.environmentalchange.net/~webmaster/images/toPlot.png
Here is the code:
png(file="toPlot.png", pointsize = 14, width = 1000, height = 480, units
= "px", bg="#eaedd5")
Sys.setlocale(category = "LC_ALL", locale = "es_ES.ISO8859-15")
toPlot <- data.frame(medio=c("avión", "barco", "bicicleta", "ángulo",
"choco", "camión", "coche", "tren", "aleta", "luna", "llave"),
variable=c(34, 33, 3, 37, 54, 23, 67, 30, 23, 56, 13))
toPlot<-toPlot[order(toPlot$medio),]
Sys.setlocale(category = "LC_ALL", locale = "en_GB.UTF-8")
barplot(toPlot$variable,names.arg=toPlot$medio)
dev.off()
As you can see from the order of the labels, accents are not ignored, and
ch and ll are treated as single letters. These are no longer the case in
Spanish alphabetical order. It changed in 1994.
So Hans's solution seems to be the only one available for getting the
correct order, at least in the environment described below.
In any case, please,
1. Are you aware of any new locale we could try to see if it is already
updated?
2. If it doesn't exist, how/where must we go to propose/start creating
such a locale?
Here is the environment:
> version
_
platform i386-apple-darwin9.2.2
arch i386
os darwin9.2.2
system i386, darwin9.2.2
status beta
major 2
minor 7.0
year 2008
month 04
day 12
svn rev 45280
language R
version.string R version 2.7.0 beta (2008-04-12 r45280)
> sessionInfo()
R version 2.7.0 beta (2008-04-12 r45280)
i386-apple-darwin9.2.2
locale:
en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
>
R GUI 1.24-devel (5072)
Thank you so much for your help,
Ricardo
--
Ricardo Rodríguez
Your XEN ICT Team
_______________________________________________
R-SIG-Mac mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595