Your 'nightmare' does seem specific to Mac OS. Your example is collated correctly in all the es_ES locales on my Linux box, and also in es_ES.UTF-8 on Solaris 10.

We have no idea what data you collected to assert 'whatever platform we use'. UTF-8 locales on Mac OS X are the only instance in C where I am aware of the use of Unicode code point order (quite a few scripting languages do it, though). If the problem were widespread I would expect it to be reported more than it is (and 'ls' output is locale-specific in recent versions of Linux, and my IT team did get several help requests about that).

Collation is a tricky area, but that does not mean that OS designers are in general shy of it. There was a concerted project, the Unicode Collation Algorithm, and several OSes have implementations including national 'tailorings'.

What can be done about it? The obvious answer is to use a reliable OS. Alternatively, R is making use of the system's C collation functions and those could be replaced. In current R (>= 2.7.0) this is centralized in src/main/utils.c, in the code (not Windows)

# ifdef HAVE_STRCOLL
#  define STRCOLL strcoll
# else
#  define STRCOLL strcmp
# endif

int Scollate(SEXP a, SEXP b)
{
    return STRCOLL(translateChar(a), translateChar(b));
}

Mac OS X has strcoll (it is a C99 function, so that test is historical), and what would be needed would be to replace it by a more functional version. My suspicion is that Mac OS X does have proper collation functionality (http://en.wikipedia.org/wiki/Common_Locale_Data_Repository appears to claim it uses CLDR data), but that it is not used in the ISO C99 part of the OS. For example, Cocoa seems to have a function 'localizedCompare'.
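
To make the role of strcoll concrete, here is a minimal sketch (not R's code) of sorting an array of strings through the C collation function. The helper names collate_cmp and sort_in_locale are made up for illustration. Whether a given locale name is installed, and whether its collation is sensible, depends entirely on the OS.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of R): qsort comparator that defers to
   strcoll, i.e. to the collation rules of the current LC_COLLATE locale. */
int collate_cmp(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcoll(a, b);
}

/* Hypothetical helper: switch LC_COLLATE to the named locale and sort
   the n strings in place. Returns 0 on success, or -1 if setlocale
   reports that the locale could not be selected. */
int sort_in_locale(const char **words, size_t n, const char *locale)
{
    if (setlocale(LC_COLLATE, locale) == NULL)
        return -1;
    qsort(words, n, sizeof words[0], collate_cmp);
    return 0;
}
```

Called with the locale "C", this gives plain byte order, the same as strcmp. Called with something like "es_ES.UTF-8", the result is only as good as the system's strcoll for that locale, so a replacement collation routine would be substituted at the strcoll call.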


BTW, Ei-ji Nakama has already replaced the broken wctype and wcwidth functions in Mac OS: see file src/main/rlocale.c


On Wed, 16 Apr 2008, [Ricardo Rodriguez] Your XEN ICT Team wrote:

Hi,

This issue comes from a thread of the same title, "a question of
alphabetical order", initiated yesterday in [EMAIL PROTECTED] list.
As it now affects only the Mac environment, I follow Brian Ripley's
advice and move it to this list.

It is now clear that ordering lists/variable values is a kind of
nightmare whatever platform we use. As I (and possibly many others!)
need to get a right order, or an "as right as possible" order, for lists
of strings using non-ASCII characters, namely áéíóú, ÁÉÍÓÚ and ñ,Ñ, we
have been considering a number of options.

Hans-Joerg Bibiko proposed a customized function to do the trick. Brian
Ripley spoke about es_ES.ISO8859-15 doing almost the right thing for
these characters.

Here is what I get working on a MacBook whose environment I describe at
the bottom of the message:

http://mire.environmentalchange.net/~webmaster/images/toPlot.png

Here is the code:

png(file="toPlot.png", pointsize = 14, width = 1000, height = 480, units
= "px", bg="#eaedd5")
Sys.setlocale(category = "LC_ALL", locale = "es_ES.ISO8859-15")
toPlot <- data.frame(medio=c("avión", "barco", "bicicleta", "ángulo",
"choco", "camión", "coche", "tren", "aleta", "luna", "llave"),
variable=c(34, 33, 3, 37, 54, 23, 67, 30, 23, 56, 13))
toPlot<-toPlot[order(toPlot$medio),]
Sys.setlocale(category = "LC_ALL", locale = "en_GB.UTF-8")
barplot(toPlot$variable,names.arg=toPlot$medio)
dev.off()

As you can see from the order of the labels, the accent is not ignored,
and ch and ll are treated as single letters. These are no longer the
case in Spanish alphabetical order. It changed in 1994.

So, Hans's solution seems to be the only one available to get the
correct order, at least when working within the environment described
below.

In any case, please,

1. Are you aware of any new locale we could try to see if it is already
updated?
2. If it doesn't exist, how/where must we go to propose/start creating
such a locale?

Here is the environment:

> version
              _
platform       i386-apple-darwin9.2.2
arch           i386
os             darwin9.2.2
system         i386, darwin9.2.2
status         beta
major          2
minor          7.0
year           2008
month          04
day            12
svn rev        45280
language       R
version.string R version 2.7.0 beta (2008-04-12 r45280)
> sessionInfo()
R version 2.7.0 beta (2008-04-12 r45280)
i386-apple-darwin9.2.2

locale:
en_GB.UTF-8/en_GB.UTF-8/C/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
>

R GUI 1.24-devel (5072)

Thank you so much for your help,

Ricardo

--
Ricardo Rodríguez
Your XEN ICT Team

_______________________________________________
R-SIG-Mac mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-mac


--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595