Hi Paul,

On 11-12-07 10:29 AM, Roebuck,Paul L wrote:
Do this first and try again.

R>  Sys.setlocale("LC_COLLATE", "C")

OK I see it now (in ?Sys.setlocale):

  Sys.setlocale("LC_COLLATE", "C")   # turn off locale-specific sorting,
                                     #  usually

Thanks all for the answers!

I had never realized how counter-intuitive a collating sequence could
be. For example, the fact that LC_COLLATE=en_CA.UTF-8 doesn't preserve
the order of the strings when a common suffix is added to them is
scary. It also can't be that LC_COLLATE=en_CA.UTF-8 simply ignores the
'_' (underscores) and the '.' (dots): that can only be a first pass,
after which it needs to break ties in a way that defines a total
order. So it looks like the exact definition of this collating
sequence is both counter-intuitive and complicated.
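
For comparison, here is a minimal sketch of the same experiment under
the C collating sequence, using the strings from the original post
quoted further down. It assumes the session is allowed to switch
LC_COLLATE. The variable names (old, rank_x, rank_xa) are only for
illustration.

```r
# Strings from the original post, with and without a common suffix.
x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

old <- Sys.getlocale("LC_COLLATE")   # remember the current setting
Sys.setlocale("LC_COLLATE", "C")     # compare strings byte by byte

rank_x  <- rank(x)
rank_xa <- rank(xa)

Sys.setlocale("LC_COLLATE", old)     # restore the previous setting

# In the C locale "1" (49) < "2" (50) < "_" (95), so adding a common
# suffix cannot change the order.
stopifnot(identical(rank_x, rank_xa))
```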

Maybe that's just how things are and the developers that want
portability and reproducibility of their code are already putting
a Sys.setlocale("LC_COLLATE", "C") statement somewhere in their package
to force all their users to be on the same collating sequence.
It sounds a little bit drastic though and it might introduce some
conflicts with other packages.

So maybe a better approach is to only alter LC_COLLATE temporarily
inside the functions where it matters i.e. where the returned value
actually depends on the collating sequence? If I don't do this, then
there is no way I can write a test for my function because the
test would work for me but fail for someone else.
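
A minimal sketch of that temporary approach, assuming a plain
character vector as input. The helper name sort_names_c is
hypothetical and not part of any package. on.exit() restores the
caller's LC_COLLATE even if sort() fails.

```r
# Hypothetical helper: sort a character vector under the C collating
# sequence without permanently changing the session's locale.
sort_names_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  # Put the caller's collating sequence back when the function exits.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

sorted <- sort_names_c(c("_1_a", "1_9a", "2_9a"))
# sorted is now "1_9a" "2_9a" "_1_a", whatever the user's locale is
```

A test written against this result should then give the same answer
for every user, since it no longer depends on their LC_COLLATE.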

Actually this is the situation I was facing when I did my first post:
I have a function that downloads a list of sequences from the Ensembl
FTP server, sorts them by name, and returns them to the user. I have
a test for that function and the test was working for me when I was
doing

  tools::testInstalledPackage("MyPackage", types="tests")

but it was failing when I was doing 'R CMD check'. It seems that
the latter alters LC_COLLATE before running the tests (maybe to
LC_COLLATE=C) but not the former. I fixed this by enforcing
LC_COLLATE=C inside my function.
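
One way to check this guess, sketched under the assumption that a
line like the following is placed at the top of a test file: the
value then shows up in the test output of each runner and the two can
be compared. The variable name collate is only for illustration.

```r
# Report the collating sequence in effect while the tests run.
collate <- Sys.getlocale("LC_COLLATE")
cat("LC_COLLATE during tests:", collate, "\n")
```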

A naive question: wouldn't everything be simpler if LC_COLLATE=C
was the default for everybody?

Thanks,
H.



On 12/7/11 3:41 AM, "Hervé Pagès"<hpa...@fhcrc.org>  wrote:

Hi,

This looks OK:

x<- c("_1_", "1_9", "2_9")
rank(x)
[1] 1 2 3

But this does not:

xa<- paste(x, "a", sep="")
xa
[1] "_1_a" "1_9a" "2_9a"
rank(xa)
[1] 2 1 3

Cheers,
H.

sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.14.0




--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
