On 08 Nov 2016, at 13:18 , Pascal A. Niklaus <pascal.nikl...@ieu.uzh.ch> wrote:

> I just got caught by the way in which character vectors are sorted.
> 
> It seems that on my machine "sort" (and related functions like "order") only 
> considers punctuation characters (at least here the "+" and "-") 
> when there is no difference in the remaining characters:
> 
> > x1 <- c("-A","+A")
> > x2 <- c("+A","-A")
> > sort(x1)    # sorting is according to "-" and "+"
> [1] "-A" "+A"
> > sort(x2)
> [1] "-A" "+A"
> 
> > x3 <- c("-Aa","-Ab")
> > x4 <- c("-Aa","+Ab")
> > x5 <- c("+Aa","-Ab")
> > sort(x3)
> [1] "-Aa" "-Ab" # here the "+" and "-" are ignored
> > sort(x4)
> [1] "-Aa" "+Ab"
> > sort(x5)
> [1] "+Aa" "-Ab"
> 
> I understand from the help that this depends on how characters are collated, 
> and that this scheme follows the multi-level comparison in Unicode 
> (http://www.unicode.org/reports/tr10/).
> 
> However, what I need is a strict left-to-right comparison of the sort 
> provided by strcmp or wcscmp in glibc. The particular ordering of special 
> characters is not so important, but there should be no "multi-level" aspect 
> to the sorting.
> 
> Is there a way to achieve this in R?
> 

I'd try one of two ways (the above is not happening for me, so I cannot test):

(1) Temporarily set the collation locale to "C": Sys.setlocale("LC_COLLATE", "C"). 
That should work as long as you stay in good ol' ASCII.
(2) Figure out (Don't look at me!) how to diddle the ICU settings for your 
system; icuSetCollate() is claimed to be your friend.
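
A minimal sketch of option (1), untested on the original poster's setup: the 
helper name sort_c() is made up for illustration. It switches LC_COLLATE to "C" 
for one sort and restores the previous setting on exit. The last lines use 
sort(method = "radix"), which, as far as I know from ?sort, always collates 
character vectors as in the C locale; treat that as an assumption to check 
against your R version.

```r
# Hypothetical helper: sort a character vector with C-locale (byte-wise,
# strictly left-to-right) collation, then restore the previous collation.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the caller's collation setting even if sort() fails.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

x4 <- c("-Aa", "+Ab")
x5 <- c("+Aa", "-Ab")

# In ASCII, "+" (0x2B) sorts before "-" (0x2D), so the first character decides.
sort_c(x4)
sort_c(x5)

# Possible alternative without touching the locale (assumes R >= 3.3.0).
sort(x4, method = "radix")
```

With strict left-to-right comparison, the "+" entries should come first in both 
cases, regardless of the letters that follow.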

-pd


> Thanks for your help
> 
> Pascal
> 
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk  Priv: pda...@gmail.com
