Duncan and Martin and ...,

There are multiple groups where people discuss R and this seems to be a help 
group. The topic keeps coming up as to whether you should teach anything other 
than base R and I claim it depends. 

Many packages are indeed written mostly using base R or using other packages 
that ultimately use only base R constructions. So, obviously, you can get 
around those packages by writing your own, often complicated, code to do the 
things you want. Some have parts written in compiled languages such as C, and 
that makes it harder.

I think the debate is not one-sided. The whole purpose of creating and sharing 
libraries is to make life easier for those who find tools that already exist 
and that may even have been tested by others and deemed worthy. For people 
with serious programming needs, I suggest that often 80% of your code can be 
made shorter, more powerful AND easier to read by using packages than by 
writing many times more fairly complex code in the base language.

BUT, if someone is teaching a language like R one step at a time and a student 
asks a question, it is very wrong to tell them of a package to do all the work 
when clearly they need practice using a small subset of base functionality. If 
they are asked to implement a sort algorithm using say a recursive merge sort, 
then the assignment involves using functions that call themselves and so on.
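
For concreteness, here is roughly what such an assignment asks the student to write. This is only a sketch in base R; the name merge_sort is my own choice, and in real work you would simply call sort().

```r
# Recursive merge sort: split the vector, sort each half by calling
# the same function again, then merge the two sorted halves.
merge_sort <- function(v) {
  n <- length(v)
  if (n <= 1) return(v)              # nothing to sort
  mid   <- n %/% 2
  left  <- merge_sort(v[seq_len(mid)])
  right <- merge_sort(v[(mid + 1):n])

  out <- v                           # same type and length as the input
  i <- 1
  j <- 1
  for (k in seq_len(n)) {
    # Take from the left half if the right half is used up, or if the
    # left half still has items and its next item is not larger.
    if (j > length(right) || (i <= length(left) && left[i] <= right[j])) {
      out[k] <- left[i]
      i <- i + 1
    } else {
      out[k] <- right[j]
      j <- j + 1
    }
  }
  out
}

merge_sort(c(5, 2, 9, 1, 5, 6))
# [1] 1 2 5 5 6 9
```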

So I was thinking of how I might have dealt with finding the unique members of 
a vector (called "temp" below) that contains character text such as "2.2", 
"99.6", "10.1" and "1.1", where all entries are POTENTIALLY numeric and you 
want them sorted as if they were numeric but returned as character data. My 
first attempt at it was this:

as.character(sort(as.numeric(unique(temp))))
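
To make that concrete, here is a small sketch. The contents of temp are made up for illustration, since no actual data appear in this thread. The second form is a variation that returns the original strings unchanged, which matters if the text is something like "2.20" that as.character() would hand back as "2.2".

```r
# Made-up data for illustration only.
temp <- c("2.2", "99.6", "10.1", "1.1", "2.2", "10.1")

# The one-liner: drop duplicates, convert, sort numerically, convert back.
res1 <- as.character(sort(as.numeric(unique(temp))))
res1
# [1] "1.1"  "2.2"  "10.1" "99.6"

# Variation: order the unique strings by their numeric value, so the
# strings themselves are never converted and converted back.
u <- unique(temp)
res2 <- u[order(as.numeric(u))]
res2
# [1] "1.1"  "2.2"  "10.1" "99.6"
```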

Then I considered what to do if there is no guarantee that everything in temp 
can be coerced by as.numeric(). My first try at extending temp stopped with an 
error and temp2 was not created, but only because I typed a period where a 
comma belonged, so R could not even parse the line:

  > as.numeric(c(temp. "blab")) -> temp2
  Error: unexpected string constant in "as.numeric(c(temp. "blab""

With the comma in place, as.numeric(c(temp, "blab")) does not stop with an 
error. It returns NA for "blab" and warns that NAs were introduced by 
coercion, and sort() then silently drops the NA by default.

So to handle that I might change the one-liner into multiple lines and check 
for entries that fail to convert, for example by testing the result with 
is.na() or by trapping the warning with tryCatch(). But then what is the 
requirement when an entry fails? One possibility might be to remove such 
entries and another is to add them back after sorting the rest, but add them 
back where? If there are many, they all need to be sorted. Or should they 
become an NA, which is tolerated?
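
One way to make those choices explicit is sketched below. The function name sort_unique_numeric and its non_numeric argument are hypothetical, invented here for illustration and not taken from any package. The sketch relies on as.numeric() returning NA (with a warning) for text it cannot convert, and it treats a genuine NA in the input the same way as unconvertible text.

```r
# Hypothetical helper: unique values sorted numerically, returned as text.
# non_numeric says what to do with entries that are not numbers:
#   "last" - append them, sorted as text, after the numeric entries
#   "drop" - leave them out
#   "na"   - represent all of them by a single trailing NA
sort_unique_numeric <- function(x, non_numeric = c("last", "drop", "na")) {
  non_numeric <- match.arg(non_numeric)
  u   <- unique(x)
  num <- suppressWarnings(as.numeric(u))  # NA where coercion fails
  bad <- is.na(num)
  good <- u[!bad][order(num[!bad])]       # original strings, numeric order
  switch(non_numeric,
         last = c(good, sort(u[bad])),
         drop = good,
         na   = c(good, if (any(bad)) NA_character_))
}

x <- c("2.2", "99.6", "blab", "10.1", "1.1", "2.2")
sort_unique_numeric(x)
# [1] "1.1"  "2.2"  "10.1" "99.6" "blab"
sort_unique_numeric(x, "drop")
# [1] "1.1"  "2.2"  "10.1" "99.6"
```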

Get the idea?

It can take non-trivial work. Now if the package was made carefully, perhaps 
with options letting you specify among such choices as above, and it is 
trusted, why not use that if your needs may be complex?

One final note. In some places the functionality of unique() is done by a 
method that first sorts the data and then simply removes adjacent duplicates. 
In those cases, the data would come out sorted or in reverse sort order. But 
that is not what the built-in unique() does, as it basically keeps track of 
what it has encountered and keeps each new item only if it has not been seen 
already. This produces output in the order of first appearance. Compare how 
factors are done in R. Each distinct value is stored once as a level and the 
data themselves are stored as small integers, so duplicates all share the same 
integer value. By default factor() sorts the levels, so level 1 is the 
smallest value rather than the first item found. If you want the levels in 
order of appearance, or some other order, there are fairly easy ways to do 
that with the levels argument, but there are packages like forcats that do 
such things often more consistently and easier. In a sense, you can do a 
sorted unique by simply making your data into a factor and asking for 
levels(whatever). Effectively, you are using it as a kind of set, and there 
are ways to also use sets in R.
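
A quick way to check what order each approach gives, using a made-up vector x (my own example data):

```r
x <- c("pear", "apple", "pear", "fig", "apple")

# unique() keeps values in order of first appearance.
unique(x)
# [1] "pear"  "apple" "fig"

# factor() sorts its levels by default, so levels() is a sorted unique.
f <- factor(x)
levels(f)
# [1] "apple" "fig"   "pear"

# The data are stored as integer codes pointing into the levels.
as.integer(f)
# [1] 3 1 3 2 1

# Levels in order of appearance, if that is what you want.
levels(factor(x, levels = unique(x)))
# [1] "pear"  "apple" "fig"

# The sort-then-drop-adjacent-duplicates method described above.
s <- sort(x)
s[c(TRUE, s[-1] != s[-length(s)])]
# [1] "apple" "fig"   "pear"
```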

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Duncan Murdoch
Sent: Monday, December 20, 2021 12:51 PM
To: Martin Maechler <maech...@stat.math.ethz.ch>; Rui Barradas 
<ruipbarra...@sapo.pt>
Cc: Stephen H. Dawson, DSL via R-help <r-help@r-project.org>
Subject: Re: [R] Adding SORT to UNIQUE

On 20/12/2021 12:32 p.m., Martin Maechler wrote:
>>>>>> Rui Barradas
>>>>>>      on Mon, 20 Dec 2021 17:05:33 +0000 writes:
> 
>      > Hello,
>      > Package stringr has functions str_sort and str_order, both with an
>      > argument 'numeric' that will sort the numbers correctly.
>      > Maybe that's what you are looking for, see the example below.
> 
> 
>      > x <- sample(sprintf("ab%d", 1:20))     # shuffle the vector
>      > stringr::str_sort(x, numeric = TRUE)   # sort considering the numbers
> 
> Again:
> There's really no need to use non-base R here (and in almost all such 
> questions about string handling!) as Avi Gross' answer shows.

That gives a different sort order:

  stringr::str_sort(x, numeric = TRUE)

gives

 [1] "ab1"  "ab2"  "ab3"  "ab4"  "ab5"  "ab6"  "ab7"  "ab8"  "ab9"  "ab10" "ab11" "ab12" "ab13" "ab14" "ab15" "ab16" "ab17"
[18] "ab18" "ab19" "ab20"

(with the numbers in order), while sort(x) gives

 [1] "ab1"  "ab10" "ab11" "ab12" "ab13" "ab14" "ab15" "ab16" "ab17" "ab18" "ab19" "ab2"  "ab20" "ab3"  "ab4"  "ab5"  "ab6"
[18] "ab7"  "ab8"  "ab9"

with the characters in order.  I don't think the "numeric" option is available 
in base R (though of course you could write a function to do it, so there's no 
*need*, but it's certainly more convenient to use the stringr function if 
that's the order you want).
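
A minimal base R sketch of such a function, assuming every string is a non-digit prefix followed by one run of digits, as in the "ab%d" example. The name natural_order is hypothetical, and stringr::str_sort() handles far more general input than this does.

```r
# Hypothetical helper: ordering by text prefix, then by trailing number.
natural_order <- function(x) {
  prefix <- sub("[0-9]+$", "", x)                 # "ab10" -> "ab"
  number <- as.numeric(sub("^[^0-9]*", "", x))    # "ab10" -> 10
  order(prefix, number)
}

x <- sprintf("ab%d", c(10, 2, 20, 1, 3))
x[natural_order(x)]
# [1] "ab1"  "ab2"  "ab3"  "ab10" "ab20"
```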

Duncan Murdoch


> 
> 
>      > Hope this helps,
> 
>      > Rui Barradas
> 
> 
>      > At 16:58 on 20/12/21, Stephen H. Dawson, DSL via R-help wrote:
>      >> Hi,
>      >>
>      >>
>      >> Running a simple syntax set to review entries in dataframe columns.
>      >> Here is the working code.
>      >>
>      >> Data <- read.csv("./input/Source.csv", header=T)
>      >> describe(Data)
>      >> summary(Data)
>      >> unique(Data[1])
>      >> unique(Data[2])
>      >> unique(Data[3])
>      >> unique(Data[4])
>      >>
>      >> I would like to sort the unique entries. The data in the various
>      >> columns are not defined as numbers, but also text. I realize 1 and 10
>      >> will not sort properly, as the column is not defined as a number, but
>      >> want to see what I have in the columns viewed as sorted.
>      >>
>      >> QUESTION
>      >> What is the best process to sort unique output, please?
>      >>
>      >>
>      >> Thanks.
> 
>      > ______________________________________________
>      > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>      > https://stat.ethz.ch/mailman/listinfo/r-help
>      > PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
>      > and provide commented, minimal, self-contained, reproducible code.
> 