Re: [R] How to create a new column based on the values from multiple columns which are matching a particular string?

Dénes Tóth Mon, 29 Jul 2019 16:34:04 -0700

Hi Bert,

see inline.


On 7/30/19 1:12 AM, Bert Gunter wrote:

While Eric's solution is correct (mod "corner" cases like all NAs in
a row), it can be made considerably more efficient.

One minor improvement can be made by using the idiom
any(x == "A")
instead of matching via %in% for the simple case of matching just a
single value.
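
A small illustration (not from the original posts) of how the two idioms treat missing values, which is presumably why fun2 below passes na.rm=TRUE. The vectors x and y are made up for this sketch:

```r
# Made-up example rows; not taken from the data in this thread.
x <- c(NA, "B", "C")   # no "A", one missing value
y <- c(NA, "A", "C")   # contains "A" and a missing value

in_x    <- "A" %in% x                  # %in% never returns NA
any_x   <- any(x == "A")               # NA == "A" is NA, so any() cannot decide
any_x_rm <- any(x == "A", na.rm = TRUE) # NAs dropped before any() is evaluated
any_y   <- any(y == "A")               # one TRUE is enough, even with an NA present

print(c(in_x = in_x, any_x = any_x, any_x_rm = any_x_rm, any_y = any_y))
```

So any(x == what) only matches what %in% x on data with NAs when na.rm = TRUE is supplied.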

However, a considerable improvement can be made by getting fancy,
taking advantage of do.call() and the pmax() function to mostly
vectorize the calculation. Here are the details and timing on a large
data frame.

(Note: I removed the names in the %in% approach for simplicity. It has
almost no effect on timings.
I also moved the as.integer() call out of the function so that it is
called only once at the end, which improves efficiency a bit)

1. Eric's original:
fun1 <-function(df,what)
{
   as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v })))
}

2. Using any( x == "A") instead:
fun2 <- function(df,what)
{
    as.integer(unname(apply(df, MARGIN = 1,
                            function(x) any(x == what, na.rm = TRUE))))
}

3. Getting fancy to use pmax()
fun3 <- function(df,what)
{
    z <- lapply(df,function(x)as.integer((x==what)))
    do.call(pmax,c(z,na.rm=TRUE))
}
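
To make the two steps inside fun3 visible, here is a sketch on a tiny made-up data frame (the name 'dat' and its columns are hypothetical, not from the thread). It also shows the all-NA corner case mentioned at the top: pmax() with na.rm=TRUE still returns NA when every value in a position is NA, whereas the %in% version returns 0 there.

```r
# Hypothetical toy data for illustration only.
dat <- data.frame(
  a = c("A", "B", NA, NA),
  b = c("C", "A", "B", NA),
  stringsAsFactors = FALSE
)

# Step 1: one integer vector per column (1 = match, 0 = no match, NA = missing).
z <- lapply(dat, function(x) as.integer(x == "A"))

# Step 2: element-wise maximum across those vectors; this is the same call as
# pmax(z$a, z$b, na.rm = TRUE), built with do.call() so it works for any
# number of columns.
res <- do.call(pmax, c(z, na.rm = TRUE))
print(res)

# Row 4 is NA in every column, so the result there is NA rather than 0.
# One possible way to align it with the %in% version:
res0 <- res
res0[is.na(res0)] <- 0L
print(res0)
```

Whether an all-NA row should count as 0 or as NA depends on the analysis; the replacement step is just one option.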

Here are the timings:

bigdf <- df[rep(1:10,1e4), rep(1:5, 50)]
dim(bigdf)

[1] 100000    250

system.time(res1 <- fun1(bigdf, "A"))

    user  system elapsed
   2.204   0.432   2.637


system.time(res2 <- fun2(bigdf, "A"))

    user  system elapsed
   1.898   0.403   2.302


system.time(res3 <- fun3(bigdf, "A"))

    user  system elapsed
   0.187   0.048   0.235

## 10 times faster!


all.equal(res1,res2)

[1] TRUE

all.equal(res1,res3)

[1] TRUE


NB: I freely admit that Eric's original solution may well be perfectly
adequate, and the speed improvement is pointless. In that case, maybe
this is at least somewhat instructive for someone.

Nevertheless, I would welcome further suggestions for improvement, as
I suspect my "fancy" approach is still a ways from what one can do (in
R code, without resorting to C++).


fun4 <- function(df, what)
{
  as.integer(rowSums(df == what, na.rm = TRUE) > 0)
}

The function above works for data.frame and matrix inputs as well. It is slower than fun3() if 'df' is a data.frame, but is faster if 'df' is a matrix (which is a more efficient representation of the data if it contains only character columns).
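
A minimal sketch of the matrix route, assuming all columns are character; 'dat' and 'm' are made-up names, and no timings are claimed here. The conversion with as.matrix() is done once up front, so it stays out of whatever is being timed:

```r
# Hypothetical toy data; the last row is NA in every column.
dat <- data.frame(
  x1 = c("A", "C", NA),
  x2 = c("B", "A", NA),
  stringsAsFactors = FALSE
)

fun4 <- function(df, what)
{
  as.integer(rowSums(df == what, na.rm = TRUE) > 0)
}

m <- as.matrix(dat)          # character matrix with the same contents

res_df  <- fun4(dat, "A")    # data.frame input
res_mat <- fun4(m, "A")      # matrix input, same result
print(res_df)
print(res_mat)
```

Note that fun4 returns 0 for the all-NA row, like the %in% version. To compare speed on real data, system.time(fun4(as.matrix(bigdf), "A")) could be set against the timings above.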

A note to Ana: 'df' is the name of a function in R (see ?stats::df); not a perfect choice for a variable name.


Cheers,
Denes


Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <[email protected]> wrote:


Read the help for apply and %in%

?apply
?"%in%"
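
As a quick illustration of the MARGIN question (a made-up 3 x 2 matrix, not the data from this thread): MARGIN = 1 applies the function to each row, MARGIN = 2 to each column.

```r
# Made-up matrix: rows are (A, B), (C, D), (A, B).
m <- matrix(c("A", "B",
              "C", "D",
              "A", "B"), nrow = 3, byrow = TRUE)

by_row <- apply(m, MARGIN = 1, function(v) "A" %in% v)  # one result per row
by_col <- apply(m, MARGIN = 2, function(v) "A" %in% v)  # one result per column

print(by_row)   # length 3: is "A" anywhere in each row?
print(by_col)   # length 2: is "A" anywhere in each column?
```

So with MARGIN = 1 the result has one element per row, which is what a new column of the data frame needs.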


On 29 Jul 2019, at 22:23, Ana Marija <[email protected]> wrote:

Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should 
appear at least once per row?

On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <[email protected]> wrote:
df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })

On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <[email protected]> wrote:
sorry my bad, here is the edited version:

so the data frame is this:

df=data.frame(
   eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
   eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
   eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
   eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
   eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B"))

and I need to add a column named "case" whose values would be:
1,1,0,1,1,1,0,0,1,1

so the "case" column is 1 wherever the value "A" is found in any column.

On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[email protected]> wrote:
You may have a typo/misstatement in your question.
You define a data frame with 5 columns, each of which has 10 elements, so your 
data frame has dimensions 10 x 5.
Then you request a new COLUMN which will have only 5 elements, which is not 
allowed. All columns of a data frame
must have the same length.

On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[email protected]> wrote:
I have a data frame which looks like this:

df=data.frame(
   eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
   eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
   eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
   eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
   eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))

In reality I have many more columns, and their names don't always match
the string "eye_problemsdisorders_f6148"; there are also many more rows.

What I would like to do is create a new column, say named "case" where I
would have value "1" for every row where string "A" appears at least once
in any column, if not the value would be "0". So in the above example
column "case" would have these values: 1,1,1,1,0
Thanks
Ana

         [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

