Hi Bert, see inline.
On 7/30/19 1:12 AM, Bert Gunter wrote:
While Eric's solution is correct (mod "corner" cases like all NA's in a row), it can be made considerably more efficient.

One minor improvement can be made by using the idiom any(x == "A") instead of matching via %in% for the simple case of matching just a single value.

However, a considerable improvement can be made by getting fancy, taking advantage of do.call() and the pmax() function to mostly vectorize the calculation.

Here are the details and timing on a large data frame. (Note: I removed the names in the %in% approach for simplicity. It has almost no effect on timings. I also moved the as.integer() call out of the function so that it is called only once at the end, which improves efficiency a bit)

1. Eric's original:

fun1 <- function(df, what)
{
   as.integer(unname(apply(df, MARGIN = 1, function(v) { what %in% v })))
}

2. Using any( x == "A") instead:

fun2 <- function(df, what)
{
   as.integer(unname(apply(df, MARGIN = 1, function(x) any(x == what, na.rm = TRUE))))
}

3. Getting fancy to use pmax()

fun3 <- function(df, what)
{
   z <- lapply(df, function(x) as.integer((x == what)))
   do.call(pmax, c(z, na.rm = TRUE))
}

Here are the timings:

bigdf <- df[rep(1:10,1e4), rep(1:5, 50)]
dim(bigdf)
[1] 100000    250

system.time(res1 <- fun1(bigdf, "A"))
   user  system elapsed
  2.204   0.432   2.637

system.time(res2 <- fun2(bigdf, "A"))
   user  system elapsed
  1.898   0.403   2.302

system.time(res3 <- fun3(bigdf, "A"))
   user  system elapsed
  0.187   0.048   0.235   ## 10 times faster!

all.equal(res1,res2)
[1] TRUE
all.equal(res1,res3)
[1] TRUE

NB: I freely admit that Eric's original solution may well be perfectly adequate, and the speed improvement is pointless. In that case, maybe this is at least somewhat instructive for someone.

Nevertheless, I would welcome further suggestions for improvement, as I suspect my "fancy" approach is still a ways from what one can do (in R code, without resorting to C++).
fun4 <- function(df, what)
{
  # compare every cell to 'what', then flag rows with at least one match
  as.integer(rowSums(df == what, na.rm = TRUE) > 0)
}
The function above works for both data.frame and matrix inputs. It is
slower than fun3() if 'df' is a data.frame, but faster if 'df' is a
matrix (which is a more efficient representation of the data if it
contains only character columns).
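
As a rough check of the above, here is a minimal sketch that runs fun4() on Ana's corrected example data, once as a data.frame and once as a character matrix, and compares it with Bert's fun3(). The names 'dat', 'dat_mat', 'bigdat' and 'bigmat' are made up for this sketch, and stringsAsFactors = FALSE is an assumption so that the columns are plain character vectors. The timing lines at the end are only there to let you measure the difference yourself; the numbers depend on the machine and R version.

```r
# Ana's corrected example data ('dat' is a name chosen for this sketch)
dat <- data.frame(
  eye_problemsdisorders_f6148_0_1 = c("A","C","D",NA,"D","A","C",NA,"B","A"),
  eye_problemsdisorders_f6148_0_2 = c("B","C",NA,"A","C","B",NA,NA,"A","D"),
  eye_problemsdisorders_f6148_0_3 = c("C","A","D","D","B","A",NA,NA,"A","B"),
  eye_problemsdisorders_f6148_0_4 = c("D","D",NA,"B","A","C",NA,"C","A","B"),
  eye_problemsdisorders_f6148_0_5 = c("C","C",NA,"D","B","C",NA,"D","D","B"),
  stringsAsFactors = FALSE
)

# Bert's pmax() version, copied from the thread (data.frame input only)
fun3 <- function(df, what) {
  z <- lapply(df, function(x) as.integer(x == what))
  do.call(pmax, c(z, na.rm = TRUE))
}

# The rowSums() version from this message
fun4 <- function(df, what) {
  as.integer(rowSums(df == what, na.rm = TRUE) > 0)
}

dat_mat <- as.matrix(dat)      # character matrix with the same content

res_df  <- fun4(dat, "A")      # data.frame input
res_mat <- fun4(dat_mat, "A")  # matrix input
print(res_df)
# → 1 1 0 1 1 1 0 0 1 1  (the values Ana asked for)

# Optional: time both functions on a larger, replicated data set
bigdat <- dat[rep(1:10, 1e3), rep(1:5, 20)]
bigmat <- as.matrix(bigdat)
print(system.time(fun3(bigdat, "A")))
print(system.time(fun4(bigdat, "A")))
print(system.time(fun4(bigmat, "A")))
```

Note that the two functions differ for a row that is entirely NA: fun4() returns 0 there, whereas pmax(..., na.rm = TRUE) returns NA. The example data has no such row, so the results agree here.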
A note to Ana: 'df' is the name of a function in R (see ?stats::df), so it is not an ideal choice for a variable name.
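
To illustrate why the name can bite, here is a small sketch. It assumes the situation where the data frame was never actually created (for example after a typo), so that R falls back to the function in the stats package; 'eye_dat' is a hypothetical replacement name.

```r
# 'df' already exists in the stats package: the density of the F distribution
print(is.function(stats::df))
# → TRUE

# If no data frame called 'df' exists, code such as df[1, ] ends up
# subsetting this function, and the resulting error is hard to interpret.
msg <- tryCatch(stats::df[1, ], error = function(e) conditionMessage(e))
print(msg)

# A name that does not mask an existing function avoids the confusion
eye_dat <- data.frame(x = c("A", NA), stringsAsFactors = FALSE)
print(is.data.frame(eye_dat))
```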
Cheers, Denes
Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <[email protected]> wrote:

Read the help for apply and %in%

?apply
?%in%

Sent from my iPhone

On 29 Jul 2019, at 22:23, Ana Marija <[email protected]> wrote:

Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?

On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <[email protected]> wrote:

df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })

On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <[email protected]> wrote:

sorry my bad, here is the edited version:

so the data frame is this:

df=data.frame(
  eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
  eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
  eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
  eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
  eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B"))

and I would need to put inside the column which would be named "case" and values inside would be: 1,1,0,1,1,1,0,0,1,1

so "case" column is where value "A" can be found in any column.

On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[email protected]> wrote:

You may have a typo/misstatement in your question. You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5. Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame must have the same length.

On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <[email protected]> wrote:

I have data frame which looks like this:

df=data.frame(
  eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
  eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
  eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
  eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
  eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))

In reality I have much more columns and they don't always match "eye_problemsdisorders_f6148" this string, and there is much more rows.

What I would like to do is create a new column, say named "case" where I would have value "1" for every row where string "A" appears at least once in any column, if not the value would be "0".

So in the above example column "case" would have these values: 1,1,1,1,0

Thanks
Ana

[[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

