Thanks, Dénes. Good points. Bert
On Mon, Jul 29, 2019 at 4:33 PM Dénes Tóth <[email protected]> wrote: > > Hi Bert, > > see inline. > > On 7/30/19 1:12 AM, Bert Gunter wrote: > > While Eric's solution is correct( mod "corner" cases like all NA's in > > a row), it can be made considerably more efficient. > > > > One minor improvement can be made by using the idiom > > any(x == "A") > > instead of matching via %in% for the simple case of matching just a > > single value. > > > > However, a considerable improvement can be made by getting fancy, > > taking advantage of do.call() and the pmax() function to mostly > > vectorize the calculation. Here are the details and timing on a large > > data frame. > > > > (Note: I removed the names in the %in% approach for simplicity. It has > > almost no effect on timings. > > I also moved the as.integer() call out of the function so that it is > > called only once at the end, which improves efficiency a bit) > > > > 1. Eric's original: > > fun1 <-function(df,what) > > { > > as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v }))) > > } > > > > 2. Using any( x == "A") instead: > > fun2 <- function(df,what) > > { > > as.integer(unname(apply(df,MARGIN =1, function(x)any(x == what, > > na.rm=TRUE)))) > > } > > > > 3. Getting fancy to use pmax() > > fun3 <- function(df,what) > > { > > z <- lapply(df,function(x)as.integer((x==what))) > > do.call(pmax,c(z,na.rm=TRUE)) > > } > > > > Here are the timings: > > > >> bigdf <- df[rep(1:10,1e4), rep(1:5, 50)] > >> dim(bigdf) > > [1] 100000 250 > > > >> system.time(res1 <- fun1(bigdf, "A")) > > user system elapsed > > 2.204 0.432 2.637 > >> > >> system.time(res2 <- fun2(bigdf, "A")) > > user system elapsed > > 1.898 0.403 2.302 > >> > >> system.time(res3 <- fun3(bigdf, "A")) > > user system elapsed > > 0.187 0.048 0.235 > > > > ## 10 times faster! > > > >> > >> all.equal(res1,res2) > > [1] TRUE > >> all.equal(res1,res3) > > [1] TRUE > > > > > > NB: I freely admit that Eric's original solution may well be perfectly > > adequate, and the speed improvement is pointless. In that case, maybe > > this is at least somewhat instructive for someone. > > > > Nevertheless, I would welcome further suggestions for improvement, as > > I suspect my "fancy" approach is still a ways from what one can do (in > > R code, without resorting to C++). > > fun4 <- function(df, what) > { > as.integer(rowSums(df == what, na.rm = TRUE) > 0) > } > > The function above works for data.frame and matrix inputs as well. It is > slower than fun3() if 'df' is a data.frame, but is faster if 'df' is a > matrix (which is a more efficient representation of the data if it > contains only character columns). > > A note to Ana: 'df' is the name of a function in R (see ?stats::df); not > a perfect choice for a variable name. > > Cheers, > Denes > > > > > > Cheers, > > Bert > > > > > > Bert Gunter > > > > "The trouble with having an open mind is that people keep coming along > > and sticking things into it." > > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > > Bert Gunter > > > > "The trouble with having an open mind is that people keep coming along > > and sticking things into it." > > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > > > > On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <[email protected]> wrote: > >> > >> Read the help for apply and %in% > >> > >> ?apply > >> ?%in% > >> > >> > >> Sent from my iPhone > >> > >>> On 29 Jul 2019, at 22:23, Ana Marija <[email protected]> wrote: > >>> > >>> Thank you so much! Just to confirm here MARGIN=1 indicates that "A" > >>> should appear at least once per row? > >>> > >>>> On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <[email protected]> > >>>> wrote: > >>>> df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) }) > >>>> > >>>> > >>>>> On Mon, Jul 29, 2019 at 9:02 PM Ana Marija > >>>>> <[email protected]> wrote: > >>>>> sorry my bad, here is the edited version: > >>>>> > >>>>> so the data frame is this: > >>>>> > >>>>> df=data.frame( > >>>>> > >>>>> eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"), > >>>>> > >>>>> eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"), > >>>>> > >>>>> eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"), > >>>>> > >>>>> eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"), > >>>>> > >>>>> eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B") > >>>>> > >>>>> and I would need to put inside the column which would be named "case" > >>>>> and values inside would be: 1,1,0,1,1,1,0,0,1,1 > >>>>> > >>>>> so "case" column is where value "A" can be found in any column. > >>>>> > >>>>>> On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <[email protected]> > >>>>>> wrote: > >>>>>> You may have a typo/misstatement in your question. > >>>>>> You define a data frame with 5 columns, each of which has 10 elements, > >>>>>> so your data frame has dimensions 10 x 5. > >>>>>> Then you request a new COLUMN which will have only 5 elements, which > >>>>>> is not allowed. All columns of a data frame > >>>>>> must have the same length. > >>>>>> > >>>>>>> On Mon, Jul 29, 2019 at 8:42 PM Ana Marija > >>>>>>> <[email protected]> wrote: > >>>>>>> I have data frame which looks like this: > >>>>>>> > >>>>>>> df=data.frame( > >>>>>>> eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A), > >>>>>>> eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D), > >>>>>>> eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B), > >>>>>>> eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B), > >>>>>>> eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B)) > >>>>>>> > >>>>>>> In reality I have much more columns and they don't always match > >>>>>>> "eye_problemsdisorders_f6148" this string, and there is much more > >>>>>>> rows. > >>>>>>> > >>>>>>> What I would like to do is create a new column, say named "case" > >>>>>>> where I > >>>>>>> would have value "1" for every row where string "A" appears at least > >>>>>>> once > >>>>>>> in any column, if not the value would be "0". So in the above example > >>>>>>> column "case" would have these values: 1,1,1,1,0 > >>>>>>> Thanks > >>>>>>> Ana > >>>>>>> > >>>>>>> [[alternative HTML version deleted]] > >>>>>>> > >>>>>>> ______________________________________________ > >>>>>>> [email protected] mailing list -- To UNSUBSCRIBE and more, see > >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help > >>>>>>> PLEASE do read the posting guide > >>>>>>> http://www.R-project.org/posting-guide.html > >>>>>>> and provide commented, minimal, self-contained, reproducible code. > >> > >> [[alternative HTML version deleted]] > >> > >> ______________________________________________ > >> [email protected] mailing list -- To UNSUBSCRIBE and more, see > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide > >> http://www.R-project.org/posting-guide.html > >> and provide commented, minimal, self-contained, reproducible code. > > > > ______________________________________________ > > [email protected] mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ [email protected] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

