>> dLatestVisit <- dSorted[!duplicated(dSorted$Name), ] > >I guess it is faster, but who knows?
You can find out by making a function that generates datasets of various sizes and timing the suggested algorithms. E.g., makeData <- function(nPatients, aveVisitsPerPatient, uniqueNameDate = TRUE){ nrow <- trunc(nPatients * aveVisitsPerPatient) patientNames <- paste0("P",seq_len(nPatients)) possibleDates <- as.Date(16001:17000, origin=as.Date("1970-01-01")) possibleTemps <- seq(97, 103, by=0.1) data <- data.frame(Name=sample(patientNames, replace=TRUE, size=nrow), CheckInDate=sample(possibleDates, replace=TRUE, size=nrow), Temp=sample(possibleTemps, replace=TRUE, size=nrow)) if (uniqueNameDate) { data <- data[!duplicated(data[, c("Name", "CheckInDate")]), ] } data } funs <- list( f1 = function(data) { do.call(rbind, lapply(split(data, data$Name), function(x) x[order(x$CheckInDate),][nrow(x),])) }, f2 = function (d) { isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE) dSorted <- d[order(d$Name, d$CheckInDate), ] dSorted[isEndOfRun(dSorted$Name), ] }, f3 = function (d) { # is the following how you did reverse sort on date (& fwd on name)? # Too bad that order's decreasing arg is not vectorized dSorted <- d[order(d$Name, -as.numeric(d$CheckInDate)), ] dSorted[!duplicated(dSorted$Name), ] }, f4 = function(dta) { dta %>% group_by(Name) %>% filter(CheckInDate==max(CheckInDate)) }) D <- makeData(nPatients=35000, aveVisitsPerPatient=3.7) # c. 129000 visits library(dplyr) Z <- lapply(funs, function(fun){ time <- system.time( result <- fun(D) ) ; list(time=time, result=result) }) sapply(Z, function(x)x$time) # f1 f2 f3 f4 #user.self 461.25 0.47 0.36 3.01 #sys.self 1.20 0.00 0.00 0.01 #elapsed 472.33 0.47 0.39 3.03 #user.child NA NA NA NA #sys.child NA NA NA NA # duplicated is a bit better than diff, dplyr rather slower, rbind much slower. equivResults <- function(a, b) { # results have different classes and different orders, so only check size and contents identical(dim(a),dim(b)) && all(a[order(a$Name),]==b[order(b$Name),]) } sapply(Z[-1], function(x)equivResults(x$result, Z[[1]]$result)) # f2 f3 f4 #TRUE TRUE TRUE Note that the various functions give different results if any patient comes in twice on the same day. f4 includes both visits in the ouput, the other include either the first or last (as ordered in the original file). Bill Dunlap TIBCO Software wdunlap tibco.com On Sun, Jan 25, 2015 at 1:01 AM, Göran Broström <goran.brost...@umu.se> wrote: > On 2015-01-24 01:14, William Dunlap wrote: > >> Here is one way. Sort the data.frame, first by Name then break ties with >> CheckInDate. >> Then choose the rows that are the last in a run of identical Name values. >> > > I do it by sorting by the reverse order of CheckinDate (last date first) > within Name, then > > > dLatestVisit <- dSorted[!duplicated(dSorted$Name), ] > > I guess it is faster, but who knows? > > Göran > > >> txt <- "Name CheckInDate Temp >>> >> + John 1/3/2014 97 >> + Mary 1/3/2014 98.1 >> + Sam 1/4/2014 97.5 >> + John 1/4/2014 99" >> >>> d <- read.table(header=TRUE, >>> >> colClasses=c("character","character","numeric"), text=txt) >> >>> d$CheckInDate <- as.Date(d$CheckInDate, as.Date, format="%d/%m/%Y") >>> isEndOfRun <- function(x) c(x[-1] != x[-length(x)], TRUE) >>> dSorted <- d[order(d$Name, d$CheckInDate), ] >>> dLatestVisit <- dSorted[isEndOfRun(dSorted$Name), ] >>> dLatestVisit >>> >> Name CheckInDate Temp >> 4 John 2014-04-01 99.0 >> 2 Mary 2014-03-01 98.1 >> 3 Sam 2014-04-01 97.5 >> >> >> Bill Dunlap >> TIBCO Software >> wdunlap tibco.com >> >> >> On Fri, Jan 23, 2015 at 3:43 PM, Tan, Richard <r...@panagora.com> wrote: >> >> Hi, >>> >>> Can someone help for a R question? >>> >>> I have a data set like: >>> >>> Name CheckInDate Temp >>> John 1/3/2014 97 >>> Mary 1/3/2014 98.1 >>> Sam 1/4/2014 97.5 >>> John 1/4/2014 99 >>> >>> I'd like to return a dataset that for each Name, get the row that is the >>> latest CheckInDate for that person. For the example above it would be >>> >>> Name CheckInDate Temp >>> John 1/4/2014 99 >>> Mary 1/3/2014 98.1 >>> Sam 1/4/2014 97.5 >>> >>> >>> Thank you for your help! >>> >>> Richard >>> >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide >>> http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >>> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/ >> posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >> > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/ > posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.