Chris Wallace <[EMAIL PROTECTED]> writes:

> I am struggling with migrating some stata code to R.  I have a data
> frame containing, sometimes, repeat observations (rows) of the same
> family.  I want to keep only one observation per family, selecting
> that observation according to some other variable.  An example data
> frame is:
>
> # construct example data
> fam <- c(1,2,3,3,4,4,4)
> wt <- c(1,1,0.6,0.4,0.4,0.4,0.2)
> keep <- c(1,1,1,0,1,0,0)
> dat <- as.data.frame(cbind(fam,wt,keep))
> dat
>
> I want to keep the observation for which wt is a maximum, and where
> this doesn't identify a unique observation, to keep just one anyway,
> not caring which.  Those observations are indicated above by keep==1.
> (Note, keep <- c(1,1,1,0,0,1,0) would be fine too, but not
> c(1,1,1,0,0,0,1)).
>
> The stata code I would use is
>   bys fam (wt): keep if _n==_N
>
> This is my (long-winded) attempt in R:
>
> # first keep those rows where wt=max_fam(wt)
> maxwt <- by(dat,dat$fam,function(x) max(x[,2]))
> maxwt <- sapply(maxwt,"[[",1)
> maxwt.dat <- data.frame("maxwt"=maxwt,"fam"=as.integer(names(maxwt)))
> dat <- merge(dat,maxwt.dat)
> dat <- dat[dat$wt==dat$maxwt,]
> dat
>
> Now I am stuck - I want to keep either row with fam==4, and have tried
> playing around with combinations of sample and apply or by, but with
> no success.  I can only find an inefficient for-loop solution:
>
> # identify those rows with >1 observation
> more <- by(dat,dat$fam,function(x) dim(x)[1])
> more <- sapply(more,"[[",1)
> more.dat <- data.frame("more"=more,"fam"=as.integer(names(more)))
> dat <- merge(dat,more.dat)
>
> # sample from those for whom more>1
> result<-dat[dat$more==1,]
> for(f in unique(dat$fam[dat$more>1])) {
>   rows <- rownames(dat[dat$fam==f,])
>   result <- rbind(result,dat[sample(rows,1),])
> }
> result
>
> I am sure that for something so simple in stata to be so complicated
> in R must indicate ignorance of R on my part, but searches of help
> files and RSiteSearch hasn't led to any better solution.
>
> Any suggestions would be most helpful!  Thanks, C.

How about

  unsplit(lapply(split(dat,dat$fam),
                 function(x) seq(length=nrow(x)) == which.max(x$wt)),
          dat$fam)

or

  do.call("rbind", lapply(split(dat,dat$fam),
                          function(x) x[which.max(x$wt),]))

or (same thing, basically)

  do.call("rbind", by(dat,dat$fam,function(x) x[which.max(x$wt),]))

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45) 35327907

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
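
For anyone who wants to try the suggestions in the reply on the example data from the question, here is a minimal sketch. The names sel, res1 and res2 are only illustrative and do not appear in the thread. Note that the unsplit() form yields a logical vector in the original row order, so it has to be used as a row index, whereas the do.call("rbind", ...) form returns the reduced data frame directly. Both rely on which.max() returning the first maximum within each family, which is what settles ties such as the one in fam==4.

```r
# Example data from the question
dat <- data.frame(
  fam  = c(1, 2, 3, 3, 4, 4, 4),
  wt   = c(1, 1, 0.6, 0.4, 0.4, 0.4, 0.2),
  keep = c(1, 1, 1, 0, 1, 0, 0)
)

# First suggestion: build a logical vector, one entry per original row,
# that is TRUE only at the first maximum of wt within each family.
sel <- unsplit(
  lapply(split(dat, dat$fam),
         function(x) seq_len(nrow(x)) == which.max(x$wt)),
  dat$fam
)
res1 <- dat[sel, ]   # selects original rows 1, 2, 3 and 5

# Second suggestion: take the max-wt row of each family and stack the pieces.
res2 <- do.call("rbind",
                lapply(split(dat, dat$fam),
                       function(x) x[which.max(x$wt), ]))

res1
res2
```

The two results contain the same observations; they differ only in row names, since rbind() labels the rows by family while the logical index keeps the original row numbers.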