Am I correct in believing that one cannot match on multiple columns? One can indeed subset on multiple criteria from different variables (or columns) but not from unique combinations thereof. I need to exclude about 10000 rows from 108000 rows of data based on several unique combinations of identifiers in two columns. Only merge() seems to be able to do that. Merge would allow me to positively select but it would not allow me to deselect (or exclude). Look at how I got around the problem. It is inelegant. Have a missed a more direct function?
x<-seq(from=1,to=10) y <-rep(1:2,5) data<-data.frame(cbind(x,y)) x<-seq(from=1,to=10) y <-rep(1:5,2) data1<-data.frame(cbind(x,y)) newdata<-rbind(data,data1) newdata[10,2] <- 3 exclude<-newdata[18:20,] #This is a simulation of real life problem #We now have two dataframes. Newdata is the data set from an experiment. In real #life it has an additional 9 columns of data #exclude is a dataframe that was manually created after discovering some quality #control issues #Any row in newdata that matches any row in exclude must be discarded match(newdata$x,exclude$x)# useless because it is only on one column match(newdata$y,exclude$y)# useless because it is only on one column newdata$x %in% exclude$x # useless because it is only on one column newdata$y %in% exclude$y # useless because it is only on one column newdata$x %in% exclude$x & newdata$y %in% exclude$y # useless, #eventhough it is using both columns, because it is not #using them in a synchronous manner. Row 10 in new data should not have been #marked "TRUE" #It was only labeled such because the 10 in the x column is indeed in the exclude x #column and the 3 in the y column is indeed in the exclude y column but not #together which(newdata$x %in% exclude$x & newdata$y %in% exclude$y)#also gets it #wrong match(newdata,exclude)# intuitively this could have worked but alas match can #only handle vectors and not dataframes. It cannot match on multiple columns #I have to stoop to the inelegant maneuver of creating a combined variable of the #two columns, albeit only temporarily paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":") #or one could do this match(paste(newdata$x,newdata$y,sep=":"),paste(exclude$x,exclude$y,sep=":")) #or {which(paste(newdata$x,newdata$y,sep=":") %in% paste (exclude$x,exclude$y,sep=":"))} #Or if you want to use the result in an index term or a selection argument in a subset command {paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")==FALSE} #As in {newdata[paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")==FALSE,]} #or {subset(newdata,paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")==FALSE)} -- Farrel Buchinsky ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.