On Tue, Mar 13, 2012 at 08:56:33PM -0700, mdvaan wrote:
> Hi,
>
> I have data on individuals (B) who participated in events (A). If ALL
> participants in an event are a subset of the participants in another event I
> would like to remove the smaller event and if the participants in one event
> are exactly similar to the participants in another event I would like to
> remove one of the events (I don't care which one). The following example
> does that however it is extremely slow (and the true dataset is very large).
> What would be a more efficient way to solve the problem? I really appreciate
> your help. Thanks!
>
> DF <- data.frame(read.table(textConnection(" A B
> 12095 69832
> 12095 51750
> 12095 6734
...
Hi.

If many events are eliminated, then the following may be faster,
since eliminated events are dropped before any further
comparisons take place.
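
# One sorted, duplicate-free vector of participants per event,
# in the order in which the events first appear in DF.
data <- unique(DF$A)
gr <- split(DF$B, f=factor(DF$A, levels=data))
gr <- lapply(gr, FUN=sort)
gr <- lapply(gr, FUN=unique)

# accept[i] is TRUE while event i is kept.
accept <- rep(FALSE, times=length(gr))
accept[1] <- TRUE
for (i in seq.int(from=2, length.out=length(accept)-1)) {
    cand <- gr[[i]]
    OK <- TRUE
    # Reject the candidate if it is contained in (or equal to)
    # an already accepted event.
    for (j in which(accept)) {
        prev <- gr[[j]]
        both <- unique(sort(c(cand, prev)))
        if (identical(prev, both)) {
            OK <- FALSE
            break
        }
    }
    if (OK) {
        # Drop every accepted event that is contained in the candidate.
        for (j in which(accept)) {
            prev <- gr[[j]]
            both <- unique(sort(c(cand, prev)))
            if (identical(cand, both)) {
                accept[j] <- FALSE
            }
        }
        accept[i] <- TRUE
    }
}

# Keep only the rows that belong to accepted events.
DF2 <- DF[DF$A %in% data[accept], ]
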
Can you afford to compute table(DF$A, DF$B) for the real data?
Its size will be proportional to length(unique(DF$A))*length(unique(DF$B)).
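
If the table does fit in memory, one possible way to use it is sketched
below. This is only an illustration under assumptions, not a tested
solution for the real data: the toy DF is made up, and the names M,
common, size, is_sub and drop are hypothetical. The idea is that event i
is contained in event j exactly when the number of individuals they
share equals the number of individuals in event i.

```r
# Made-up toy data: event 2 is a subset of event 1, event 3 equals event 1.
DF <- data.frame(
    A = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
    B = c(10, 20, 30, 10, 20, 10, 20, 30, 40, 50)
)

# 0/1 incidence matrix: one row per event, one column per individual.
M <- (unclass(table(DF$A, DF$B)) > 0) * 1

# common[i, j] = number of individuals shared by events i and j.
common <- tcrossprod(M)
size <- diag(common)

# is_sub[i, j] is TRUE when all participants of event i are in event j
# (size is recycled down the columns, so row i is compared with size[i]).
is_sub <- common == size
diag(is_sub) <- FALSE

# Drop proper subsets, and for identical events drop all but the first.
proper <- is_sub & !t(is_sub)
dup_of_earlier <- is_sub & t(is_sub) & (col(is_sub) < row(is_sub))
drop <- rowSums(proper | dup_of_earlier) > 0

keep_events <- rownames(M)[!drop]
DF2 <- DF[as.character(DF$A) %in% keep_events, ]
```

Note that common has length(unique(DF$A))^2 entries, so this needs
memory for that matrix in addition to the table itself.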
Petr Savicky.