Re: [R] Find "undirected" duplicates in a tibble

Greg Minshall Sat, 21 Aug 2021 03:07:25 -0700

Bert,

> Turns out that there's an even faster alternative.


hah, but i'm *not* surprised.  thanks for sharing the (current)
low-price leader!

and, thanks, again, to Kimmo for posting such a productive question!

cheers, Greg

----
my apply
   user  system elapsed 
  8.465   0.103   8.578 
Bert's !duplicated
   user  system elapsed 
  2.397   0.000   2.399 
Bert's x[,2]>x[,1]
   user  system elapsed 
  1.068   0.000   1.069 
Bert's table()-based   user  system elapsed 
  0.235   0.000   0.235 
my a.d.f(unique(cbind(do.call)))
   user  system elapsed 
  4.470   0.000   4.475 
Eric Berger's unique(...pmin...pmax)
   user  system elapsed 
  0.820   0.017   0.837 
Eric Berger's transmuting tibble...
   user  system elapsed 
  0.936   0.000   0.938 
Kimmo Elo's [OP] mutating paste
   user  system elapsed 
 50.769   0.000  50.821 
Rui Barradas' sort-based
   user  system elapsed 
 42.001   0.053  42.112 
----

----
n <- 1000
x <- expand.grid(Source = 1:n, Target = 1:n)

cat("my apply\n")
system.time({
 y <- apply(x, 1, function(y) return (c(A=min(y), B=max(y))))
 unique(t(y))})
#   user  system elapsed
#  5.075   0.034   5.109

cat("Bert's !duplicated\n")
system.time({
 x[!duplicated(cbind(do.call(pmin, x), do.call(pmax, x))), ]
 })
#   user  system elapsed
#  1.340   0.013   1.353

# Still more efficient and still returning a data frame is:
cat("Bert's x[,2]>x[,1]\n")
system.time({
 w <- x[, 2] > x[,1]
 x[w, ] <- x[w, 2:1]
 unique(x)})
#   user  system elapsed
#  0.693   0.011   0.703

cat("Bert's table()-based")
f <- function(x){
   w <- x[,2] > x[,1]
   x[w, ] <- x[w, 2:1]
   x$counts <- as.vector(table(x)) ## drop the dim
   x[x$counts>0, ]
}
system.time({
  f(x)
})

cat("my a.d.f(unique(cbind(do.call)))\n")
system.time({
  as.data.frame(unique(cbind(A=do.call(pmin,x), B=do.call(pmax,x))))
})

cat("Eric Berger's unique(...pmin...pmax)\n")
system.time({
  unique(data.frame(V1=pmin(x$Source,x$Target), V2=pmax(x$Source,x$Target)))
})

cat("Eric Berger's transmuting tibble...\n")
require(dplyr)
xt<-tibble(x)
system.time({
  xt %>% transmute( a=pmin(Source,Target), b=pmax(Source,Target)) %>%
    unique() %>% rename(Source=a, Target=b)
})

cat("Kimmo Elo's [OP] mutating paste\n")
system.time({
  xt %>%
    mutate(pair=mapply(function(x,y)
      paste0(sort(c(x,y)),collapse="-"), Source, Target)) %>%
    distinct(pair,
             .keep_all = T) %>%
    mutate(Source=sapply(pair, function(x)
      unlist(strsplit(x, split="-"))[1]), Target=sapply(pair, function(x)
        unlist(strsplit(x, split="-"))[2])) %>%
    select(-pair)
})

cat("Rui Barradas' sort-based\n")
system.time({
  apply(x, 1, sort) |> t() |> unique()
})

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Find "undirected" duplicates in a tibble

Reply via email to