Hello,
Someone pointed out to me off list about this construct:
nodup_sort <- function(x, fun = nodup3){
i <- sort.list(x)
x[i] <- fun(x[i])
x
}
which deals more efficiently with the reordering.
> x <- sample( 1:100000, size = 300000, replace = TRUE )
> system.time( nodup_cpp( x ) )
utilisateur système écoulé
0.127 0.005 0.132
> system.time( nodup_sort( x, nodup3 ) )
utilisateur système écoulé
0.287 0.009 0.295
> system.time( nodup_sort( x, nodup3a ) )
utilisateur système écoulé
0.168 0.010 0.179
> system.time( nodup_sort( x, nodup4 ) )
utilisateur système écoulé
0.157 0.005 0.163
> system.time( nodup_sort( x, nodup_cpp_assumingsorted ) )
utilisateur système écoulé
0.096 0.001 0.097
So in this example, it seems more efficient to sort first and use the
algorithm assuming that the data is sorted.
There is probably a way to be smarter in nodup_cpp where the bottleneck
is likely to be related to map::find.
Another version taking some more internally :
nodup_cpp_hybrid <- cxxfunction( signature( x_ = "numeric", sort_ =
"integer" ), '
NumericVector input(x_) ;
NumericVector x = clone<NumericVector>( x_) ;
IntegerVector sort( sort_ ) ;
int n = x.size() ;
double current, previous = input[ sort[0] - 1 ] ;
double index = 0.0 ;
int si ;
for( int i=1; i<n; i++){
si = sort[i] - 1;
current = input[ si ] ;
if( current == previous ){
index += .01 ;
x[ si ] = current + index ;
} else {
index = 0.0 ;
}
previous = current ;
}
return x ;
', plugin = "Rcpp" )
no big difference:
> system.time( res6 <- nodup_cpp_hybrid( x, sort.list(x) ) )
utilisateur système écoulé
0.092 0.000 0.092
Profiling reveals this:
> Rprof()
> for(i in 1:100) { res6 <- ( nodup_cpp_hybrid( x, sort.list(x) ) ) }
> Rprof(NULL)
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"sort.list" 6.50 90.03 6.50 90.03
".Call" 0.42 5.82 0.42 5.82
"file.exists" 0.30 4.16 0.30 4.16
$by.total
total.time total.pct self.time self.pct
"nodup_cpp_hybrid" 7.22 100.00 0.00 0.00
"sort.list" 6.50 90.03 6.50 90.03
".Call" 0.42 5.82 0.42 5.82
"file.exists" 0.30 4.16 0.30 4.16
$sample.interval
[1] 0.02
$sampling.time
[1] 7.22
The 4.16 % taken by file.exists indicates that someone in the inline
project has to do some work (on my TODO list).
But otherwise sort.list dominates the time.
Romain
Le 26/11/10 21:22, Romain Francois a écrit :
Le 26/11/10 21:13, Romain Francois a écrit :
Hello,
Can we really make the assumption that the data is sorted. The original
example was not:
I am working on a function to make a duplicated value unique. For
example,
the original vector would be like : a = c(2,1,1,3,3,3,4)
If we can make the assumption, here is a C++ based version:
nodup_cpp_assumingsorted <- cxxfunction( signature( x_ = "numeric" ), '
// since we modify x, we need to make a copy
NumericVector x = clone<NumericVector>(x_);
int n = x.size() ;
double current, previous = x[0] ;
int index ;
for( int i=1; i<n; i++){
current = x[i] ;
if( current == previous ){
x[i] = current + (++index) / 100.0 ;
} else {
index = 0 ;
}
previous = current ;
}
return x ;
', plugin = "Rcpp" )
with these results:
> x <- sort( sample( 1:100000, size = 300000, replace = TRUE ) )
> system.time( nodup3( x ) )
utilisateur système écoulé
0.090 0.004 0.094
> system.time( nodup3a( x ) )
utilisateur système écoulé
0.036 0.005 0.040
> system.time( nodup4( x ) )
utilisateur système écoulé
0.025 0.004 0.029
> system.time( nodup_cpp_assumingsorted( x) )
utilisateur système écoulé
0.003 0.001 0.004
Now, if we don't make the assumption that the data is sorted, here is
another C++ based version:
require( inline )
require( Rcpp )
nodup_cpp <- cxxfunction( signature( x_ = "numeric" ), '
// since we modify x, we need to make a copy
NumericVector x = clone<NumericVector>(x_);
typedef std::map<double,int> imap ;
typedef imap::value_type pair ;
imap index ;
int n = x.size() ;
double current, previous = x[0] ;
index.insert( pair( previous, 0 ) );
imap::iterator it = index.begin() ;
for( int i=1; i<n; i++){
current = x[i] ;
if( current == previous ){
x[i] = current + ( ++(it->second) / 100.0 ) ;
} else {
it = index.find(current) ;
if( it == index.end() ){
it = index.insert(
current > previous ? it : index.begin(),
pair( current, 0 )
) ;
} else {
x[i] = current + ( ++(it->second) / 100.0 ) ;
}
previous = current ;
}
}
return x ;
', plugin = "Rcpp" )
which gives me this :
> x <- sample( 1:100000, size = 300000, replace = TRUE )
>
> system.time( nodup_cpp( x ) )
utilisateur système écoulé
0.111 0.002 0.113
> system.time( nodup3( sort( x ) ) )
utilisateur système écoulé
0.162 0.011 0.172
> system.time( nodup3a( sort( x ) ) )
utilisateur système écoulé
0.099 0.009 0.109
> system.time( nodup4( sort( x ) ) )
utilisateur système écoulé
0.089 0.004 0.094
so nodup4 is still faster, but the values are not in the right order:
> x <- c( 2, 1, 1, 2 )
> nodup4( sort( x ) )
[1] 1.00 1.01 2.00 2.01
> nodup_cpp( x )
[1] 2.00 1.00 1.01 2.01
Romain
I think this gives a more fair comparison :
> system.time( nodup_cpp( x ) )
utilisateur système écoulé
0.113 0.002 0.114
> system.time( { oo <- order(order(x)) ; nodup3( sort( x ) )[oo] } )
utilisateur système écoulé
0.336 0.012 0.347
> system.time( { oo <- order(order(x)) ; nodup3a( sort( x ) )[oo] } )
utilisateur système écoulé
0.251 0.011 0.262
> system.time( { oo <- order(order(x)) ; nodup4( sort( x ) )[oo] } )
utilisateur système écoulé
0.287 0.006 0.294
Romain
Le 26/11/10 20:01, William Dunlap a écrit :
-----Original Message-----
From: William Dunlap
Sent: Thursday, November 25, 2010 9:31 AM
To: 'randomcz'; r-help@r-project.org
Subject: RE: [R] help: program efficiency
If the input vector t is known to be ordered
(or if you only care about runs of duplicated
values, not all duplicated values) the following
is pretty quick
nodup3<- function (t) {
t + (sequence(rle(t)$lengths) - 1)/100
}
If you don't know if the the input will be ordered
then ave() will do it a bit faster than your
code
nodup2<- function (t) {
ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
}
E.g., for a sorted sequence of 300,000 numbers drawn with
replacement from 1:100,000 I get:
a2<- sort(sample(1:1e5, size=3e5, replace=TRUE))
system.time(v<- nodup(a2))
user system elapsed
2.78 0.05 3.97
system.time(v2<- nodup2(a2))
user system elapsed
1.83 0.02 2.66
system.time(v3<- nodup3(a2))
user system elapsed
0.18 0.00 0.14
identical(v,v2)&& identical(v,v3)
[1] TRUE
If speed is truly an issue, the built-in sequence may
be replaced by a faster one that does the same thing:
nodup3a<- function (t) {
faster.sequence<- function(nvec) {
seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])),
nvec)
}
t + (faster.sequence(rle(t)$lengths) - 1)/100
}
That took 0.05 seconds on the a2 dataset and produced
identical results.
rle() computes a sort of second difference and
nodup3a computes a cumsum on that second diffence,
to get back to a first difference. The following
avoids that wasted operation (along with rle's
computation of the values component of its output).
nodup4<- function(t) {
n<- length(t)
p<- c(0L, which(t[-1L] != t[-n]), n)
t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
}
That reduced nodup3a's time by about 30% on that dataset.
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of randomcz
Sent: Thursday, November 25, 2010 6:49 AM
To: r-help@r-project.org
Subject: [R] help: program efficiency
hey guys,
I am working on a function to make a duplicated value unique.
For example,
the original vector would be like : a = c(2,1,1,3,3,3,4)
I'll like to transform it into:
a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
basically, find the duplicates and assign a unique value by
adding a small
amount and keep it in order.
I come up with the following codes, but it runs slow if t is
large. Is there
a better way to do it?
nodup = function(t)
{
t.index=0
t.dup=duplicated(t)
for (i in 2:length(t))
{
if (t.dup[i]==T)
t.index=t.index+0.01
else t.index=0
t[i]=t[i]+t.index
}
return(t)
}
--
View this message in context:
http://r.789695.n4.nabble.com/help-program-efficiency-tp305907
9p3059079.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Romain Francois
Professional R Enthusiast
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
|- http://bit.ly/9VOd3l : ZAT! 2010
|- http://bit.ly/c6DzuX : Impressionnism with R
`- http://bit.ly/czHPM7 : Rcpp Google tech talk on youtube
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.