Prof Brian Ripley wrote:
The caching is in the disc system: you need to find and read the package metadata for every package. AFAIK it is not easy to flush the disc cache, but quite easy to overwrite it with later reads. (Google for more info.)
Thanks for the info, I'll try to find my way with these directions.

If you are not concerned about validity of the installed packages you could skip the tests and hence the reads.
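A minimal sketch of that idea (a hypothetical helper, not part of base R): enumerate installed packages by directory listing alone, skipping the per-package metadata reads that .packages(all.available = TRUE) performs. This trades validity checking for speed; any directory containing a DESCRIPTION file is merely assumed to be a valid package.

```r
# Hypothetical fast listing: no metadata is read or validated,
# so broken or partially-installed packages are not detected.
fast_packages <- function(libs = .libPaths()) {
  unique(unlist(lapply(libs, function(lib) {
    pkgs <- list.files(lib)
    # keep only directories that at least contain a DESCRIPTION file
    pkgs[file.exists(file.path(lib, pkgs, "DESCRIPTION"))]
  })))
}
```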

Your times are quite a bit slower than mine, so a faster disc system might help. Since my server has just been rebooted (for a new kernel), with all of CRAN and most of BioC I get

system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.518   0.262  25.042
system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.442   0.080   0.522
length(packs)
[1] 2096

There's a similar issue when installing packages: the Perl code reads the indices from every visible package to resolve links, and that can be slow the first time.


On Tue, 3 Mar 2009, Romain Francois wrote:

Hello,

The first time in a session I call .packages( all.available = T ), it takes a long time (I have many packages installed from CRAN):

system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.738   0.276  43.787

When I call it again, the time is much reduced, so there must be some caching somewhere. I would like to reduce the time it takes the first time, but I have not been able to identify where the caching takes place, and so how I can disable it to test the running time without the caching. Without this, I would have to restart my computer each time to clear the cache before testing a new version of the function (this is not going to happen).

Here is the .packages function. I am suspicious about this part: "ans <- c(ans, nam)", which grows the ans vector each time a suitable package is found; this does not sound right.

It's OK as there are only going to be ca 2000 packages. Try profiling this: .readRDS and grepl take most of the time.
I usually do not trust the result of the profiler when a for loop is involved, as it tends to miss the point (or maybe I am missing it).

Consider the script below: the profiler reports 0.22 seconds when the actual time spent is about 6 seconds, and it would blame rnorm as the bottleneck when the inefficiency is in growing the data structure.

Rprof( )
x <- numeric( )
for( i in 1:10000){
 x <- c( x, rnorm(10) )
}
Rprof( NULL )
print( summaryRprof( ) )

$ time Rscript --vanilla profexample.R
$by.self
       self.time self.pct total.time total.pct
"rnorm"      0.22      100       0.22       100

$by.total
       total.time total.pct self.time self.pct
"rnorm"       0.22       100      0.22      100

$sampling.time
[1] 0.22

real    0m6.164s
user    0m5.156s
sys     0m0.737s

$ time Rscript --vanilla -e "rnorm(10)"
[1]  0.836411851  1.762081444  1.076305644  2.063515383  0.643254750
[6]  1.698620443 -1.774479062 -0.432886214 -0.007949533  0.284089832

real    0m0.224s
user    0m0.187s
sys     0m0.024s
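For comparison, a sketch of the same computation with preallocation, assuming the final length is known in advance: this avoids the quadratic copying that growing x with c() causes, which is where most of the missing ~6 seconds goes.

```r
# Preallocate the full result once, then fill it in place.
n <- 10000L
x <- numeric(n * 10L)
for (i in seq_len(n)) {
  x[((i - 1L) * 10L + 1L):(i * 10L)] <- rnorm(10)
}
# (or simply: x <- rnorm(n * 10L), when the whole vector is one draw)
```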


Now, if I replace the for loop with a similar silly lapply construct, the profiler tells me a rather different story:

Rprof( )
x <- numeric( )
y <- lapply( 1:10000, function(i){
   x <<- c( x, rnorm(10) )
   NULL
} )
Rprof( NULL )
print( summaryRprof( ) )

$ time Rscript --vanilla prof2.R
$by.self
         self.time self.pct total.time total.pct
"FUN"         6.48     96.1       6.68      99.1
"rnorm"       0.20      3.0       0.20       3.0
"lapply"      0.06      0.9       6.74     100.0

$by.total
        total.time total.pct self.time self.pct
"lapply"       6.74     100.0      0.06      0.9
"FUN"          6.68      99.1      6.48     96.1
"rnorm"        0.20       3.0      0.20      3.0

$sampling.time
[1] 6.74

real    0m8.352s
user    0m4.762s
sys     0m2.574s
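Incidentally, a sketch that sidesteps both the incremental growth and the global assignment: let lapply build the pieces and unlist assemble them in one pass, so no vector is ever copied repeatedly.

```r
# Accumulate via lapply/unlist instead of c() inside the loop body;
# each rnorm(10) chunk is built once and concatenated a single time.
y <- unlist(lapply(1:10000, function(i) rnorm(10)))
```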

Or let us wrap the for loop of the first example in a function:

Rprof( )
x <- numeric( )
ffor <- function(){
   for( i in 1:10000){
     x <- c( x, rnorm(10) )
   }
}
ffor()
Rprof( NULL )
print( summaryRprof( ) )

$ time Rscript --vanilla prof3.R
$by.self
       self.time self.pct total.time total.pct
"ffor"        5.4     96.4        5.6     100.0
"rnorm"       0.2      3.6        0.2       3.6

$by.total
       total.time total.pct self.time self.pct
"ffor"         5.6     100.0       5.4     96.4
"rnorm"        0.2       3.6       0.2      3.6

$sampling.time
[1] 5.6

real    0m6.379s
user    0m5.408s
sys     0m0.717s



Maybe I am getting this all wrong; maybe the global assignment operator is responsible for some of the time in the second example. But how can I analyse the result of the profiler in the first example when it seems interested only in the 0.22 seconds, while I want to know what is going on with the rest of the time?

Is it possible to treat "for" as a function when writing the profiler data, so that I can trust it more?
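One later development worth noting (an assumption beyond this thread: it requires R 3.0.0 or newer, well after this exchange): Rprof gained line profiling, which attributes samples to source lines rather than only to function frames, so a top-level for loop becomes visible. A sketch:

```r
# Line profiling needs sourced code with source references kept.
tmp <- tempfile(fileext = ".R")
writeLines(c(
  "x <- numeric()",
  "for (i in 1:5000) x <- c(x, rnorm(10))"
), tmp)

Rprof(line.profiling = TRUE)
source(tmp, keep.source = TRUE)
Rprof(NULL)

# summaryRprof(lines = "show") reports time per source line,
# so the c() growth on line 2 of the script is attributed there.
print(summaryRprof(lines = "show"))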

Romain

--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
