Let me repeat: what is happening for me in the equivalent of your
35.589 - 1.18 seconds is that R is waiting for my OS to read its discs
(and they can be heard chuntering away). As the R process is not
running at those times, the profiler is not running either (on a
Unix-alike: on Windows the profiler does measure elapsed time). I
expect it will be the same explanation for you.
What I have already suggested is that if you want to save time, do not
read and check the package.rds files. As far as I can see they were
checked at installation in any recent version of R. Just check their
existence.
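A minimal sketch of that idea (this is not the actual .packages() code, just an illustration of replacing the read-and-check step with a bare existence test):

fast_packages <- function(lib.loc = .libPaths()) {
    ans <- character(0)
    for (lib in lib.loc) {
        for (pkg in list.files(lib)) {
            ## trust the metadata written at install time: only test that
            ## Meta/package.rds exists, rather than opening it with
            ## gzfile()/.readRDS() for every package
            if (file.exists(file.path(lib, pkg, "Meta", "package.rds")))
                ans <- c(ans, pkg)
        }
    }
    unique(ans)
}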
On Tue, 3 Mar 2009, Romain Francois wrote:
Prof Brian Ripley wrote:
On Tue, 3 Mar 2009, Romain Francois wrote:
Prof Brian Ripley wrote:
The caching is in the disc system: you need to find and read the package
metadata for every package. AFAIK it is not easy to flush the disc
cache, but quite easy to overwrite it with later reads. (Google for more
info.)
Thanks for the info, I'll try to find my way with these directions.
If you are not concerned about validity of the installed packages you
could skip the tests and hence the reads.
Your times are quite a bit slower than mine, so a faster disc system
might help. Since my server has just been rebooted (for a new kernel),
with all of CRAN and most of BioC I get
system.time( packs <- .packages( all = T ) )
user system elapsed
0.518 0.262 25.042
system.time( packs <- .packages( all = T ) )
user system elapsed
0.442 0.080 0.522
length(packs)
[1] 2096
There's a similar issue when installing packages: the Perl code reads the
indices from every visible package to resolve links, and that can be slow
the first time.
On Tue, 3 Mar 2009, Romain Francois wrote:
Hello,
The first time in a session I call .packages( all.available = T ), it
takes a long time (I have many packages installed from CRAN):
system.time( packs <- .packages( all = T ) )
user system elapsed
0.738 0.276 43.787
When I call it again, the time is much reduced, so there must be
some caching somewhere. I would like to reduce the time it takes
the first time, but I have not been able to identify where the caching
takes place, and so how I can remove it in order to measure the running
time without the caching. Without this, I have to restart my computer
each time to clear the cache before testing a new version of the function
(and that is not going to happen).
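(For what it's worth, on a Linux system with root access the OS page cache can usually be dropped between runs without a full reboot; this is an assumption about the OS, not something R itself provides:

## assumes a Linux kernel and sudo rights: flush dirty pages to disc,
## then drop the page cache, dentries and inodes
system("sync; echo 3 | sudo tee /proc/sys/vm/drop_caches")

After that, the next .packages(all.available = TRUE) call should again hit the disc cold.)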
Here is the .packages function. I am suspicious about this part: "ans
<- c(ans, nam)", which grows the ans vector each time a suitable package
is found; this does not sound right.
It's OK as there are only going to be ca 2000 packages. Try profiling
this: .readRDS and grepl take most of the time.
I usually do not trust the result of the profiler when a for loop is
involved, as it tends to miss the point (or maybe I am the one missing it).
Here are the data for the actual example (repeated for this message):
Rprof()
system.time( packs <- .packages( all = T ) )
user system elapsed
0.447 0.078 0.525
Rprof(NULL)
summaryRprof()
$by.self
self.time self.pct total.time total.pct
"grepl" 0.18 34.6 0.18 34.6
".readRDS" 0.12 23.1 0.20 38.5
".packages" 0.08 15.4 0.50 96.2
"close.connection" 0.04 7.7 0.04 7.7
"close" 0.02 3.8 0.06 11.5
"file.exists" 0.02 3.8 0.02 3.8
"gc" 0.02 3.8 0.02 3.8
"gzfile" 0.02 3.8 0.02 3.8
"list" 0.02 3.8 0.02 3.8
"system.time" 0.00 0.0 0.52 100.0
"file.path" 0.00 0.0 0.02 3.8
$by.total
total.time total.pct self.time self.pct
"system.time" 0.52 100.0 0.00 0.0
".packages" 0.50 96.2 0.08 15.4
".readRDS" 0.20 38.5 0.12 23.1
"grepl" 0.18 34.6 0.18 34.6
"close" 0.06 11.5 0.02 3.8
"close.connection" 0.04 7.7 0.04 7.7
"file.exists" 0.02 3.8 0.02 3.8
"gc" 0.02 3.8 0.02 3.8
"gzfile" 0.02 3.8 0.02 3.8
"list" 0.02 3.8 0.02 3.8
"file.path" 0.02 3.8 0.00 0.0
$sampling.time
[1] 0.52
There is little time unaccounted for, and 0.38 sec is going into .readRDS
and grepl. Whereas
system.time({
    ## grow the result vector one element at a time, as .packages() does
    ans <- character(0)
    for (i in 1:2096) ans <- c(ans, "foo")
})
takes 0.024 secs, negligible here (one profiler tick).
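For comparison, a preallocated version (a sketch of the obvious alternative to growing the vector) avoids the repeated copying entirely, although at these sizes the difference is immaterial:

system.time({
    ## preallocate the full result instead of growing it with c()
    ans <- character(2096)
    for (i in 1:2096) ans[i] <- "foo"
})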
Here is what happens to me if I restart the computer:
Rprof( )
system.time( packs <- .packages( all = T ) )
user system elapsed
0.888 0.342 35.589
Rprof(NULL)
summaryRprof()
$by.self
self.time self.pct total.time total.pct
".readRDS" 0.34 28.8 0.64 54.2
".packages" 0.14 11.9 1.16 98.3
"file.exists" 0.14 11.9 0.14 11.9
"gzfile" 0.12 10.2 0.16 13.6
"close" 0.10 8.5 0.14 11.9
"grepl" 0.08 6.8 0.10 8.5
"$" 0.08 6.8 0.08 6.8
"file.path" 0.06 5.1 0.06 5.1
"close.connection" 0.04 3.4 0.04 3.4
"getOption" 0.02 1.7 0.04 3.4
"as.character" 0.02 1.7 0.02 1.7
"gc" 0.02 1.7 0.02 1.7
"options" 0.02 1.7 0.02 1.7
"system.time" 0.00 0.0 1.18 100.0
$by.total
total.time total.pct self.time self.pct
"system.time" 1.18 100.0 0.00 0.0
".packages" 1.16 98.3 0.14 11.9
".readRDS" 0.64 54.2 0.34 28.8
"gzfile" 0.16 13.6 0.12 10.2
"file.exists" 0.14 11.9 0.14 11.9
"close" 0.14 11.9 0.10 8.5
"grepl" 0.10 8.5 0.08 6.8
"$" 0.08 6.8 0.08 6.8
"file.path" 0.06 5.1 0.06 5.1
"close.connection" 0.04 3.4 0.04 3.4
"getOption" 0.04 3.4 0.02 1.7
"as.character" 0.02 1.7 0.02 1.7
"gc" 0.02 1.7 0.02 1.7
"options" 0.02 1.7 0.02 1.7
$sampling.time
[1] 1.18
I'd like to know what is happening in the 35.589 - 1.18 seconds, and the
profiler won't tell me.
About the time spent in grepl: we could cut this down by calling it once
on the whole vector instead of once per package:
system.time( grepl( valid_package_version_regexp, versions ) )
user system elapsed
0.003 0.000 0.009
system.time( for(v in versions) grepl( valid_package_version_regexp, v ) )
user system elapsed
0.100 0.000 0.136
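Inside .packages() that would mean collecting the version strings first and validating them in one vectorised call, along these lines (a sketch only; here nam and versions stand for the collected package names and their Version fields, not the per-package variables used in the current loop):

## hypothetical refactoring: one vectorised grepl() over all the
## collected Version strings instead of one call per package
ok  <- grepl(valid_package_version_regexp, versions)
ans <- nam[ok]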
Consider this script below,
Whether profiling works in other examples is beside the point here.
I'll make another thread on that.
--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel