Re: [R] Significant performance difference between split of a data.frame and split of vectors

Charles C. Berry Wed, 09 Dec 2009 13:17:01 -0800

On Wed, 9 Dec 2009, Peng Yu wrote:

On Wed, Dec 9, 2009 at 11:20 AM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:

On Wed, 9 Dec 2009, Peng Yu wrote:

On Tue, Dec 8, 2009 at 11:06 PM, David Winsemius <dwinsem...@comcast.net>
wrote:


On Dec 9, 2009, at 12:00 AM, Peng Yu wrote:

On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius
<dwinsem...@comcast.net>
wrote:


On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:

I have the following code, which tests the split on a data.frame and
the split on each column (as a vector) separately. The runtimes differ
by a factor of about 10. When m and k increase, the difference becomes
even bigger.

I'm wondering why the performance on data.frame is so bad. Is it a bug
in R? Can it be improved?


You might want to look at the data.table package. The author claims
significant speed improvements over data.frames.


This bug was found a long time ago and a package has been
developed for it. Should the fix be integrated in data.frame rather
than be implemented in an additional package?


What bug?


Is the slow speed in splitting a data.frame a performance bug?


NO!

The two computations are not equivalent.

One is a list whose elements are split vectors, and the other is a list of
data.frames containing those vectors.
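
A minimal sketch of that structural difference, assuming small made-up sizes rather than the benchmark values used elsewhere in this thread (the names by_column and by_group are only illustrative):

```r
# Small illustrative data; the sizes are arbitrary.
set.seed(0)
x <- replicate(3, rnorm(12))                 # 12 x 3 numeric matrix
f <- sample(1:4, size = 12, replace = TRUE)  # grouping variable

# Splitting each column separately: one list entry per column,
# each entry being a list of plain numeric vectors (one per group).
by_column <- lapply(seq_len(ncol(x)), function(i) split(x[, i], f))

# Splitting the data.frame: one list entry per group,
# each entry being a complete data.frame with all columns.
by_group <- split(as.data.frame(x), f)

class(by_column[[1]][[1]])   # "numeric"
class(by_group[[1]])         # "data.frame"
```

The second result carries row names, column names and a class attribute for every group, which is extra work that the per-column split never does.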


I made a comparable example below. Still, splitting a data.frame is much
slower compared with the second way that I'm showing.

If you take the trouble to assemble that list of data frames from the list
of split vectors you will see that it is very time consuming.


It is not as I show in the example below.



You are comparing creating a matrix to creating a data.frame.

system.time(

+  spl<-   mapply(
+         function(...) {
+           cbind(...)
+         }
+         , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]]
+         )
+     )
   user  system elapsed
  1.204   0.016   1.478


system.time(
+  spl<-   mapply(
+         function(...) {
+           data.frame(...)
+         }
+         , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]],SIMPLIFY=FALSE
+         )
+     )
   user  system elapsed
 56.088   0.104  56.478



If you just want a list of matrices, use

system.time(split.data.frame(x,f))

   user  system elapsed
  0.524   0.016   0.927
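
As a rough sketch of why this is cheap when x is a matrix: the call appears to amount to splitting the row indices and then subsetting the matrix once per group. The hand-written version below is an assumption about that behaviour for illustration, not a quotation of the R sources, and the sizes are small made-up values.

```r
# Small illustrative data; the sizes are arbitrary.
set.seed(0)
x <- replicate(3, rnorm(12))                 # 12 x 3 numeric matrix
f <- sample(1:4, size = 12, replace = TRUE)  # grouping variable

# Calling the data.frame method directly on a matrix gives a list of matrices.
mats <- split.data.frame(x, f)

# Hand-written equivalent (illustrative): split the row indices,
# then take the matching rows of the matrix for each group.
idx   <- split(seq_len(nrow(x)), f)
mats2 <- lapply(idx, function(i) x[i, , drop = FALSE])

is.matrix(mats[[1]])
```

Each group is produced by plain matrix indexing, so none of the data.frame subsetting machinery is involved.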

Read up on memory management issues. Think about what the computer actually
has to do in terms of memory access to split a data.frame versus split a
vector.


I'd like to read more on how R does memory management. Would you please
point me to a good source?

I see now that the timing issue was not one of memory, but of doing more work (see Rprof results below) to create a data.frame. But if you are interested you might look at

Golub, Gene H.; Van Loan, Charles F. (1996), Matrix Computations (3rd ed.), Johns Hopkins, ISBN 978-0-8018-5414-9.


and/or Google "BLAS memory"


But again, R is not user friendly. It took me quite a long time to
figure out that splitting a data.frame is a bottleneck in my program
and to reduce the problem to a test case.



See

        ?Rprof

and note where the 'self.time's are largest below (not in split or split.data.frame):

Rprof()
res <- split(as.data.frame(x),f)
Rprof(NULL)
summaryRprof()

$by.self
                        self.time self.pct total.time total.pct
"attr"                      33.66     72.9      33.66      72.9
"[.data.frame"               3.26      7.1      45.70      98.9
"inherits"                   1.52      3.3       2.06       4.5
"anyDuplicated"              1.04      2.3       1.42       3.1
"[[.data.frame"              1.00      2.2       4.76      10.3
"[["                         0.74      1.6       5.50      11.9
"match"                      0.66      1.4       2.96       6.4
"<Anonymous>"                0.66      1.4       0.72       1.6
"sys.call"                   0.46      1.0       0.46       1.0
"all"                        0.38      0.8       0.38       0.8
"anyDuplicated.default"      0.36      0.8       0.38       0.8
"%in%"                       0.32      0.7       3.26       7.1
"names"                      0.26      0.6       0.26       0.6
"is.factor"                  0.24      0.5       2.30       5.0
"length"                     0.20      0.4       0.20       0.4
"attr<-"                     0.18      0.4       0.18       0.4
"as.character"               0.16      0.3       0.16       0.3
"["                          0.14      0.3      45.84      99.2
"-"                          0.14      0.3       0.14       0.3
"!"                          0.12      0.3       0.12       0.3
".Call"                      0.12      0.3       0.12       0.3
"!="                         0.10      0.2       0.10       0.2
"vector"                     0.06      0.1       0.26       0.6
"as.data.frame.matrix"       0.06      0.1       0.08       0.2
"|"                          0.06      0.1       0.06       0.1
"lapply"                     0.04      0.1      46.12      99.8
"<"                          0.04      0.1       0.04       0.1
"any"                        0.04      0.1       0.04       0.1
"is.na"                      0.04      0.1       0.04       0.1
".subset2"                   0.04      0.1       0.04       0.1
">"                          0.02      0.0       0.02       0.0
"as.vector"                  0.02      0.0       0.02       0.0
"dim"                        0.02      0.0       0.02       0.0
"is.matrix"                  0.02      0.0       0.02       0.0
"unique.default"             0.02      0.0       0.02       0.0
"split"                      0.00      0.0      46.20     100.0
"split.data.frame"           0.00      0.0      46.12      99.8
"FUN"                        0.00      0.0      45.84      99.2
"factor"                     0.00      0.0       0.24       0.5
"is.vector"                  0.00      0.0       0.24       0.5
"split.default"              0.00      0.0       0.24       0.5
"as.data.frame"              0.00      0.0       0.08       0.2
"unique"                     0.00      0.0       0.02       0.0

[output truncated]

Chuck

I don't know how memory management is done in R, so I don't know if it
is possible to fix the problem for splitting a data.frame without
perturbing the interface of data.frame. But if the speed of splitting
a data.frame is so slow, maybe it can be forbidden and an alternative
can be documented somewhere.

---

And even if it were simply a matter of having code that is slow for some
application, that would not be a bug. Read the FAQ!


The definition of a bug in the FAQ is narrower than what I thought.
No matter what the definition of a bug is, split() on a data.frame is
a perfectly legitimate operation (in terms of an interface). A quick fix
to this problem is to at least single out the case where the argument
is a data.frame, and to do what I have been doing below. That is why
I say this is a performance bug. Similar cases, where a faster
alternative could be used but is not, are perfectly reasonable to call
bugs, at least in many other languages.

m=300000
n=6
k=30000

set.seed(0)
x=replicate(n,rnorm(m))
f=sample(1:k, size=m, replace=T)

system.time(split(as.data.frame(x),f))

  user  system elapsed
39.020   0.010  39.084


v=lapply(

+     1:dim(x)[[2]]
+     , function(i) {
+       split(x[,i],f)
+     }
+     )


system.time(lapply(

+         1:dim(x)[[2]]
+         , function(i) {
+           split(x[,i],f)
+         }
+         )
+     )
  user  system elapsed
 2.520   0.000   2.526


system.time(

+     mapply(
+         function(...) {
+           cbind(...)
+         }
+         , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]]
+         )
+     )
  user  system elapsed
 0.920   0.000   0.927

David.

system.time(split(as.data.frame(x),f))


  user  system elapsed
 1.700   0.010   1.786


system.time(lapply(


+         1:dim(x)[[2]]
+         , function(i) {
+           split(x[,i],f)
+         }
+         )
+     )
  user  system elapsed
 0.170   0.000   0.167

###########
m=30000
n=6
k=3000

set.seed(0)
x=replicate(n,rnorm(m))
f=sample(1:k, size=m, replace=T)

system.time(split(as.data.frame(x),f))

system.time(lapply(
     1:dim(x)[[2]]
     , function(i) {
       split(x[,i],f)
     }
     )
 )

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
Heritage Laboratories
West Hartford, CT



Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
