Robert,

it's not the handling of row names per se that causes the slowdown, but my 
point was that if all you need is a matrix-like structure with different 
column types, you may want to use lists instead, and for equal column types 
you're better off with a matrix.

But to address your point, one of the reasons for subassignments on data frames 
being slow is that they need extra copies of the data frame for method 
dispatch. Data frames are lists of column vectors, so the penalty is worse with 
increasing number of columns. Rows play no significant (additional) role, 
because those are simply operations on the column vectors (they need to be 
copied on modification in any case).
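
To make the "list of column vectors" point concrete, here is a small sketch. 
The data frame df and its column names are made up for illustration; the only 
things assumed from R itself are the base S3 methods for data frame 
subassignment.

```r
# Hypothetical two-column data frame with different column types.
df <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

is.list(df)            # a data frame is a list underneath
length(df)             # its length is the number of columns

cols <- unclass(df)    # drop the class: a plain list of column vectors
class(cols)            # "list"
sapply(cols, class)    # each column keeps its own type

# Subassignment on a data frame goes through S3 methods defined in base R:
is.function(getS3method("[<-", "data.frame"))
is.function(getS3method("[[<-", "data.frame"))
```

Every assignment into df has to pass through one of those methods, whereas 
the same assignment into cols is handled as ordinary list subassignment.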

In practice it would not matter as much unless the users do stupid things like 
the example loop. In that case the list holding the columns is copied twice for 
every single value of i, which is deadly. Obviously the sensible thing to do, 
m[1:1000,1] <- 1, does not have that issue.
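
A minimal sketch of the two variants side by side. The object names m_loop and 
m_vec are made up for illustration, and the data frame is kept small so that 
it runs quickly; the absolute timings will depend on your R version and 
machine.

```r
m <- as.data.frame(matrix(0, ncol = 10, nrow = 1000))

# Element by element: one data frame subassignment per value of i.
m_loop <- m
for (i in 1:1000)
    m_loop[i, 1] <- 1

# Vectorized: a single data frame subassignment for the whole column.
m_vec <- m
m_vec[1:1000, 1] <- 1

# Both approaches fill the first column with the same values.
identical(m_loop[[1]], m_vec[[1]])
```

Wrapping each variant in system.time() on a data frame with many columns 
should show the difference between 1000 subassignments and one.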

So to illustrate part of the data.frame penalty effect, consider simply falling 
back to lists in the assignment:

> example2 <- function(m){
+    for(i in 1:1000)
+        m[[1]][i] <- 1
+ }
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example2(m) )
   user  system elapsed 
 44.359  13.608  58.011 

> ### using a list is very fast as illustrated before:
> m <- as.list(as.data.frame(matrix(0, ncol=1000, nrow=1000)))
> system.time( example2(m) )
   user  system elapsed 
   0.01    0.00    0.01 

> ### now try to fall back to a list for each iteration (part of what the data 
> ### frames have to do):
> example3 <- function(m){
+    for(i in 1:1000) {
+        oc <- class(m)
+        class(m) <- NULL
+        m[[1]][i] <- 1
+        class(m) <- oc
+    }
+ }
> system.time( example3(m) )
   user  system elapsed 
 19.080   2.251  21.335 

So just the simple fact that you unclass and re-class the object gives you half 
of the penalty that data.frames incur even if you're dealing with a list. Add 
the additional logic that data frames have to go through and you have the full 
picture.

So, as I was saying earlier, if you want to loop subassignments over many 
elements: don't do that in the first place, but if you do, use lists or 
matrices, NOT data frames.
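
If you still need a data frame at the end, one way to follow that advice is to 
convert once, loop over the plain list, and convert back once. This is only a 
sketch: the names cols and m2 are made up for illustration, and it assumes the 
loop does not need any data frame features (row names and the like) while it 
runs.

```r
m <- as.data.frame(matrix(0, ncol = 10, nrow = 1000))

cols <- as.list(m)           # plain list of column vectors, no data.frame class

for (i in 1:1000)
    cols[[1]][i] <- 1        # ordinary list/vector subassignment

m2 <- as.data.frame(cols)    # rebuild the data frame once, after the loop
```

With equal column types, as.matrix(m) before the loop would serve the same 
purpose.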

Cheers,
Simon


On Jul 3, 2011, at 8:13 AM, Robert Stojnic wrote:

> 
> Hi Simon,
> 
> On 03/07/11 05:30, Simon Urbanek wrote:
>> This is just a quick, incomplete response, but the main misconception is 
>> really the use of data.frames. If you don't use the elaborate mechanics of 
>> data frames that involve the management of row names, then they are 
>> definitely the wrong tool to use, because most of the overhead is exactly to 
>> manage the row names and you pay a substantial penalty for that. Just drop 
>> that one feature and you get timings similar to a matrix:
> 
> I tried to find some documentation on why there needs to be extra row names 
> handling when one is just assigning values into the column of a data frame, 
> but couldn't find any. For a while I stared at the code of `[<-.data.frame` 
> but couldn't figure it out myself. Can you please summarise what exactly is 
> going on when one does m[1, 1] <- 1 where m is a data frame?
> 
> I found that the performance is significantly different with different number 
> of columns. For instance
> 
> # reassign first column to 1
> example <- function(m){
>    for(i in 1:1000)
>        m[i,1] <- 1
> }
> 
> m <- as.data.frame(matrix(0, ncol=2, nrow=1000))
> system.time( example(m) )
> 
>   user  system elapsed
>  0.164   0.000   0.163
> 
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example(m) )
> 
>   user  system elapsed
> 34.634   0.004  34.765
> 
> When m is a matrix, both run well under 0.1s.
> 
> Increasing the number of rows (but not the number of iterations) leads to 
> some increase in time, but not as drastic as when increasing column number. 
> Using m[[y]][x] in this case doesn't help either.
> 
> example2 <- function(m){
>    for(i in 1:1000)
>        m[[1]][i] <- 1
> }
> 
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example2(m) )
> 
>   user  system elapsed
> 36.007   0.148  36.233
> 
> 
> r.
> 
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> 

