Robert,
it's not the handling of row names per se that causes the slowdown. My point
was that if all you need is a matrix-like structure with different column
types, you may want to use lists instead, and for equal column types you're
better off with a matrix.
But to address your point, one of the reasons for subassignments on data frames
being slow is that they need extra copies of the data frame for method
dispatch. Data frames are lists of column vectors, so the penalty is worse with
increasing number of columns. Rows play no significant (additional) role,
because those are simply operations on the column vectors (they need to be
copied on modification in any case).
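To make the "list of column vectors" point concrete, here is a minimal sketch. The object names d and cols are only for illustration. It shows that a data frame is a plain list with one element per column, plus class and row-name attributes:

```r
# A small data frame; 'd' is a hypothetical name used only for this sketch.
d <- as.data.frame(matrix(0, ncol = 3, nrow = 5))

is.list(d)             # a data frame is a list underneath
length(d)              # one list element per column

cols <- unclass(d)     # drop the class attribute: what remains is a plain list
class(cols)            # no longer a data.frame
is.numeric(cols[[1]])  # each element is an ordinary column vector
length(cols[[1]])      # with one entry per row
```

Any subassignment that goes through the data frame method has to work on that outer list, which is why the number of columns matters more than the number of rows.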
In practice it would not matter as much unless users do stupid things like
the example loop. In that case the list holding the columns is copied twice for
every single value of i, which is deadly. Obviously the sensible thing to do,
m[1:1000,1] <- 1, does not have that issue.
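As a rough sketch of the difference, the two forms below produce the same first column, but the loop calls the data frame assignment method once per row while the vectorized form calls it once. The sizes are kept small so it runs instantly, and the names n, looped and vectorized are just for this illustration:

```r
# Small sizes so the sketch runs instantly; 'n', 'looped' and 'vectorized'
# are illustrative names, not part of the original example.
n <- 100
m <- as.data.frame(matrix(0, ncol = 10, nrow = n))

looped <- m
for (i in seq_len(n))
  looped[i, 1] <- 1              # one `[<-.data.frame` call per row

vectorized <- m
vectorized[seq_len(n), 1] <- 1   # a single `[<-.data.frame` call

# Both approaches fill the first column with the same values.
identical(looped[[1]], vectorized[[1]])
```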
So to illustrate part of the data.frame penalty effect consider simply falling
back to lists in the assignment:
> example2 <- function(m){
+   for(i in 1:1000)
+     m[[1]][i] <- 1
+ }
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example2(m) )
   user  system elapsed
 44.359  13.608  58.011
> ### using a list is very fast as illustrated before:
> m <- as.list(as.data.frame(matrix(0, ncol=1000, nrow=1000)))
> system.time( example2(m) )
   user  system elapsed
   0.01    0.00    0.01
> ### now try to fall back to a list for each iteration (part of what the data
> ### frames have to do):
> example3 <- function(m){
+   for(i in 1:1000) {
+     oc <- class(m)
+     class(m) <- NULL
+     m[[1]][i] <- 1
+     class(m) <- oc
+   }
+ }
> system.time( example3(m) )
   user  system elapsed
 19.080   2.251  21.335
So just the simple fact that you unclass and re-class the object gives you a
large part of the penalty that data.frames incur (about 21 s versus 58 s
elapsed above), even if you're dealing with a list. Add the additional logic
that data frames have to go through and you have the full picture.
So, as I was saying earlier, if you want to loop subassignments over many
elements: don't do that in the first place, but if you do, use lists or
matrices, NOT data frames.
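If you really must loop, one possible pattern is to do the loop on a plain list and convert back once at the end. This is only a sketch: set_first_column is a hypothetical helper, not something in base R, and it assumes you don't need to keep custom row names, since as.list() drops them:

```r
# Hypothetical helper (not from base R): run the element-wise loop on a plain
# list of columns and rebuild the data frame once at the end.
set_first_column <- function(df, value) {
  cols <- as.list(df)              # plain list of column vectors, no class
  for (i in seq_len(nrow(df)))
    cols[[1]][i] <- value          # cheap list/vector subassignment
  as.data.frame(cols)              # single conversion back; row names reset
}

m <- as.data.frame(matrix(0, ncol = 1000, nrow = 1000))
m2 <- set_first_column(m, 1)
```

The conversion cost is paid once instead of once per element, which is the whole point.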
Cheers,
Simon
On Jul 3, 2011, at 8:13 AM, Robert Stojnic wrote:
>
> Hi Simon,
>
> On 03/07/11 05:30, Simon Urbanek wrote:
>> This is just a quick, incomplete response, but the main misconception is
>> really the use of data.frames. If you don't use the elaborate mechanics of
>> data frames that involve the management of row names, then they are
>> definitely the wrong tool to use, because most of the overhead is exactly to
>> manage the row names and you pay a substantial penalty for that. Just drop
>> that one feature and you get timings similar to a matrix:
>
> I tried to find some documentation on why there needs to be extra row names
> handling when one is just assigning values into the column of a data frame,
> but couldn't find any. For a while I stared at the code of `[<-.data.frame`
> but couldn't figure it out myself. Can you please summarise what exactly is
> going on when one does m[1, 1] <- 1 where m is a data frame?
>
> I found that the performance is significantly different with different number
> of columns. For instance
>
> # reassign first column to 1
> example <- function(m){
> for(i in 1:1000)
> m[i,1] <- 1
> }
>
> m <- as.data.frame(matrix(0, ncol=2, nrow=1000))
> system.time( example(m) )
>
> user system elapsed
> 0.164 0.000 0.163
>
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example(m) )
>
> user system elapsed
> 34.634 0.004 34.765
>
> When m is a matrix, both run well under 0.1s.
>
> Increasing the number of rows (but not the number of iterations) leads to
> some increase in time, but not as drastic as when increasing column number.
> Using m[[y]][x] in this case doesn't help either.
>
> example2 <- function(m){
> for(i in 1:1000)
> m[[1]][i] <- 1
> }
>
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example2(m) )
>
> user system elapsed
> 36.007 0.148 36.233
>
>
> r.
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel