Hi, Encouraged by a tip from Simon Urbanek I tried to use the S3 machinery to write a faster version of the data.frame class. This quickly hits a snag: the "[.default"(x, i) for some reason cares about the dimensionality of x. In the end there is a full transcript of my R session. It includes the motivation for writing the class and the problems I have encountered.
As a result I see three issues here: * why "[.default"(x, i) doesn't work if dim(x) is 2? After all a single subscript into a vector works regardless of whether it's a matrix or not. Is there an alternative way to access "[.default"? * why does unclass() make deep copy? This is a facet of the global over-conservatism of R with respect to copying. * is it possible to add some sort copy profiling to R? Something like copyProfiling(TRUE), which should cause R to log sizes of each copied object (just raw sizes w/o any attempt to identify the object). This feature should at least help assess the magnitude of the problem. Thanks, Vadim Now the transcript itself: > # the motivation: subscription of a data.frame is *much* (almost 20 times) slower than that of a list > # compare > n = 1e6 > i = seq(n) > > x = data.frame(a=seq(n), b=seq(n)) > system.time(x[i,], gcFirst=TRUE) [1] 1.01 0.14 1.14 0.00 0.00 > > x = list(a=seq(n), b=seq(n)) > system.time(lapply(x, function(col) col[i]), gcFirst=TRUE) [1] 0.06 0.00 0.06 0.00 0.00 > > > # the solution: define methods for the light-weight data.frame class > lwdf = function(...) structure(list(...), class = "lwdf") > > # dim > dim.lwdf = function(x) c(length(x[[1]]), length(x)) > > # for pretty printing we define print.lwdf via a conversion to data.frame > # as.data.frame.lwdf > as.data.frame.lwdf = function(x) structure(unclass(x), class="data.frame", row.names=as.character(seq(nrow(x)))) > > # print > print.lwdf = function(x) print.data.frame(as.data.frame.lwdf(x)) > > # now the real stuff > > # "[" > # the naive "[.lwdf" = function (x, i, j) lapply(x[j], function(col) col[i]) > # won't work because evaluation of x[j] calls "[.lwdf" again and not "[.default" > # so we switch by the number of arguments > "[.lwdf" = function (x, i, j) { + if (nargs() == 2) + NextMethod("[", x, i) + else + structure(lapply(x[j], function(col) col[i]), class = "lwdf") + } > > x = lwdf(a=seq(3), b=letters[seq(3)], c=as.factor(letters[seq(3)])) > i = c(1,3); j = c(1,3) > > # unfortunately, for some reasons "[.default" cares about dimensionality of its argument > x[i,j] Error in "[.default"(x, j) : incorrect number of dimensions > > > # we could use unclass to get it right > "[.lwdf" = function (x, i, j) { + structure(lapply(unclass(x)[j], function(col) col[i]), class = "lwdf") + } > > x[i,j] a c 1 1 a 2 3 c > > # *but* unclass creates a deep copy of its argument as indirectly evidenced by the following timing > x = lwdf(a=seq(1e6)); system.time(unclass(x)) [1] 0.01 0.00 0.01 0.00 0.00 > x = lwdf(a=seq(1e8)); system.time(unclass(x)) [1] 0.44 0.39 0.82 0.00 0.00 > version _ platform x86_64-unknown-linux-gnu arch x86_64 os linux-gnu system x86_64, linux-gnu status major 2 minor 0.1 year 2004 month 11 day 15 language R ______________________________________________ R-devel@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-devel