Simon (and all), I've tried to make assignment as fast as calling `[<-.data.table` directly, for user convenience. Profiling shows (IIUC) that it isn't dispatch, but x being copied. Is there a way to prevent '[<-' from copying x? Small reproducible example in vanilla R 2.13.0 :
> x = list(a=1:10000,b=1:10000) > class(x) = "newclass" > "[<-.newclass" = function(x,i,j,value) x # i.e. do nothing > tracemem(x) [1] "<0xa1ec758>" > x[1,2] = 42L tracemem[0xa1ec758 -> 0xa1ec558]: # but, x is still copied, why? > I've tried returning NULL from [<-.newclass but then x gets assigned NULL : > "[<-.newclass" = function(x,i,j,value) NULL > x[1,2] = 42L tracemem[0xa1ec558 -> 0x9c5f318]: > x NULL > Any pointers much appreciated. If that copy is preventable it should save the user needing to use `[<-.data.table`(...) syntax to get the best speed (20 times faster on the small example used so far). Matthew On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote: > Simon, > > Thanks for the great suggestion. I've written a skeleton assignment > function for data.table which incurs no copies, which works for this > case. For completeness, if I understand correctly, this is for : > i) convenience of new users who don't know how to vectorize yet > ii) more complex examples which can't be vectorized. > > Before: > > > system.time(for (r in 1:R) DT[r,20] <- 1.0) > user system elapsed > 12.792 0.488 13.340 > > After : > > > system.time(for (r in 1:R) DT[r,20] <- 1.0) > user system elapsed > 2.908 0.020 2.935 > > Where this can be reduced further as follows : > > > system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0)) > user system elapsed > 0.132 0.000 0.131 > > > > Still working on it. When it doesn't break other data.table tests, I'll > commit to R-Forge ... > > Matthew > > > On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote: > > Timothée, > > > > On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote: > > > > > Hi -- > > > > > > It's my first post on this list; as a relatively new user with little > > > knowledge of R internals, I am a bit intimidated by the depth of some > > > of the discussions here, so please spare me if I say something > > > incredibly silly. > > > > > > I feel that someone at this point should mention Matthew Dowle's > > > excellent data.table package > > > (http://cran.r-project.org/web/packages/data.table/index.html) which > > > seems to me to address many of the inefficiencies of data.frame. > > > data.tables have no row names; and operations that only need data from > > > one or two columns are (I believe) just as quick whether the total > > > number of columns is 5 or 1000. This results in very quick operations > > > (and, often, elegant code as well). > > > > > > > I agree that data.table is a very good alternative (for other reasons) that > > should be promoted more. The only slight snag is that it doesn't help with > > the issue at hand since it simply does a pass-though for subassignments to > > data frame's methods and thus suffers from the same problems (in fact there > > is a rather stark asymmetry in how it handles subsetting vs subassignment - > > which is a bit surprising [if I read the code correctly you can't use the > > same indexing in both]). In fact I would propose that it should not do that > > but handle the simple cases itself more efficiently without unneeded > > copies. That would make it indeed a very interesting alternative. > > > > Cheers, > > Simon > > > > > > > > > > On Mon, Jul 4, 2011 at 6:19 AM, ivo welch <ivo.we...@gmail.com> wrote: > > >> thank you, simon. this was very interesting indeed. I also now > > >> understand how far out of my depth I am here. > > >> > > >> fortunately, as an end user, obviously, *I* now know how to avoid the > > >> problem. I particularly like the as.list() transformation and back to > > >> as.data.frame() to speed things up without loss of (much) > > >> functionality. > > >> > > >> > > >> more broadly, I view the avoidance of individual access through the > > >> use of apply and vector operations as a mixed "IQ test" and "knowledge > > >> test" (which I often fail). However, even for the most clever, there > > >> are also situations where the KISS programming principle makes > > >> explicit loops still preferable. Personally, I would have preferred > > >> it if R had, in its standard "statistical data set" data structure, > > >> foregone the row names feature in exchange for retaining fast direct > > >> access. R could have reserved its current implementation "with row > > >> names but slow access" for a less common (possibly pseudo-inheriting) > > >> data structure. > > >> > > >> > > >> If end users commonly do iterations over a data frame, which I would > > >> guess to be the case, then the impression of R by (novice) end users > > >> could be greatly enhanced if the extreme penalties could be eliminated > > >> or at least flagged. For example, I wonder if modest special internal > > >> code could store data frames internally and transparently as lists of > > >> vectors UNTIL a row name is assigned to. Easier and uglier, a simple > > >> but specific warning message could be issued with a suggestion if > > >> there is an individual read/write into a data frame ("Warning: data > > >> frames are much slower than lists of vectors for individual element > > >> access"). > > >> > > >> > > >> I would also suggest changing the "Introduction to R" 6.3 from "A > > >> data frame may for many purposes be regarded as a matrix with columns > > >> possibly of differing modes and attributes. It may be displayed in > > >> matrix form, and its rows and columns extracted using matrix indexing > > >> conventions." to "A data frame may for many purposes be regarded as a > > >> matrix with columns possibly of differing modes and attributes. It may > > >> be displayed in matrix form, and its rows and columns extracted using > > >> matrix indexing conventions. However, data frames can be much slower > > >> than matrices or even lists of vectors (which, like data frames, can > > >> contain different types of columns) when individual elements need to > > >> be accessed." Reading about it immediately upon introduction could > > >> flag the problem in a more visible manner. > > >> > > >> > > >> regards, > > >> > > >> /iaw > > >> > > >> ______________________________________________ > > >> R-devel@r-project.org mailing list > > >> https://stat.ethz.ch/mailman/listinfo/r-devel > > >> > > > > > > ______________________________________________ > > > R-devel@r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > > > > > > > > _______________________________________________ > > datatable-help mailing list > > datatable-h...@lists.r-forge.r-project.org > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > _______________________________________________ > datatable-help mailing list > datatable-h...@lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel