Thanks for the replies and info. An attempt at fast assign is now committed to data.table v1.6.3 on R-Forge. From NEWS :

o  Fast update is now implemented, FR#200.
   DT[i,j]<-value is now handled by data.table in C rather
   than falling through to data.frame methods.

   Thanks to Ivo Welch for raising speed issues on r-devel,
   to Simon Urbanek for the suggestion, and Luke Tierney and
   Simon for information on R internals.

   [<- syntax still incurs one working copy of the whole
   table (as of R 2.13.0) due to R's [<- dispatch mechanism
   copying to `*tmp*`, so, for ultimate speed and brevity,
   'within' syntax is now available as follows.

o  A new 'within' argument has been added to [.data.table,
   by default TRUE. It is very similar to the within()
   function in base R. If an assignment appears in j, it
   assigns to the column of DT, by reference; e.g.,

       DT[i,colname<-value]

   This syntax makes no copies of any part of memory at all.

   > m = matrix(1,nrow=100000,ncol=100)
   > DF = as.data.frame(m)
   > DT = as.data.table(m)
   > system.time(for (i in 1:1000) DF[1,1] <- 3)
      user  system elapsed
   287.730 323.196 613.453
   > system.time(for (i in 1:1000) DT[1,V1 <- 3])
      user  system elapsed
     1.152   0.004   1.161     # 528 times faster

Please note :
*******************************************************
**  Within syntax is presently highly experimental.  **
*******************************************************

http://datatable.r-forge.r-project.org/


On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>
> > Interesting, and I stand corrected:
> >
> >> x = data.frame(a=1:n,b=1:n)
> >> .Internal(inspect(x))
> > @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
> >   @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> >   @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> >
> >> x[1,1]=42L
> >> .Internal(inspect(x))
> > @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
> >   @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
> >   @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> >
> >> x[[1]][1]=42L
> >> .Internal(inspect(x))
> > @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
> >   @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
> >   @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
> >
> >> x[[1]][1]=42L
> >> .Internal(inspect(x))
> > @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
> >   @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
> >   @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> >
> >
> > I have R to release ;) so I won't be looking into this right now, but it's
> > something worth investigating ... Since all the inner contents have NAMED=0
> > I would not expect any duplication to be needed, but apparently it becomes
> > so at some point ...
> >
> The internals assume in various places that deep copies are made (one
> of the reasons NAMED settings are not propagated to sub-structure).
> The main issues are avoiding cycles and that there is no easy way to
> check for sharing. There may be some circumstances in which a shallow
> copy would be OK but making sure it would be in all cases is probably
> more trouble than it is worth at this point. (I've tried this in the
> past in a few cases and always had to back off.)
>
>
> Best,
>
> luke
>
> >
> > Cheers,
> > Simon
> >
> >
> > On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
> >
> >>
> >> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
> >>> No subassignment function satisfies that condition, because you can
> >>> always call them directly. However, that doesn't stop the default method
> >>> from making that assumption, so I'm not sure it's an issue.
> >>>
> >>> David, Just to clarify - the data frame content is not copied, we are
> >>> talking about the vector holding columns.
> >>
> >> If it is just the vector holding the columns that is copied (and not the
> >> columns themselves), why does n make a difference in this test (on R
> >> 2.13.0)?
> >>
> >>> n = 1000
> >>> x = data.frame(a=1:n,b=1:n)
> >>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> >>    user  system elapsed
> >>   0.628   0.000   0.628
> >>> n = 100000
> >>> x = data.frame(a=1:n,b=1:n)   # still 2 columns, but longer columns
> >>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> >>    user  system elapsed
> >>  20.145   1.232  21.455
> >>>
> >>
> >> With $<- :
> >>
> >>> n = 1000
> >>> x = data.frame(a=1:n,b=1:n)
> >>> system.time(for (i in 1:1000) x$a[1] <- 42L)
> >>    user  system elapsed
> >>   0.304   0.000   0.307
> >>> n = 100000
> >>> x = data.frame(a=1:n,b=1:n)
> >>> system.time(for (i in 1:1000) x$a[1] <- 42L)
> >>    user  system elapsed
> >>  37.586   0.388  38.161
> >>>
> >>
> >> If it's because the 1st column needs to be copied (only) because that's
> >> the one being assigned to (in this test), that magnitude of slow down
> >> doesn't seem consistent with the time of a vector copy of the 1st
> >> column :
> >>
> >>> n=100000
> >>> v = 1:n
> >>> system.time(for (i in 1:1000) v[1] <- 42L)
> >>    user  system elapsed
> >>   0.016   0.000   0.017
> >>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
> >>    user  system elapsed
> >>   1.816   1.076   2.900
> >>
> >> Finally, increasing the number of columns, again only the 1st is
> >> assigned to :
> >>
> >>> n=100000
> >>> x = data.frame(rep(list(1:n),100))
> >>> dim(x)
> >> [1] 100000    100
> >>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> >>    user  system elapsed
> >> 167.974  50.903 219.711
> >>>
> >>
> >>
> >>
> >>>
> >>> Cheers,
> >>> Simon
> >>>
> >>> Sent from my iPhone
> >>>
> >>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsem...@comcast.net>
> >>> wrote:
> >>>
> >>>>
> >>>> On Jul 5, 2011, at 7:18 PM, <luke-tier...@uiowa.edu>
> >>>> <luke-tier...@uiowa.edu> wrote:
> >>>>
> >>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
> >>>>>
> >>>>>>
> >>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
> >>>>>>
> >>>>>>> Simon (and all),
> >>>>>>>
> >>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
> >>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
> >>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
> >>>>>>> copying x?
> >>>>>>
> >>>>>> Good point, and conceptually, no. It's a subassignment after all - see
> >>>>>> R-lang 3.4.4 - it is equivalent to
> >>>>>>
> >>>>>> `*tmp*` <- x
> >>>>>> x <- `[<-`(`*tmp*`, i, j, value)
> >>>>>> rm(`*tmp*`)
> >>>>>>
> >>>>>> so there is always a copy involved.
> >>>>>>
> >>>>>> Now, a conceptual copy doesn't mean a real copy in R since R tries to
> >>>>>> keep the pass-by-value illusion while passing references in cases
> >>>>>> where it knows that modifications cannot occur and/or they are safe.
> >>>>>> The default subassign method uses that feature which means it can
> >>>>>> afford to not duplicate if there is only one reference -- then it's
> >>>>>> safe to not duplicate as we are replacing that only existing
> >>>>>> reference. And in the case of a matrix, that will be true at the
> >>>>>> latest from the second subassignment on.
> >>>>>>
> >>>>>> Unfortunately the method dispatch (AFAICS) introduces one more
> >>>>>> reference in the dispatch chain so there will always be two references
> >>>>>> so duplication is necessary. Since we have only 0 / 1 / 2+ information
> >>>>>> on the references, we can't distinguish whether the second reference
> >>>>>> is due to the dispatch or due to the passed object having more than
> >>>>>> one reference, so we have to duplicate in any case. That is
> >>>>>> unfortunate, and I don't see a way around (unless we handle
> >>>>>> subassignment methods in some special way).
> >>>>>
> >>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
> >>>>> seems to confirm this though I don't guarantee I did that right). The
> >>>>> issue is that a replacement function implemented as a closure, which
> >>>>> is the only option for a package, will always see NAMED on the object
> >>>>> to be modified as 2 (because the value is obtained by forcing the
> >>>>> argument promise) and so any R level assignments will duplicate. This
> >>>>> also isn't really an issue of imprecise reference counting -- there
> >>>>> really are (at least) two legitimate references -- one through the
> >>>>> argument and one through the caller's environment.
> >>>>>
> >>>>> It would be good if we could come up with a way for packages to be
> >>>>> able to define replacement functions that do not duplicate in cases
> >>>>> where we really don't want them to, but this would require coming up
> >>>>> with some sort of protocol, minimally involving an efficient way to
> >>>>> detect whether a replacement function is being called in a replacement
> >>>>> context or directly.
> >>>>
> >>>> Would "$<-" always satisfy that condition? It would be a big help to me
> >>>> if it could be designed to avoid duplicating the rest of the data.frame.
> >>>>
> >>>> --
> >>>>
> >>>>>
> >>>>> There are some replacement functions that use C code to cheat, but
> >>>>> these may create problems if called directly, so I won't advertise
> >>>>> them.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> luke
> >>>>>
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Simon
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Luke Tierney
> >>>>> Statistics and Actuarial Science
> >>>>> Ralph E. Wareham Professor of Mathematical Sciences
> >>>>> University of Iowa                  Phone:             319-335-3386
> >>>>> Department of Statistics and        Fax:               319-335-3017
> >>>>>    Actuarial Science
> >>>>> 241 Schaeffer Hall                  email:      l...@stat.uiowa.edu
> >>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
> >>>>> ______________________________________________
> >>>>> R-devel@r-project.org mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>>
> >>>> David Winsemius, MD
> >>>> West Hartford, CT
> >>>>
> >>>>
> >>
> >>
> >>
> >
> >
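
P.S. For anyone following along, here is a minimal sketch (base R only) of the `*tmp*` expansion that Simon quotes from R-lang 3.4.4 earlier in this thread. The names x1 and x2 and the small n are illustrative choices, not taken from the messages above. It only checks that the two forms produce the same data.frame; it says nothing about how many copies a particular R version makes along the way.

```r
# Illustrative size only; the timings in the thread used much larger n.
n <- 10L

# Route 1: ordinary subassignment syntax.
x1 <- data.frame(a = 1:n, b = 1:n)
x1[1, 1] <- 42L

# Route 2: the same update written out by hand, following the
# expansion quoted from R-lang 3.4.4.
x2 <- data.frame(a = 1:n, b = 1:n)
`*tmp*` <- x2
x2 <- `[<-`(`*tmp*`, 1, 1, value = 42L)   # dispatches to `[<-.data.frame`
rm(`*tmp*`)

# Compare the two results.
identical(x1, x2)
```

Calling `[<-` directly like this is the "called directly" case Luke mentions, which is why a replacement function cannot simply assume it is always running inside the expansion.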