Matthew,

I was hoping I misunderstood your first proposal, but I suspect I did not ;).

Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to
evaluate to
{ V1 <- 3; DT[1, V1] }
thus returning the first element of the third column.

I do understand that within(foo, expr, ...) was the motivation for passing
expressions, but unlike within() the subsetting operator [ is not expected
to take an expression as its second argument. Such abuse is quite
unexpected and I would say dangerous.

That said, I don't think it works, either. Taking your example and
data.table from R-Forge:

> m = matrix(1,nrow=100000,ncol=100)
> DF = as.data.frame(m)
> DT = as.data.table(m)
> for (i in 1:1000) DT[1,V1 <- 3]
> DT[1,]
     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
[1,]  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1   1   1

As you can see, DT is not modified.

Also I suspect there is something quite amiss because even trivial things
don't work:

> DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
> DT[1:4,1:4]
[1] 1 2 3 4

When I first saw your proposal, I thought you rather had something like
within(DT, V1[1] <- 3)
in mind, which looks innocent enough but performs terribly (note that I had
to scale down the loop by a factor of 100!!!):

> system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed
  2.701   4.437   7.138

With the for loop moved inside, something like
within(DT, for (i in 1:1000) V1[i] <- 3)
performs reasonably:

> system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed
  0.392   0.613   1.003

(Note: system.time() can be misleading when within() is involved, because
the expression is evaluated in a different environment so within() won't
actually change the object in the global environment - it also interacts
with the possible duplication)

Cheers,
Simon


On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

> Thanks for the replies and info. An attempt at fast
> assign is now committed to data.table v1.6.3 on
> R-Forge. From NEWS :
>
> o Fast update is now implemented, FR#200.
>    DT[i,j]<-value is now handled by data.table in C rather
>    than falling through to data.frame methods.
>
>    Thanks to Ivo Welch for raising speed issues on r-devel,
>    to Simon Urbanek for the suggestion, and Luke Tierney and
>    Simon for information on R internals.
>
>    [<- syntax still incurs one working copy of the whole
>    table (as of R 2.13.0) due to R's [<- dispatch mechanism
>    copying to `*tmp*`, so, for ultimate speed and brevity,
>    'within' syntax is now available as follows.
>
> o A new 'within' argument has been added to [.data.table,
>    by default TRUE. It is very similar to the within()
>    function in base R. If an assignment appears in j, it
>    assigns to the column of DT, by reference; e.g.,
>
>        DT[i,colname<-value]
>
>    This syntax makes no copies of any part of memory at all.
>
>> m = matrix(1,nrow=100000,ncol=100)
>> DF = as.data.frame(m)
>> DT = as.data.table(m)
>> system.time(for (i in 1:1000) DF[1,1] <- 3)
>    user  system elapsed
> 287.730 323.196 613.453
>> system.time(for (i in 1:1000) DT[1,V1 <- 3])
>    user  system elapsed
>   1.152   0.004   1.161    # 528 times faster
>
> Please note :
>
> *******************************************************
> **  Within syntax is presently highly experimental.  **
> *******************************************************
>
> http://datatable.r-forge.r-project.org/
>
>
> On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
>> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>>
>>> Interesting, and I stand corrected:
>>>
>>>> x = data.frame(a=1:n,b=1:n)
>>>> .Internal(inspect(x))
>>> @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>   @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>   @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>
>>>> x[1,1]=42L
>>>> .Internal(inspect(x))
>>> @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>   @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>   @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>
>>>> x[[1]][1]=42L
>>>> .Internal(inspect(x))
>>> @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
>>>   @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>   @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
>>>
>>>> x[[1]][1]=42L
>>>> .Internal(inspect(x))
>>> @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>   @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>   @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>
>>>
>>> I have R to release ;) so I won't be looking into this right now, but it's
>>> something worth investigating ... Since all the inner contents have NAMED=0
>>> I would not expect any duplication to be needed, but apparently it becomes
>>> so at some point ...
>>
>>
>> The internals assume in various places that deep copies are made (one
>> of the reasons NAMED settings are not propagated to sub-structure).
>> The main issues are avoiding cycles and that there is no easy way to
>> check for sharing. There may be some circumstances in which a shallow
>> copy would be OK but making sure it would be in all cases is probably
>> more trouble than it is worth at this point. (I've tried this in the
>> past in a few cases and always had to back off.)
>>
>>
>> Best,
>>
>> luke
>>
>>>
>>> Cheers,
>>> Simon
>>>
>>>
>>> On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>>>
>>>>
>>>> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>>>>> No subassignment function satisfies that condition, because you can
>>>>> always call them directly. However, that doesn't stop the default method
>>>>> from making that assumption, so I'm not sure it's an issue.
>>>>>
>>>>> David, Just to clarify - the data frame content is not copied, we are
>>>>> talking about the vector holding columns.
>>>>
>>>> If it is just the vector holding the columns that is copied (and not the
>>>> columns themselves), why does n make a difference in this test (on R
>>>> 2.13.0)?
>>>>
>>>>> n = 1000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>    user  system elapsed
>>>>   0.628   0.000   0.628
>>>>> n = 100000
>>>>> x = data.frame(a=1:n,b=1:n)   # still 2 columns, but longer columns
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>    user  system elapsed
>>>>  20.145   1.232  21.455
>>>>>
>>>>
>>>> With $<- :
>>>>
>>>>> n = 1000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>    user  system elapsed
>>>>   0.304   0.000   0.307
>>>>> n = 100000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>    user  system elapsed
>>>>  37.586   0.388  38.161
>>>>>
>>>>
>>>> If it's because the 1st column needs to be copied (only) because that's
>>>> the one being assigned to (in this test), that magnitude of slow down
>>>> doesn't seem consistent with the time of a vector copy of the 1st
>>>> column :
>>>>
>>>>> n=100000
>>>>> v = 1:n
>>>>> system.time(for (i in 1:1000) v[1] <- 42L)
>>>>    user  system elapsed
>>>>   0.016   0.000   0.017
>>>>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>>>>    user  system elapsed
>>>>   1.816   1.076   2.900
>>>>
>>>> Finally, increasing the number of columns, again only the 1st is
>>>> assigned to :
>>>>
>>>>> n=100000
>>>>> x = data.frame(rep(list(1:n),100))
>>>>> dim(x)
>>>> [1] 100000    100
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>    user  system elapsed
>>>> 167.974  50.903 219.711
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsem...@comcast.net>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Jul 5, 2011, at 7:18 PM, <luke-tier...@uiowa.edu>
>>>>>> <luke-tier...@uiowa.edu> wrote:
>>>>>>
>>>>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>>>>
>>>>>>>>> Simon (and all),
>>>>>>>>>
>>>>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>>>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>>>>>>>> copying x?
>>>>>>>>
>>>>>>>> Good point, and conceptually, no. It's a subassignment after all - see
>>>>>>>> R-lang 3.4.4 - it is equivalent to
>>>>>>>>
>>>>>>>> `*tmp*` <- x
>>>>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>>>>> rm(`*tmp*`)
>>>>>>>>
>>>>>>>> so there is always a copy involved.
>>>>>>>>
>>>>>>>> Now, a conceptual copy doesn't mean a real copy in R since R tries to
>>>>>>>> keep the pass-by-value illusion while passing references in cases
>>>>>>>> where it knows that modifications cannot occur and/or they are safe.
>>>>>>>> The default subassign method uses that feature which means it can
>>>>>>>> afford to not duplicate if there is only one reference -- then it's
>>>>>>>> safe to not duplicate as we are replacing that only existing
>>>>>>>> reference. And in the case of a matrix, that will be true at the
>>>>>>>> latest from the second subassignment on.
>>>>>>>>
>>>>>>>> Unfortunately the method dispatch (AFAICS) introduces one more
>>>>>>>> reference in the dispatch chain so there will always be two references
>>>>>>>> so duplication is necessary. Since we have only 0 / 1 / 2+ information
>>>>>>>> on the references, we can't distinguish whether the second reference
>>>>>>>> is due to the dispatch or due to the passed object having more than
>>>>>>>> one reference, so we have to duplicate in any case. That is
>>>>>>>> unfortunate, and I don't see a way around (unless we handle
>>>>>>>> subassignment methods in some special way).
>>>>>>>
>>>>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>>>>> seems to confirm this though I don't guarantee I did that right).
>>>>>>> The issue is that a replacement function implemented as a closure,
>>>>>>> which is the only option for a package, will always see NAMED on the
>>>>>>> object to be modified as 2 (because the value is obtained by forcing
>>>>>>> the argument promise) and so any R level assignments will duplicate.
>>>>>>> This also isn't really an issue of imprecise reference counting --
>>>>>>> there really are (at least) two legitimate references -- one through
>>>>>>> the argument and one through the caller's environment.
>>>>>>>
>>>>>>> It would be good if we could come up with a way for packages to be
>>>>>>> able to define replacement functions that do not duplicate in cases
>>>>>>> where we really don't want them to, but this would require coming up
>>>>>>> with some sort of protocol, minimally involving an efficient way to
>>>>>>> detect whether a replacement function is being called in a replacement
>>>>>>> context or directly.
>>>>>>
>>>>>> Would "$<-" always satisfy that condition? It would be a big help to me
>>>>>> if it could be designed to avoid duplicating the rest of the data.frame.
>>>>>>
>>>>>> --
>>>>>>
>>>>>>>
>>>>>>> There are some replacement functions that use C code to cheat, but
>>>>>>> these may create problems if called directly, so I won't advertise
>>>>>>> them.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> luke
>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Simon
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Luke Tierney
>>>>>>> Statistics and Actuarial Science
>>>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>>>> University of Iowa                  Phone:  319-335-3386
>>>>>>> Department of Statistics and        Fax:    319-335-3017
>>>>>>>    Actuarial Science
>>>>>>> 241 Schaeffer Hall                  email:  l...@stat.uiowa.edu
>>>>>>> Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
>>>>>>> ______________________________________________
>>>>>>> R-devel@r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>
>>>>>> David Winsemius, MD
>>>>>> West Hartford, CT
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>> Luke Tierney
>> Statistics and Actuarial Science
>> Ralph E. Wareham Professor of Mathematical Sciences
>> University of Iowa                  Phone:  319-335-3386
>> Department of Statistics and        Fax:    319-335-3017
>>    Actuarial Science
>> 241 Schaeffer Hall                  email:  l...@stat.uiowa.edu
>> Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
>
>
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
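
The `*tmp*` expansion quoted above from R-lang 3.4.4 can be tried out with a minimal base R sketch. It assumes nothing beyond base R (no data.table), and the object names x, y, z and w are made up for illustration; they do not come from the thread. It only checks that the replacement syntax and the hand-written expansion give the same result, and that calling the replacement function directly leaves its argument untouched. It does not measure copying.

```r
# Illustrative sketch only: base R, hypothetical object names.
x <- data.frame(a = 1:5, b = 1:5)
y <- x

# Ordinary replacement syntax.
x[1, 1] <- 42L

# The same assignment written out as the `*tmp*` expansion.
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 1, 1, value = 42L)
rm(`*tmp*`)

# Both routes produce the same data frame.
stopifnot(identical(x, y))

# Calling the replacement function directly returns a modified copy;
# the object passed in keeps its old value.
z <- data.frame(a = 1:5, b = 1:5)
w <- `[<-`(z, 1, 1, value = 42L)
stopifnot(z$a[1] == 1L, w$a[1] == 42L)
```

The second half is the "called directly" case mentioned in the quoted discussion: a replacement function cannot tell from its arguments alone whether its result will overwrite the original object.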