> Matthew,
>
> I was hoping I misunderstood your first proposal, but I suspect I did
> not ;).
>
> Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to
> evaluate to
>   { V1 <- 3; DT[1, V1] }
> thus returning the first element of the third column.
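For readers following along, the expectation quoted above matches how base R evaluates an assignment passed as an argument. A minimal sketch, assuming a plain data.frame rather than a data.table; the names DF, V1 and res are only for illustration and do not come from the thread:

```r
# Base R only (no data.table); DF, V1 and res are illustrative names.
DF <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# The argument `V1 <- 3` is evaluated in the caller's environment when
# `[.data.frame` forces j, so it both creates V1 and selects column 3.
res <- DF[1, V1 <- 3]

print(res)  # first element of the third column
print(V1)   # V1 was assigned in the calling environment
```

This only shows the data.frame behaviour; whether data.table should follow it is the point under discussion.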
Please see FAQ 1.1, since further below it seems to be an expectation issue
about 'with' syntax, too.

> That said, I don't think it works, either. Taking your example and
> data.table from r-forge:
[ snip ]
> as you can see, DT is not modified.

Works for me on R 2.13.0. I'll try latest R later. If I can't reproduce the
non-working state I'll need some more environment information please.

> Also I suspect there is something quite amiss because even trivial things
> don't work:
>
> > DF[1:4,1:4]
>   V1 V2 V3 V4
> 1  3  1  1  1
> 2  1  1  1  1
> 3  1  1  1  1
> 4  1  1  1  1
> > DT[1:4,1:4]
> [1] 1 2 3 4

That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9
and 1.10.

> When I first saw your proposal, I thought you have rather something like
>   within(DT, V1[1] <- 3)
> in mind which looks innocent enough but performs terribly (note that I had
> to scale down the loop by a factor of 100!!!):
>
> > system.time(for (i in 1:10) within(DT, V1[1] <- 3))
>    user  system elapsed
>   2.701   4.437   7.138

No, since 'with' is already built into data.table, I was thinking of building
'within' in, too. I'll take a look at within(). Might as well provide as many
options as possible to the user to use as they wish.

> With the for loop something like within(DF, for (i in 1:1000) V1[i] <- 3)
> performs reasonably:
>
> > system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
>    user  system elapsed
>   0.392   0.613   1.003
>
> (Note: system.time() can be misleading when within() is involved, because
> the expression is evaluated in a different environment so within() won't
> actually change the object in the global environment - it also interacts
> with the possible duplication)

Noted, thanks. That's pretty fast. Does within() on data.frame fix the
original issue Ivo raised, then? If so, job done.

> Cheers,
> Simon
>
> On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:
>
>> Thanks for the replies and info. An attempt at fast
>> assign is now committed to data.table v1.6.3 on
>> R-Forge.
>> From NEWS :
>>
>>    o  Fast update is now implemented, FR#200.
>>       DT[i,j]<-value is now handled by data.table in C rather
>>       than falling through to data.frame methods.
>>
>>       Thanks to Ivo Welch for raising speed issues on r-devel,
>>       to Simon Urbanek for the suggestion, and Luke Tierney and
>>       Simon for information on R internals.
>>
>>       [<- syntax still incurs one working copy of the whole
>>       table (as of R 2.13.0) due to R's [<- dispatch mechanism
>>       copying to `*tmp*`, so, for ultimate speed and brevity,
>>       'within' syntax is now available as follows.
>>
>>    o  A new 'within' argument has been added to [.data.table,
>>       by default TRUE. It is very similar to the within()
>>       function in base R. If an assignment appears in j, it
>>       assigns to the column of DT, by reference; e.g.,
>>
>>           DT[i,colname<-value]
>>
>>       This syntax makes no copies of any part of memory at all.
>>
>>       > m = matrix(1,nrow=100000,ncol=100)
>>       > DF = as.data.frame(m)
>>       > DT = as.data.table(m)
>>       > system.time(for (i in 1:1000) DF[1,1] <- 3)
>>          user  system elapsed
>>       287.730 323.196 613.453
>>       > system.time(for (i in 1:1000) DT[1,V1 <- 3])
>>          user  system elapsed
>>         1.152   0.004   1.161     # 528 times faster
>>
>> Please note :
>>
>>    *******************************************************
>>    **  Within syntax is presently highly experimental.  **
>>    *******************************************************
>>
>> http://datatable.r-forge.r-project.org/
>>
>>
>> On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
>>> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>>>
>>>> Interesting, and I stand corrected:
>>>>
>>>> > x = data.frame(a=1:n,b=1:n)
>>>> > .Internal(inspect(x))
>>>> @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>   @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>   @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>
>>>> > x[1,1]=42L
>>>> > .Internal(inspect(x))
>>>> @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>   @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>   @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>
>>>> > x[[1]][1]=42L
>>>> > .Internal(inspect(x))
>>>> @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
>>>>   @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>   @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
>>>>
>>>> > x[[1]][1]=42L
>>>> > .Internal(inspect(x))
>>>> @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>   @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>   @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>
>>>>
>>>> I have R to release ;) so I won't be looking into this right now, but
>>>> it's something worth investigating ... Since all the inner contents
>>>> have NAMED=0 I would not expect any duplication to be needed, but
>>>> apparently it becomes so at some point ...
>>>
>>>
>>> The internals assume in various places that deep copies are made (one
>>> of the reasons NAMED settings are not propagated to sub-structure).
>>> The main issues are avoiding cycles and that there is no easy way to
>>> check for sharing. There may be some circumstances in which a shallow
>>> copy would be OK but making sure it would be in all cases is probably
>>> more trouble than it is worth at this point. (I've tried this in the
>>> past in a few cases and always had to back off.)
>>>
>>>
>>> Best,
>>>
>>> luke
>>>
>>>>
>>>> Cheers,
>>>> Simon
>>>>
>>>>
>>>> On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>>>>
>>>>>
>>>>> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>>>>>> No subassignment function satisfies that condition, because you can
>>>>>> always call them directly. However, that doesn't stop the default
>>>>>> method from making that assumption, so I'm not sure it's an issue.
>>>>>>
>>>>>> David, Just to clarify - the data frame content is not copied, we
>>>>>> are talking about the vector holding columns.
>>>>>
>>>>> If it is just the vector holding the columns that is copied (and not
>>>>> the columns themselves), why does n make a difference in this test
>>>>> (on R 2.13.0)?
>>>>>
>>>>> > n = 1000
>>>>> > x = data.frame(a=1:n,b=1:n)
>>>>> > system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>    user  system elapsed
>>>>>   0.628   0.000   0.628
>>>>> > n = 100000
>>>>> > x = data.frame(a=1:n,b=1:n)   # still 2 columns, but longer columns
>>>>> > system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>    user  system elapsed
>>>>>  20.145   1.232  21.455
>>>>>
>>>>> With $<- :
>>>>>
>>>>> > n = 1000
>>>>> > x = data.frame(a=1:n,b=1:n)
>>>>> > system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>    user  system elapsed
>>>>>   0.304   0.000   0.307
>>>>> > n = 100000
>>>>> > x = data.frame(a=1:n,b=1:n)
>>>>> > system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>    user  system elapsed
>>>>>  37.586   0.388  38.161
>>>>>
>>>>> If it's because the 1st column needs to be copied (only) because
>>>>> that's the one being assigned to (in this test), that magnitude of
>>>>> slow down doesn't seem consistent with the time of a vector copy of
>>>>> the 1st column :
>>>>>
>>>>> > n=100000
>>>>> > v = 1:n
>>>>> > system.time(for (i in 1:1000) v[1] <- 42L)
>>>>>    user  system elapsed
>>>>>   0.016   0.000   0.017
>>>>> > system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>>>>>    user  system elapsed
>>>>>   1.816   1.076   2.900
>>>>>
>>>>> Finally, increasing the number of columns, again only the 1st is
>>>>> assigned to :
>>>>>
>>>>> > n=100000
>>>>> > x = data.frame(rep(list(1:n),100))
>>>>> > dim(x)
>>>>> [1] 100000    100
>>>>> > system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>    user  system elapsed
>>>>> 167.974  50.903 219.711
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Simon
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Jul 5, 2011, at 9:01 PM, David Winsemius
>>>>>> <dwinsem...@comcast.net> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Jul 5, 2011, at 7:18 PM, <luke-tier...@uiowa.edu>
>>>>>>> <luke-tier...@uiowa.edu> wrote:
>>>>>>>
>>>>>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>>>>>
>>>>>>>>>> Simon (and all),
>>>>>>>>>>
>>>>>>>>>> I've tried to make assignment as fast as calling
>>>>>>>>>> `[<-.data.table` directly, for user convenience. Profiling
>>>>>>>>>> shows (IIUC) that it isn't dispatch, but x being copied. Is
>>>>>>>>>> there a way to prevent '[<-' from copying x?
>>>>>>>>>
>>>>>>>>> Good point, and conceptually, no. It's a subassignment after all
>>>>>>>>> - see R-lang 3.4.4 - it is equivalent to
>>>>>>>>>
>>>>>>>>> `*tmp*` <- x
>>>>>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>>>>>> rm(`*tmp*`)
>>>>>>>>>
>>>>>>>>> so there is always a copy involved.
>>>>>>>>>
>>>>>>>>> Now, a conceptual copy doesn't mean real copy in R since R tries
>>>>>>>>> to keep the pass-by-value illusion while passing references in
>>>>>>>>> cases where it knows that modifications cannot occur and/or they
>>>>>>>>> are safe. The default subassign method uses that feature which
>>>>>>>>> means it can afford to not duplicate if there is only one
>>>>>>>>> reference -- then it's safe to not duplicate as we are replacing
>>>>>>>>> that only existing reference. And in the case of a matrix, that
>>>>>>>>> will be true at the latest from the second subassignment on.
>>>>>>>>>
>>>>>>>>> Unfortunately the method dispatch (AFAICS) introduces one more
>>>>>>>>> reference in the dispatch chain so there will always be two
>>>>>>>>> references so duplication is necessary. Since we have only 0 / 1
>>>>>>>>> / 2+ information on the references, we can't distinguish whether
>>>>>>>>> the second reference is due to the dispatch or due to the passed
>>>>>>>>> object having more than one reference, so we have to duplicate in
>>>>>>>>> any case.
>>>>>>>>> That is unfortunate, and I don't see a way around
>>>>>>>>> (unless we handle subassignment methods in some special way).
>>>>>>>>
>>>>>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>>>>>> seems to confirm this though I don't guarantee I did that right).
>>>>>>>> The issue is that a replacement function implemented as a closure,
>>>>>>>> which is the only option for a package, will always see NAMED on
>>>>>>>> the object to be modified as 2 (because the value is obtained by
>>>>>>>> forcing the argument promise) and so any R level assignments will
>>>>>>>> duplicate. This also isn't really an issue of imprecise reference
>>>>>>>> counting -- there really are (at least) two legitimate references
>>>>>>>> -- one through the argument and one through the caller's
>>>>>>>> environment.
>>>>>>>>
>>>>>>>> It would be good if we could come up with a way for packages to be
>>>>>>>> able to define replacement functions that do not duplicate in
>>>>>>>> cases where we really don't want them to, but this would require
>>>>>>>> coming up with some sort of protocol, minimally involving an
>>>>>>>> efficient way to detect whether a replacement function is being
>>>>>>>> called in a replacement context or directly.
>>>>>>>
>>>>>>> Would "$<-" always satisfy that condition? It would be a big help
>>>>>>> to me if it could be designed to avoid duplicating the rest of the
>>>>>>> data.frame.
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>>
>>>>>>>> There are some replacement functions that use C code to cheat, but
>>>>>>>> these may create problems if called directly, so I won't advertise
>>>>>>>> them.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> luke
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Simon
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Luke Tierney
>>>>>>>> Statistics and Actuarial Science
>>>>>>>> Ralph E.
>>>>>>>> Wareham Professor of Mathematical Sciences
>>>>>>>> University of Iowa                  Phone:  319-335-3386
>>>>>>>> Department of Statistics and        Fax:    319-335-3017
>>>>>>>>    Actuarial Science
>>>>>>>> 241 Schaeffer Hall                  email:  l...@stat.uiowa.edu
>>>>>>>> Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
>>>>>>>> ______________________________________________
>>>>>>>> R-devel@r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>
>>>>>>> David Winsemius, MD
>>>>>>> West Hartford, CT
>>>>>>>
>>>
>>> --
>>> Luke Tierney
>>> Statistics and Actuarial Science
>>> Ralph E. Wareham Professor of Mathematical Sciences
>>> University of Iowa                  Phone:  319-335-3386
>>> Department of Statistics and        Fax:    319-335-3017
>>>    Actuarial Science
>>> 241 Schaeffer Hall                  email:  l...@stat.uiowa.edu
>>> Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
>>
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
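
For reference, the subassignment expansion quoted earlier in the thread (R-lang 3.4.4) can be reproduced by hand. A minimal sketch, assuming base R only; the objects x and y are illustrative and not taken from any of the messages, and the snippet checks only that both routes give the same result, not how many copies each one makes:

```r
# Base R only; x and y are illustrative names, not taken from the thread.
x <- data.frame(a = 1:3, b = 4:6)
y <- x

# Ordinary replacement syntax.
x[1, 1] <- 42L

# Hand-written equivalent of the expansion described in R-lang 3.4.4.
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 1, 1, value = 42L)
rm(`*tmp*`)

print(identical(x, y))  # both routes should give the same data frame
```

Timing either form on larger objects, as in the system.time() calls quoted in the thread, is the way to see the cost of the intermediate copy on a given R version.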