Simon,

If you didn't install.packages() with type="source" from R-Forge, that would explain (some of) it. R-Forge builds binaries once each night, and this commit was long after the cutoff.

Matthew
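(For anyone wanting the same build, a minimal install sketch; it assumes the standard R-Forge repository URL, so adjust if yours differs:)

    install.packages("data.table",
                     repos = "http://R-Forge.R-project.org",
                     type  = "source")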
>> Matthew,
>>
>> I was hoping I had misunderstood your first proposal, but I suspect I did not ;).
>>
>> Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to
>> evaluate to
>>   { V1 <- 3; DT[1, V1] }
>> thus returning the first element of the third column.
>
> Please see FAQ 1.1, since further below it seems to be an expectation
> issue about 'with' syntax, too.
>
>> That said, I don't think it works, either. Taking your example and
>> data.table from R-Forge:
> [ snip ]
>> as you can see, DT is not modified.
>
> Works for me on R 2.13.0. I'll try the latest R later. If I can't reproduce
> the non-working state I'll need some more environment information, please.
>
>> Also I suspect there is something quite amiss, because even trivial things
>> don't work:
>>
>>> DF[1:4,1:4]
>>   V1 V2 V3 V4
>> 1  3  1  1  1
>> 2  1  1  1  1
>> 3  1  1  1  1
>> 4  1  1  1  1
>>> DT[1:4,1:4]
>> [1] 1 2 3 4
>
> That's correct and fundamental to data.table. See FAQs 1.1, 1.7, 1.8, 1.9
> and 1.10.
>
>> When I first saw your proposal, I thought you rather had something like
>>   within(DT, V1[1] <- 3)
>> in mind, which looks innocent enough but performs terribly (note that I had
>> to scale down the loop by a factor of 100!!!):
>>
>>> system.time(for (i in 1:10) within(DT, V1[1] <- 3))
>>    user  system elapsed
>>   2.701   4.437   7.138
>
> No, since 'with' is already built into data.table, I was thinking of
> building 'within' in, too. I'll take a look at within(). Might as well
> provide as many options as possible to the user to use as they wish.
>
>> With the for loop, something like within(DF, for (i in 1:1000) V1[i] <- 3)
>> performs reasonably:
>>
>>> system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
>>    user  system elapsed
>>   0.392   0.613   1.003
>>
>> (Note: system.time() can be misleading when within() is involved, because
>> the expression is evaluated in a different environment, so within() won't
>> actually change the object in the global environment - it also interacts
>> with the possible duplication.)
>
> Noted, thanks. That's pretty fast. Does within() on data.frame fix the
> original issue Ivo raised, then? If so, job done.
>
>> Cheers,
>> Simon
>>
>> On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:
>>
>>> Thanks for the replies and info. An attempt at fast
>>> assign is now committed to data.table v1.6.3 on
>>> R-Forge. From NEWS:
>>>
>>> o  Fast update is now implemented, FR#200.
>>>    DT[i,j]<-value is now handled by data.table in C rather
>>>    than falling through to data.frame methods.
>>>
>>>    Thanks to Ivo Welch for raising speed issues on r-devel,
>>>    to Simon Urbanek for the suggestion, and Luke Tierney and
>>>    Simon for information on R internals.
>>>
>>>    [<- syntax still incurs one working copy of the whole
>>>    table (as of R 2.13.0) due to R's [<- dispatch mechanism
>>>    copying to `*tmp*`, so, for ultimate speed and brevity,
>>>    'within' syntax is now available as follows.
>>>
>>> o  A new 'within' argument has been added to [.data.table,
>>>    by default TRUE. It is very similar to the within()
>>>    function in base R. If an assignment appears in j, it
>>>    assigns to the column of DT, by reference; e.g.,
>>>
>>>      DT[i,colname<-value]
>>>
>>>    This syntax makes no copies of any part of memory at all.
>>>
>>>> m = matrix(1,nrow=100000,ncol=100)
>>>> DF = as.data.frame(m)
>>>> DT = as.data.table(m)
>>>> system.time(for (i in 1:1000) DF[1,1] <- 3)
>>>    user  system elapsed
>>> 287.730 323.196 613.453
>>>> system.time(for (i in 1:1000) DT[1,V1 <- 3])
>>>    user  system elapsed
>>>   1.152   0.004   1.161    # 528 times faster
>>>
>>> Please note:
>>>
>>> *******************************************************
>>> **  Within syntax is presently highly experimental.  **
>>> *******************************************************
>>>
>>> http://datatable.r-forge.r-project.org/
>>>
>>>
>>> On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
>>>> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>>>>
>>>>> Interesting, and I stand corrected:
>>>>>
>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>> .Internal(inspect(x))
>>>>> @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>>   @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>   @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[1,1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>>   @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>>   @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[[1]][1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
>>>>>   @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>>   @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>> x[[1]][1]=42L
>>>>>> .Internal(inspect(x))
>>>>> @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>>>>   @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>>>>   @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>>>>
>>>>>
>>>>> I have R to release ;) so I won't be looking into this right now, but
>>>>> it's something worth investigating ... Since all the inner contents
>>>>> have NAMED=0 I would not expect any duplication to be needed, but
>>>>> apparently it becomes so at some point ...
>>>>
>>>>
>>>> The internals assume in various places that deep copies are made (one
>>>> of the reasons NAMED settings are not propagated to sub-structure).
>>>> The main issues are avoiding cycles and that there is no easy way to
>>>> check for sharing. There may be some circumstances in which a shallow
>>>> copy would be OK, but making sure it would be in all cases is probably
>>>> more trouble than it is worth at this point. (I've tried this in the
>>>> past in a few cases and always had to back off.)
>>>>
>>>>
>>>> Best,
>>>>
>>>> luke
>>>>
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>>
>>>>> On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>>>>>
>>>>>>
>>>>>> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>>>>>>> No subassignment function satisfies that condition, because you can
>>>>>>> always call them directly. However, that doesn't stop the default
>>>>>>> method from making that assumption, so I'm not sure it's an issue.
>>>>>>>
>>>>>>> David, just to clarify - the data frame content is not copied; we
>>>>>>> are talking about the vector holding the columns.
>>>>>>
>>>>>> If it is just the vector holding the columns that is copied (and not
>>>>>> the columns themselves), why does n make a difference in this test
>>>>>> (on R 2.13.0)?
>>>>>>
>>>>>>> n = 1000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>>    user  system elapsed
>>>>>>   0.628   0.000   0.628
>>>>>>> n = 100000
>>>>>>> x = data.frame(a=1:n,b=1:n)   # still 2 columns, but longer columns
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>>    user  system elapsed
>>>>>>  20.145   1.232  21.455
>>>>>>>
>>>>>>
>>>>>> With $<- :
>>>>>>
>>>>>>> n = 1000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>>    user  system elapsed
>>>>>>   0.304   0.000   0.307
>>>>>>> n = 100000
>>>>>>> x = data.frame(a=1:n,b=1:n)
>>>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>>>    user  system elapsed
>>>>>>  37.586   0.388  38.161
>>>>>>>
>>>>>>
>>>>>> If it's because the 1st column needs to be copied (only) because that's
>>>>>> the one being assigned to (in this test), that magnitude of slowdown
>>>>>> doesn't seem consistent with the time of a vector copy of the 1st
>>>>>> column:
>>>>>>
>>>>>>> n=100000
>>>>>>> v = 1:n
>>>>>>> system.time(for (i in 1:1000) v[1] <- 42L)
>>>>>>    user  system elapsed
>>>>>>   0.016   0.000   0.017
>>>>>>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>>>>>>    user  system elapsed
>>>>>>   1.816   1.076   2.900
>>>>>>
>>>>>> Finally, increasing the number of columns, again with only the 1st
>>>>>> assigned to:
>>>>>>
>>>>>>> n=100000
>>>>>>> x = data.frame(rep(list(1:n),100))
>>>>>>> dim(x)
>>>>>> [1] 100000    100
>>>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>>>    user  system elapsed
>>>>>> 167.974  50.903 219.711
>>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Simon
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsem...@comcast.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Jul 5, 2011, at 7:18 PM, <luke-tier...@uiowa.edu>
>>>>>>>> <luke-tier...@uiowa.edu> wrote:
>>>>>>>>
>>>>>>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>>>>>>
>>>>>>>>>>> Simon (and all),
>>>>>>>>>>>
>>>>>>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>>>>>>> directly, for user convenience. Profiling shows (IIUC) that it
>>>>>>>>>>> isn't dispatch, but x being copied. Is there a way to prevent
>>>>>>>>>>> '[<-' from copying x?
>>>>>>>>>>
>>>>>>>>>> Good point, and conceptually, no. It's a subassignment after all
>>>>>>>>>> - see R-lang 3.4.4 - it is equivalent to
>>>>>>>>>>
>>>>>>>>>>   `*tmp*` <- x
>>>>>>>>>>   x <- `[<-`(`*tmp*`, i, j, value)
>>>>>>>>>>   rm(`*tmp*`)
>>>>>>>>>>
>>>>>>>>>> so there is always a copy involved.
>>>>>>>>>>
>>>>>>>>>> Now, a conceptual copy doesn't mean a real copy in R, since R tries
>>>>>>>>>> to keep the pass-by-value illusion while passing references in
>>>>>>>>>> cases where it knows that modifications cannot occur and/or they
>>>>>>>>>> are safe. The default subassign method uses that feature, which
>>>>>>>>>> means it can afford not to duplicate if there is only one
>>>>>>>>>> reference -- then it's safe not to duplicate as we are replacing
>>>>>>>>>> that only existing reference. And in the case of a matrix, that
>>>>>>>>>> will be true at the latest from the second subassignment on.
>>>>>>>>>>
>>>>>>>>>> Unfortunately the method dispatch (AFAICS) introduces one more
>>>>>>>>>> reference in the dispatch chain, so there will always be two
>>>>>>>>>> references and duplication is necessary.
>>>>>>>>>> Since we have only 0 / 1 / 2+
>>>>>>>>>> information on the references, we can't distinguish whether
>>>>>>>>>> the second reference is due to the dispatch or due to the passed
>>>>>>>>>> object having more than one reference, so we have to duplicate in
>>>>>>>>>> any case. That is unfortunate, and I don't see a way around it
>>>>>>>>>> (unless we handle subassignment methods in some special way).
>>>>>>>>>
>>>>>>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>>>>>>> seems to confirm this, though I don't guarantee I did that right). The
>>>>>>>>> issue is that a replacement function implemented as a closure, which
>>>>>>>>> is the only option for a package, will always see NAMED on the object
>>>>>>>>> to be modified as 2 (because the value is obtained by forcing the
>>>>>>>>> argument promise) and so any R level assignments will duplicate. This
>>>>>>>>> also isn't really an issue of imprecise reference counting -- there
>>>>>>>>> really are (at least) two legitimate references -- one through the
>>>>>>>>> argument and one through the caller's environment.
>>>>>>>>>
>>>>>>>>> It would be good if we could come up with a way for packages to be
>>>>>>>>> able to define replacement functions that do not duplicate in cases
>>>>>>>>> where we really don't want them to, but this would require coming up
>>>>>>>>> with some sort of protocol, minimally involving an efficient way to
>>>>>>>>> detect whether a replacement function is being called in a
>>>>>>>>> replacement context or directly.
>>>>>>>>
>>>>>>>> Would "$<-" always satisfy that condition? It would be a big help to
>>>>>>>> me if it could be designed to avoid duplicating the rest of the
>>>>>>>> data.frame.
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>>>
>>>>>>>>> There are some replacement functions that use C code to cheat, but
>>>>>>>>> these may create problems if called directly, so I won't advertise
>>>>>>>>> them.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> luke
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Simon
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Luke Tierney
>>>>>>>>> Statistics and Actuarial Science
>>>>>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>>>>>> University of Iowa                  Phone: 319-335-3386
>>>>>>>>> Department of Statistics and        Fax:   319-335-3017
>>>>>>>>>    Actuarial Science
>>>>>>>>> 241 Schaeffer Hall                  email: l...@stat.uiowa.edu
>>>>>>>>> Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu
>>>>>>>>>
>>>>>>>>> ______________________________________________
>>>>>>>>> R-devel@r-project.org mailing list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>>>>
>>>>>>>> David Winsemius, MD
>>>>>>>> West Hartford, CT
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Luke Tierney
>>>> Statistics and Actuarial Science
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa                  Phone: 319-335-3386
>>>> Department of Statistics and        Fax:   319-335-3017
>>>>    Actuarial Science
>>>> 241 Schaeffer Hall                  email: l...@stat.uiowa.edu
>>>> Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu
>>>
>>
>

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
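A minimal way to observe the working copy discussed in the thread above, for anyone following along; this sketch assumes an R build with memory profiling enabled (tracemem() errors otherwise), and the exact copies reported vary by R version:

    df <- data.frame(a = 1:100000, b = 1:100000)
    tracemem(df)         # mark df so that any duplication of it is printed
    df[1, 1] <- 42L      # goes through `*tmp*` and [<-.data.frame; typically reports at least one copy
    untracemem(df)       # stop tracing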