Interesting, and I stand corrected:

> x = data.frame(a=1:n,b=1:n)
> .Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
  @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> x[1,1]=42L
> .Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> x[[1]][1]=42L
> .Internal(inspect(x))
@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
  @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
> x[[1]][1]=42L
> .Internal(inspect(x))
@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...

I have R to release ;) so I won't be looking into this right now, but
it's something worth investigating ... Since all the inner contents have
NAMED=0 I would not expect any duplication to be needed, but apparently
it becomes so at some point ...

Cheers,
Simon


On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:

>
> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>> No subassignment function satisfies that condition, because you can always
>> call them directly. However, that doesn't stop the default method from
>> making that assumption, so I'm not sure it's an issue.
>>
>> David, Just to clarify - the data frame content is not copied, we are
>> talking about the vector holding columns.
>
> If it is just the vector holding the columns that is copied (and not the
> columns themselves), why does n make a difference in this test (on R
> 2.13.0)?
>
>> n = 1000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>    user  system elapsed
>   0.628   0.000   0.628
>> n = 100000
>> x = data.frame(a=1:n,b=1:n)  # still 2 columns, but longer columns
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>    user  system elapsed
>  20.145   1.232  21.455
>>
>
> With $<- :
>
>> n = 1000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>    user  system elapsed
>   0.304   0.000   0.307
>> n = 100000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>    user  system elapsed
>  37.586   0.388  38.161
>>
>
> If it's because the 1st column needs to be copied (only) because that's
> the one being assigned to (in this test), that magnitude of slow down
> doesn't seem consistent with the time of a vector copy of the 1st
> column :
>
>> n=100000
>> v = 1:n
>> system.time(for (i in 1:1000) v[1] <- 42L)
>    user  system elapsed
>   0.016   0.000   0.017
>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>    user  system elapsed
>   1.816   1.076   2.900
>
> Finally, increasing the number of columns, again only the 1st is
> assigned to :
>
>> n=100000
>> x = data.frame(rep(list(1:n),100))
>> dim(x)
> [1] 100000    100
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>    user  system elapsed
> 167.974  50.903 219.711
>>
>
>
>
>>
>> Cheers,
>> Simon
>>
>> Sent from my iPhone
>>
>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>>
>>>
>>> On Jul 5, 2011, at 7:18 PM, <luke-tier...@uiowa.edu>
>>> <luke-tier...@uiowa.edu> wrote:
>>>
>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>
>>>>>
>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>
>>>>>> Simon (and all),
>>>>>>
>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>>>>> copying x?
>>>>>
>>>>> Good point, and conceptually, no.
>>>>> It's a subassignment after all - see
>>>>> R-lang 3.4.4 - it is equivalent to
>>>>>
>>>>> `*tmp*` <- x
>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>> rm(`*tmp*`)
>>>>>
>>>>> so there is always a copy involved.
>>>>>
>>>>> Now, a conceptual copy doesn't mean a real copy in R since R tries to keep
>>>>> the pass-by-value illusion while passing references in cases where it
>>>>> knows that modifications cannot occur and/or they are safe. The default
>>>>> subassign method uses that feature which means it can afford to not
>>>>> duplicate if there is only one reference -- then it's safe to not
>>>>> duplicate as we are replacing that only existing reference. And in the
>>>>> case of a matrix, that will be true at the latest from the second
>>>>> subassignment on.
>>>>>
>>>>> Unfortunately the method dispatch (AFAICS) introduces one more reference
>>>>> in the dispatch chain so there will always be two references so
>>>>> duplication is necessary. Since we have only 0 / 1 / 2+ information on
>>>>> the references, we can't distinguish whether the second reference is due
>>>>> to the dispatch or due to the passed object having more than one
>>>>> reference, so we have to duplicate in any case. That is unfortunate, and
>>>>> I don't see a way around (unless we handle subassignment methods in some
>>>>> special way).
>>>>
>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>> seems to confirm this though I don't guarantee I did that right). The
>>>> issue is that a replacement function implemented as a closure, which
>>>> is the only option for a package, will always see NAMED on the object
>>>> to be modified as 2 (because the value is obtained by forcing the
>>>> argument promise) and so any R level assignments will duplicate. This
>>>> also isn't really an issue of imprecise reference counting -- there
>>>> really are (at least) two legitimate references -- one through the
>>>> argument and one through the caller's environment.
>>>>
>>>> It would be good if we could come up with a way for packages to be
>>>> able to define replacement functions that do not duplicate in cases
>>>> where we really don't want them to, but this would require coming up
>>>> with some sort of protocol, minimally involving an efficient way to
>>>> detect whether a replacement function is being called in a replacement
>>>> context or directly.
>>>
>>> Would "$<-" always satisfy that condition? It would be a big help to me if it
>>> could be designed to avoid duplicating the rest of the data.frame.
>>>
>>> --
>>>
>>>>
>>>> There are some replacement functions that use C code to cheat, but
>>>> these may create problems if called directly, so I won't advertise
>>>> them.
>>>>
>>>> Best,
>>>>
>>>> luke
>>>>
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Luke Tierney
>>>> Statistics and Actuarial Science
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa                  Phone:  319-335-3386
>>>> Department of Statistics and        Fax:    319-335-3017
>>>>    Actuarial Science
>>>> 241 Schaeffer Hall                  email:  l...@stat.uiowa.edu
>>>> Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
>>>> ______________________________________________
>>>> R-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>>
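
For anyone who wants to reproduce the R-lang 3.4.4 expansion quoted in the thread, here is a minimal sketch. It assumes a small two-column data frame like the one in the tests; the names `y`, `same_result` and `traced` are only for illustration and do not come from the thread. It checks that the replacement syntax and the hand-written expansion give the same result, then optionally switches on tracemem() so that any duplication of the data frame is reported. What tracemem() prints is likely to differ between R versions, so no output is shown.

```r
# Sketch only: compare `x[i, j] <- value` with its documented expansion.
n <- 1000L
x <- data.frame(a = 1:n, b = 1:n)
y <- x

# Ordinary replacement syntax.
x[1, 1] <- 42L

# The same assignment written out by hand, following R-lang 3.4.4.
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 1, 1, value = 42L)
rm(`*tmp*`)

# Both routes should leave identical data frames.
same_result <- identical(x, y)

# tracemem() is only available if R was built with memory profiling,
# so guard it; when enabled it reports duplications of the traced object.
traced <- capabilities("profmem")
if (traced) tracemem(x)
x[1, 1] <- 43L
if (traced) untracemem(x)
```

Note that tracemem() here traces the data frame object itself (the vector holding the columns); to compare column addresses before and after, .Internal(inspect(x)) as used at the top of this message is the more direct tool.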