Interesting, and I stand corrected:
> n = 100000
> x = data.frame(a=1:n,b=1:n)
> .Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
@102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
@102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> x[1,1]=42L
> .Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
@102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
@102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> x[[1]][1]=42L
> .Internal(inspect(x))
@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
@102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
@101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
> x[[1]][1]=42L
> .Internal(inspect(x))
@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
@102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
@102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
I have R to release ;) so I won't be looking into this right now, but it's
something worth investigating ... Since all the inner contents have NAMED=0 I
would not expect any duplication to be needed, but apparently it becomes so at
some point ...
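
For anyone who wants to poke at this without .Internal(inspect()), here is a
minimal sketch using tracemem(). It assumes an R build with memory profiling
enabled (checked through capabilities("profmem")), and what gets reported
will likely differ between R versions, so treat it as a starting point only.

```r
# Sketch: watch for duplications of the data frame during subassignment.
# tracemem() only works if R was built with memory profiling.
n <- 100000L
x <- data.frame(a = 1:n, b = 1:n)

if (isTRUE(unname(capabilities("profmem")))) {
  tracemem(x)        # mark the list that holds the columns
  x[1, 1] <- 42L     # any "tracemem[...]" lines printed here are copies
  untracemem(x)
} else {
  x[1, 1] <- 42L     # no memory profiling available; just do the assignment
}
```

Tracing a single column instead (for example tracemem(x$a)) may help to tell
a copy of the column vector apart from a copy of the list around it.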
Cheers,
Simon
On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>
> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>> No subassignment function satisfies that condition, because you can always
>> call them directly. However, that doesn't stop the default method from
>> making that assumption, so I'm not sure it's an issue.
>>
>> David, Just to clarify - the data frame content is not copied, we are
>> talking about the vector holding columns.
>
> If it is just the vector holding the columns that is copied (and not the
> columns themselves), why does n make a difference in this test (on R
> 2.13.0)?
>
>> n = 1000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> user system elapsed
> 0.628 0.000 0.628
>> n = 100000
>> x = data.frame(a=1:n,b=1:n) # still 2 columns, but longer columns
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> user system elapsed
> 20.145 1.232 21.455
>>
>
> With $<- :
>
>> n = 1000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x$a[1] <- 42L)
> user system elapsed
> 0.304 0.000 0.307
>> n = 100000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x$a[1] <- 42L)
> user system elapsed
> 37.586 0.388 38.161
>>
>
> If it's because the 1st column needs to be copied (only) because that's
> the one being assigned to (in this test), that magnitude of slow down
> doesn't seem consistent with the time of a vector copy of the 1st
> column :
>
>> n=100000
>> v = 1:n
>> system.time(for (i in 1:1000) v[1] <- 42L)
> user system elapsed
> 0.016 0.000 0.017
>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
> user system elapsed
> 1.816 1.076 2.900
>
> Finally, increasing the number of columns, again only the 1st is
> assigned to :
>
>> n=100000
>> x = data.frame(rep(list(1:n),100))
>> dim(x)
> [1] 100000 100
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> user system elapsed
> 167.974 50.903 219.711
>>
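
The timings above could be wrapped into one small function so that column
length and column count can be varied independently. This is only a sketch:
time_subassign() and its arguments are made-up names, and the default of 100
repetitions is an assumption chosen to keep the run short.

```r
# Hypothetical helper: time `reps` single-cell assignments into a data frame
# with `ncols` integer columns, each of length `n`.
time_subassign <- function(n, ncols = 2L, reps = 100L) {
  x <- data.frame(rep(list(seq_len(n)), ncols))
  system.time(for (i in seq_len(reps)) x[1, 1] <- 42L)[["elapsed"]]
}

# Elapsed seconds for short vs. long columns (2 columns each).
sapply(c(short = 1000L, long = 100000L), time_subassign)
```

If only the list holding the columns were copied, the two timings should be
close to each other; a large gap suggests the columns themselves are copied.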
>
>
>
>>
>> Cheers,
>> Simon
>>
>> Sent from my iPhone
>>
>> On Jul 5, 2011, at 9:01 PM, David Winsemius <[email protected]> wrote:
>>
>>>
>>> On Jul 5, 2011, at 7:18 PM, <[email protected]>
>>> <[email protected]> wrote:
>>>
>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>
>>>>>
>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>
>>>>>> Simon (and all),
>>>>>>
>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>>>>> copying x?
>>>>>
>>>>> Good point, and conceptually, no. It's a subassignment after all - see
>>>>> R-lang 3.4.4 - it is equivalent to
>>>>>
>>>>> `*tmp*` <- x
>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>> rm(`*tmp*`)
>>>>>
>>>>> so there is always a copy involved.
>>>>>
>>>>> Now, a conceptual copy doesn't mean real copy in R since R tries to keep
>>>>> the pass-by-value illusion while passing references in cases where it
>>>>> knows that modifications cannot occur and/or they are safe. The default
>>>>> subassign method uses that feature which means it can afford to not
>>>>> duplicate if there is only one reference -- then it's safe to not
>>>>> duplicate as we are replacing that only existing reference. And in the
>>>>> case of a matrix, that will be true at the latest from the second
>>>>> subassignment on.
>>>>>
>>>>> Unfortunately the method dispatch (AFAICS) introduces one more reference
>>>>> in the dispatch chain so there will always be two references so
>>>>> duplication is necessary. Since we have only 0 / 1 / 2+ information on
>>>>> the references, we can't distinguish whether the second reference is due
>>>>> to the dispatch or due to the passed object having more than one
>>>>> reference, so we have to duplicate in any case. That is unfortunate, and
>>>>> I don't see a way around it (unless we handle subassignment methods in some
>>>>> special way).
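
The R-lang expansion quoted above can be written out by hand to see that it
gives the same result as the ordinary assignment. A minimal sketch on a small
matrix (the objects x and y are only for illustration); it shows the
semantics, not whether a real copy is made internally.

```r
# Sketch of the expansion of `x[1, 1] <- 42L`, spelled out step by step.
x <- matrix(1:6, nrow = 2)
y <- x                      # same contents, assigned into the usual way below

`*tmp*` <- x
x <- `[<-`(`*tmp*`, 1, 1, value = 42L)
rm(`*tmp*`)

y[1, 1] <- 42L              # the ordinary subassignment form
identical(x, y)
```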
>>>>
>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>> seems to confirm this though I don't guarantee I did that right). The
>>>> issue is that a replacement function implemented as a closure, which
>>>> is the only option for a package, will always see NAMED on the object
>>>> to be modified as 2 (because the value is obtained by forcing the
>>>> argument promise) and so any R level assignments will duplicate. This
>>>> also isn't really an issue of imprecise reference counting -- there
>>>> really are (at least) two legitimate references -- one through the
>>>> argument and one through the caller's environment.
>>>>
>>>> It would be good if we could come up with a way for packages to be
>>>> able to define replacement functions that do not duplicate in cases
>>>> where we really don't want them to, but this would require coming up
>>>> with some sort of protocol, minimally involving an efficient way to
>>>> detect whether a replacement function is being called in a replacement
>>>> context or directly.
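
To make the two calling styles concrete, here is a small sketch of a
replacement function written as a closure, the way a package would have to
define it. The name `first<-` is hypothetical. Both calls return the same
value; from inside the function there is no obvious way to tell which of the
two styles was used.

```r
# Hypothetical replacement function defined as an ordinary closure.
`first<-` <- function(x, value) {
  x[1] <- value   # R-level assignment inside the closure; may duplicate x
  x
}

v <- 1:5
first(v) <- 42L            # called in a replacement context
w <- `first<-`(1:5, 42L)   # called directly, like any other function
identical(v, w)
```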
>>>
>>> Would "$<-" always satisfy that condition? It would be a big help to me if it
>>> could be designed to avoid duplicating the rest of the data.frame.
>>>
>>> --
>>>
>>>>
>>>> There are some replacement functions that use C code to cheat, but
>>>> these may create problems if called directly, so I won't advertise
>>>> them.
>>>>
>>>> Best,
>>>>
>>>> luke
>>>>
>>>>>
>>>>> Cheers,
>>>>> Simon
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Luke Tierney
>>>> Statistics and Actuarial Science
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa Phone: 319-335-3386
>>>> Department of Statistics and Fax: 319-335-3017
>>>> Actuarial Science
>>>> 241 Schaeffer Hall email: [email protected]
>>>> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
>>>> ______________________________________________
>>>> [email protected] mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>>
>
>
>