Re: [Rd] [datatable-help] speeding up perception

Simon Urbanek Wed, 06 Jul 2011 06:32:33 -0700

Interesting, and I stand corrected:

> x = data.frame(a=1:n,b=1:n)
> .Internal(inspect(x))
@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
  @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...


> x[1,1]=42L
> .Internal(inspect(x))
@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...

> x[[1]][1]=42L
> .Internal(inspect(x))
@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
  @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...

> x[[1]][1]=42L
> .Internal(inspect(x))
@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
  @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
  @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...


I have R to release ;) so I won't be looking into this right now, but it's 
something worth investigating ... Since all the inner contents have NAMED=0 I 
would not expect any duplication to be needed, but apparently it becomes 
necessary at some point ...
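
For anyone who wants to poke at this without comparing addresses by hand, here
is a minimal sketch using tracemem(), which reports duplications of a traced
object as they happen. It assumes an R build with memory profiling enabled
(hence the capabilities("profmem") guard). What gets reported, if anything,
depends on the R version, so no output is shown.

```r
# Sketch only: watch for duplication with tracemem() instead of reading
# the addresses printed by .Internal(inspect(x)).
# Assumption: this R build has memory profiling enabled; tracemem() is
# not available otherwise, hence the guard.
n <- 100000L
x <- data.frame(a = 1:n, b = 1:n)

trace_ok <- isTRUE(unname(capabilities("profmem")))
if (trace_ok) invisible(tracemem(x))   # report duplications of x

x[1, 1] <- 42L     # dispatches to `[<-.data.frame`

# Nested replacement, evaluated roughly as
#   x <- `[[<-`(x, 1, value = `[<-`(x[[1]], 1, value = 43L))
x[[1]][1] <- 43L

if (trace_ok) untracemem(x)
```

Whatever tracemem() reports, the assignments themselves must leave column b
untouched and change only the first element of column a.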

Cheers,
Simon


On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:

> 
> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>> No subassignment function satisfies that condition, because you can always 
>> call them directly. However, that doesn't stop the default method from 
>> making that assumption, so I'm not sure it's an issue.
>> 
>> David, Just to clarify - the data frame content is not copied, we are 
>> talking about the vector holding columns.
> 
> If it is just the vector holding the columns that is copied (and not the
> columns themselves), why does n make a difference in this test (on R
> 2.13.0)?
> 
>> n = 1000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>   user  system elapsed 
>  0.628   0.000   0.628 
>> n = 100000
>> x = data.frame(a=1:n,b=1:n)      # still 2 columns, but longer columns
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>   user  system elapsed 
> 20.145   1.232  21.455 
>> 
> 
> With $<- :
> 
>> n = 1000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>   user  system elapsed 
>  0.304   0.000   0.307 
>> n = 100000
>> x = data.frame(a=1:n,b=1:n)
>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>   user  system elapsed 
> 37.586   0.388  38.161 
>> 
> 
> If it's because the 1st column needs to be copied (only) because that's
> the one being assigned to (in this test), that magnitude of slow down
> doesn't seem consistent with the time of a vector copy of the 1st
> column : 
> 
>> n=100000
>> v = 1:n
>> system.time(for (i in 1:1000) v[1] <- 42L)
>   user  system elapsed 
>  0.016   0.000   0.017 
>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>   user  system elapsed 
>  1.816   1.076   2.900
> 
> Finally, increasing the number of columns, again only the 1st is
> assigned to :
> 
>> n=100000
>> x = data.frame(rep(list(1:n),100))
>> dim(x)
> [1] 100000    100
>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>   user  system elapsed 
> 167.974  50.903 219.711 
>> 
> 
> 
> 
>> 
>> Cheers,
>> Simon
>> 
>> Sent from my iPhone
>> 
>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>> 
>>> 
>>> On Jul 5, 2011, at 7:18 PM, <luke-tier...@uiowa.edu> 
>>> <luke-tier...@uiowa.edu> wrote:
>>> 
>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>> 
>>>>> 
>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>> 
>>>>>> Simon (and all),
>>>>>> 
>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>>>>> copying x?
>>>>> 
>>>>> Good point, and conceptually, no. It's a subassignment after all - see 
>>>>> R-lang 3.4.4 - it is equivalent to
>>>>> 
>>>>> `*tmp*` <- x
>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>> rm(`*tmp*`)
>>>>> 
>>>>> so there is always a copy involved.
>>>>> 
>>>>> Now, a conceptual copy doesn't mean real copy in R since R tries to keep 
>>>>> the pass-by-value illusion while passing references in cases where it 
>>>>> knows that modifications cannot occur and/or they are safe. The default 
>>>>> subassign method uses that feature which means it can afford to not 
>>>>> duplicate if there is only one reference -- then it's safe to not 
>>>>> duplicate as we are replacing that only existing reference. And in the 
>>>>> case of a matrix, that will be true at the latest from the second 
>>>>> subassignment on.
>>>>> 
>>>>> Unfortunately the method dispatch (AFAICS) introduces one more reference 
>>>>> in the dispatch chain so there will always be two references so 
>>>>> duplication is necessary. Since we have only 0 / 1 / 2+ information on 
>>>>> the references, we can't distinguish whether the second reference is due 
>>>>> to the dispatch or due to the passed object having more than one 
>>>>> reference, so we have to duplicate in any case. That is unfortunate, and 
>>>>> I don't see a way around (unless we handle subassignment methods in some 
>>>>> special way).
>>>> 
>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>> seems to confirm this though I don't guarantee I did that right). The
>>>> issue is that a replacement function implemented as a closure, which
>>>> is the only option for a package, will always see NAMED on the object
>>>> to be modified as 2 (because the value is obtained by forcing the
>>>> argument promise) and so any R level assignments will duplicate.  This
>>>> also isn't really an issue of imprecise reference counting -- there
>>>> really are (at least) two legitimate references -- one through the
>>>> argument and one through the caller's environment.
>>>> 
>>>> It would be good if we could come up with a way for packages to be
>>>> able to define replacement functions that do not duplicate in cases
>>>> where we really don't want them to, but this would require coming up
>>>> with some sort of protocol, minimally involving an efficient way to
>>>> detect whether a replacement function is being called in a replacement
>>>> context or directly.
>>> 
>>> Would "$<-" always satisfy that condition? It would be a big help to me if 
>>> it could be designed to avoid duplicating the rest of the data.frame.
>>> 
>>> -- 
>>> 
>>>> 
>>>> There are some replacement functions that use C code to cheat, but
>>>> these may create problems if called directly, so I won't advertise
>>>> them.
>>>> 
>>>> Best,
>>>> 
>>>> luke
>>>> 
>>>>> 
>>>>> Cheers,
>>>>> Simon
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> -- 
>>>> Luke Tierney
>>>> Statistics and Actuarial Science
>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>> University of Iowa                  Phone:             319-335-3386
>>>> Department of Statistics and        Fax:               319-335-3017
>>>> Actuarial Science
>>>> 241 Schaeffer Hall                  email:      l...@stat.uiowa.edu
>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>> ______________________________________________
>>>> R-devel@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>>> David Winsemius, MD
>>> West Hartford, CT
>>> 
>>> 
> 
> 
> 

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
