Thanks for the replies and info. An attempt at fast
assignment is now committed to data.table v1.6.3 on
R-Forge. From NEWS:

o   Fast update is now implemented, FR#200.
    DT[i,j]<-value is now handled by data.table in C rather
    than falling through to data.frame methods.
    
    Thanks to Ivo Welch for raising speed issues on r-devel,
    to Simon Urbanek for the suggestion, and Luke Tierney and
    Simon for information on R internals.

    [<- syntax still incurs one working copy of the whole
    table (as of R 2.13.0) due to R's [<- dispatch mechanism
    copying to `*tmp*`, so, for ultimate speed and brevity,
    'within' syntax is now available as follows.
        
o   A new 'within' argument has been added to [.data.table,
    by default TRUE. It is very similar to the within()
    function in base R. If an assignment appears in j, it
    assigns to the column of DT, by reference; e.g.,
         
    DT[i,colname<-value]
        
    This syntax makes no copies of any part of memory at all.
        
    > m = matrix(1,nrow=100000,ncol=100)
    > DF = as.data.frame(m)
    > DT = as.data.table(m)
    > system.time(for (i in 1:1000) DF[1,1] <- 3)
       user  system elapsed 
    287.730 323.196 613.453 
    > system.time(for (i in 1:1000) DT[1,V1 <- 3])
       user  system elapsed 
      1.152   0.004   1.161         # 528 times faster
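
For readers unfamiliar with the `*tmp*` step mentioned in the NEWS entry
above, here is a minimal base-R sketch of the expansion described in R-lang
3.4.4 and quoted further down this thread. It is not data.table code, and the
vectors x and y are illustrative names only. It shows that the assignment
rebinds x while a second name for the old values is left untouched.

```r
# Illustrative names only: x and y are not from the original post.
x <- c(1L, 2L, 3L)
y <- x                        # a second name bound to the same values

# Conceptual expansion of  x[1] <- 42L  (per R-lang 3.4.4):
`*tmp*` <- x
x <- `[<-`(`*tmp*`, 1, value = 42L)
rm(`*tmp*`)

print(x)                      # x has been rebound to the modified vector
print(y)                      # y still holds the original values
```

Because y still refers to the original values, the subassignment cannot
safely overwrite them in place here; it has to work on a duplicate.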

Please note :
        
    *******************************************************
    **  Within syntax is presently highly experimental.  **
    *******************************************************

http://datatable.r-forge.r-project.org/
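
The thread below distinguishes a replacement function called in a replacement
context from one called directly. As a small sketch of that difference, using
a hypothetical replacement function `first<-` (an assumption for illustration,
not part of R or data.table):

```r
# Hypothetical replacement function, defined as a closure.
`first<-` <- function(x, value) {
  x[1] <- value               # modifies the function's own view of x
  x                           # replacement functions return the new object
}

v <- c(1L, 2L, 3L)

first(v) <- 42L               # replacement context: v is rebound to the result
w <- `first<-`(v, 0L)         # direct call: the result goes to w, v is unchanged

print(v)
print(w)
```

In the direct call the caller still expects v to be unchanged, which is one
reason a closure cannot simply modify its argument in place.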


On Wed, 2011-07-06 at 09:08 -0500, luke-tier...@uiowa.edu wrote:
> On Wed, 6 Jul 2011, Simon Urbanek wrote:
> 
> > Interesting, and I stand corrected:
> >
> >> x = data.frame(a=1:n,b=1:n)
> >> .Internal(inspect(x))
> > @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
> >  @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> >  @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> >
> >> x[1,1]=42L
> >> .Internal(inspect(x))
> > @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
> >  @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
> >  @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> >
> >> x[[1]][1]=42L
> >> .Internal(inspect(x))
> > @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
> >  @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
> >  @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
> >
> >> x[[1]][1]=42L
> >> .Internal(inspect(x))
> > @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
> >  @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
> >  @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
> >
> >
> > I have R to release ;) so I won't be looking into this right now, but it's 
> > something worth investigating ... Since all the inner contents have NAMED=0 
> > I would not expect any duplication to be needed, but apparently it 
> > becomes so at some point ...
> 
> 
> The internals assume in various places that deep copies are made (one
> of the reasons NAMED settings are not propagated to sub-structure).
> The main issues are avoiding cycles and that there is no easy way to
> check for sharing.  There may be some circumstances in which a shallow
> copy would be OK but making sure it would be in all cases is probably
> more trouble than it is worth at this point. (I've tried this in the
> past in a few cases and always had to back off.)
> 
> 
> Best,
> 
> luke
> 
> >
> > Cheers,
> > Simon
> >
> >
> > On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
> >
> >>
> >> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
> >>> No subassignment function satisfies that condition, because you can 
> >>> always call them directly. However, that doesn't stop the default method 
> >>> from making that assumption, so I'm not sure it's an issue.
> >>>
> >>> David, Just to clarify - the data frame content is not copied, we are 
> >>> talking about the vector holding columns.
> >>
> >> If it is just the vector holding the columns that is copied (and not the
> >> columns themselves), why does n make a difference in this test (on R
> >> 2.13.0)?
> >>
> >>> n = 1000
> >>> x = data.frame(a=1:n,b=1:n)
> >>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> >>   user  system elapsed
> >>  0.628   0.000   0.628
> >>> n = 100000
> >>> x = data.frame(a=1:n,b=1:n)      # still 2 columns, but longer columns
> >>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> >>   user  system elapsed
> >> 20.145   1.232  21.455
> >>>
> >>
> >> With $<- :
> >>
> >>> n = 1000
> >>> x = data.frame(a=1:n,b=1:n)
> >>> system.time(for (i in 1:1000) x$a[1] <- 42L)
> >>   user  system elapsed
> >>  0.304   0.000   0.307
> >>> n = 100000
> >>> x = data.frame(a=1:n,b=1:n)
> >>> system.time(for (i in 1:1000) x$a[1] <- 42L)
> >>   user  system elapsed
> >> 37.586   0.388  38.161
> >>>
> >>
> >> If it's because the 1st column needs to be copied (only) because that's
> >> the one being assigned to (in this test), that magnitude of slow down
> >> doesn't seem consistent with the time of a vector copy of the 1st
> >> column :
> >>
> >>> n=100000
> >>> v = 1:n
> >>> system.time(for (i in 1:1000) v[1] <- 42L)
> >>   user  system elapsed
> >>  0.016   0.000   0.017
> >>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
> >>   user  system elapsed
> >>  1.816   1.076   2.900
> >>
> >> Finally, increasing the number of columns, again only the 1st is
> >> assigned to :
> >>
> >>> n=100000
> >>> x = data.frame(rep(list(1:n),100))
> >>> dim(x)
> >> [1] 100000    100
> >>> system.time(for (i in 1:1000) x[1,1] <- 42L)
> >>   user  system elapsed
> >> 167.974  50.903 219.711
> >>>
> >>
> >>
> >>
> >>>
> >>> Cheers,
> >>> Simon
> >>>
> >>> Sent from my iPhone
> >>>
> >>> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsem...@comcast.net> 
> >>> wrote:
> >>>
> >>>>
> >>>> On Jul 5, 2011, at 7:18 PM, <luke-tier...@uiowa.edu> 
> >>>> <luke-tier...@uiowa.edu> wrote:
> >>>>
> >>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
> >>>>>
> >>>>>>
> >>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
> >>>>>>
> >>>>>>> Simon (and all),
> >>>>>>>
> >>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
> >>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
> >>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
> >>>>>>> copying x?
> >>>>>>
> >>>>>> Good point, and conceptually, no. It's a subassignment after all - see 
> >>>>>> R-lang 3.4.4 - it is equivalent to
> >>>>>>
> >>>>>> `*tmp*` <- x
> >>>>>> x <- `[<-`(`*tmp*`, i, j, value)
> >>>>>> rm(`*tmp*`)
> >>>>>>
> >>>>>> so there is always a copy involved.
> >>>>>>
> >>>>>> Now, a conceptual copy doesn't mean real copy in R since R tries to 
> >>>>>> keep the pass-by-value illusion while passing references in cases 
> >>>>>> where it knows that modifications cannot occur and/or they are safe. 
> >>>>>> The default subassign method uses that feature which means it can 
> >>>>>> afford to not duplicate if there is only one reference -- then it's 
> >>>>>> safe to not duplicate as we are replacing that only existing 
> >>>>>> reference. And in the case of a matrix, that will be true at the 
> >>>>>> latest from the second subassignment on.
> >>>>>>
> >>>>>> Unfortunately the method dispatch (AFAICS) introduces one more 
> >>>>>> reference in the dispatch chain so there will always be two references 
> >>>>>> so duplication is necessary. Since we have only 0 / 1 / 2+ information 
> >>>>>> on the references, we can't distinguish whether the second reference 
> >>>>>> is due to the dispatch or due to the passed object having more than 
> >>>>>> one reference, so we have to duplicate in any case. That is 
> >>>>>> unfortunate, and I don't see a way around (unless we handle 
> >>>>>> subassignment methods in some special way).
> >>>>>
> >>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
> >>>>> seems to confirm this though I don't guarantee I did that right). The
> >>>>> issue is that a replacement function implemented as a closure, which
> >>>>> is the only option for a package, will always see NAMED on the object
> >>>>> to be modified as 2 (because the value is obtained by forcing the
> >>>>> argument promise) and so any R level assignments will duplicate.  This
> >>>>> also isn't really an issue of imprecise reference counting -- there
> >>>>> really are (at least) two legitimate references -- one through the
> >>>>> argument and one through the caller's environment.
> >>>>>
> >>>>> It would be good if we could come up with a way for packages to be
> >>>>> able to define replacement functions that do not duplicate in cases
> >>>>> where we really don't want them to, but this would require coming up
> >>>>> with some sort of protocol, minimally involving an efficient way to
> >>>>> detect whether a replacement function is being called in a replacement
> >>>>> context or directly.
> >>>>
> >>>> Would "$<-" always satisfy that condition? It would be a big help to me 
> >>>> if it could be designed to avoid duplicating the rest of the data.frame.
> >>>>
> >>>> --
> >>>>
> >>>>>
> >>>>> There are some replacement functions that use C code to cheat, but
> >>>>> these may create problems if called directly, so I won't advertise
> >>>>> them.
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> luke
> >>>>>
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Simon
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Luke Tierney
> >>>>> Statistics and Actuarial Science
> >>>>> Ralph E. Wareham Professor of Mathematical Sciences
> >>>>> University of Iowa                  Phone:             319-335-3386
> >>>>> Department of Statistics and        Fax:               319-335-3017
> >>>>> Actuarial Science
> >>>>> 241 Schaeffer Hall                  email:      l...@stat.uiowa.edu
> >>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
> >>>>> ______________________________________________
> >>>>> R-devel@r-project.org mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>>
> >>>> David Winsemius, MD
> >>>> West Hartford, CT
> >>>>
> >>>>
> >>
> >>
> >>
> >
> >
> 
> -- 
> Luke Tierney
> Statistics and Actuarial Science
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>     Actuarial Science
> 241 Schaeffer Hall                  email:      l...@stat.uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
