Tim Hesterberg <rocket <at> google.com> writes:
> * however, Luke Tierney is looking at that too and trying to
>   change R to make those tricks unnecessary.  He made a change
>   to the development version that may make the "attributes<-"(...)
>   trick I use unnecessary.
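
(Aside for the list: the copies in question can be made visible with
tracemem(), which prints a message each time R duplicates the marked
object. A minimal sketch with toy data, assuming an R build where
memory profiling is enabled so tracemem() is available:)

    DF <- data.frame(a = 1:3, b = 4:6)
    tracemem(DF)              # report each duplication of DF from here on
    names(DF) <- c("x", "y")  # primitive replacement function; whether a
                              # copy is reported depends on the R version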
Great. That's now made its way through to NEWS, and it seems to cover
names()<- too:

    2.15.0 patched
    PERFORMANCE IMPROVEMENTS
    There is less copying when using primitive replacement
    functions such as names(), attr() and attributes().

Will look forward to testing that out, and maybe we can simplify some
of data.table too. Hopefully they won't copy DF at all?

Assignment by reference in data.table is about avoiding even a single
copy (a sketch is in the P.S. below). If names()<-, attr()<- and
attributes()<- still copy DF, even just once, then it'll still be
infinitely faster to use the set* functions in data.table
(any time / 0 = Inf). But the practical concern is really 'out of
memory', or the later time to garbage collect, not the Inf speedup
factor. Copying a 50GB data.table once on a 128GB machine isn't an
insignificant time, either. Say that takes 2 seconds; but what about
the other users on the server who are then squeezed into 28GB, or
swapped out to disk? When you're swapped out, performance falls off a
cliff even for the simplest task.

The announcement about data.frame detailed reductions in the number of
copies, but to numbers greater than 0 as far as I could see. And the
item in NEWS says "less copying", which leaves it unclear whether no
copies at all are made in any cases. example(setnames) shows 4,3,1...0
(in 1.8.0) and 4,3,2,1...0 (in 1.8.1). If base R can manage to reduce
copies to 0 in many cases, that would be fantastic. That's why I
posted "confused about NAMED" to r-devel (Nov 2011), trying to get
changes like that made. Luke said he would look at it then, so it's
exciting that he has:
http://r.789695.n4.nabble.com/Confused-about-NAMED-tp4103326p4105017.html

Also, have you seen the last paragraph of data.table FAQ 1.8? A second
proposal there was to use memcpy in duplicate.c, which is much faster
than a for loop in C. This would improve the way that R copies data
internally (on some measures by 13 times); the P.P.S. below gives a
rough feel for the cost of a single copy. The thread on r-devel is
here:
http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html
It isn't just the number of copies, but the _way_ R copies. Prof
Ripley placed a FIXME in duplicate.c in 2006 (iirc). Perhaps someone
could take a look at the thread linked in the FAQ and help grease the
cogs there? If r-devel just fixed that FIXME, it could speed up R a
lot on large objects, for the copies it does still make.

Matthew
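
P.S. For anyone following along, a minimal sketch of assignment by
reference as discussed above (toy data and an illustrative attribute
name; assuming data.table 1.8.0 or later and an R build where
tracemem() is available):

    library(data.table)
    DT <- data.table(a = 1:3, b = 4:6)
    tracemem(DT)                # should stay silent for the lines below
    setnames(DT, "a", "x")      # renames the column in place, no copy
    setattr(DT, "myattr", 42L)  # sets an attribute in place, no copy

Neither set* call assigns a duplicated object back to DT, so tracemem
reports nothing; avoiding that duplication is the whole point of the
set* family.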

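P.P.S. To get a rough feel at the R level for the cost of even one
copy (the cost the duplicate.c memcpy proposal targets), with sizes
purely illustrative:

    x <- numeric(1e8)       # ~0.8GB of doubles; scale towards your RAM
    y <- x                  # no copy yet: x and y share the same memory
    system.time(y[1] <- 0)  # the first write forces a full duplication

On a server that's already tight on memory, that duplication is
exactly when swapping kicks in.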