See ?gctorture

Nawaaz Ahmed <nawaaz <at> inktomi.com> writes:
: Hi Folks,
:
: Thanks for all your replies and input. In particular, thanks Luke, for
: explaining what is happening under the covers. In retrospect, my example
: using save and load to demonstrate the problem I was having was a
: mistake - I was trying to reproduce the problem in a simple enough way,
: and I thought save and load were showing the same problem (i.e. an extra
: copy was being made). After carefully examining my gc() traces, I've
: come to realize that while there are copies being made, there is nothing
: unexpected about it - the failure to allocate memory is really because R
: is hitting the 3GB address limit imposed by my linux box during
: processing. So as Luke suggests, maybe 32 bits is not the right platform
: for handling large data in R.
:
: On the other hand, I think the problem can be somewhat alleviated
: (though not eliminated) if temporary variables were garbage collected
: immediately, so as to reduce the memory footprint and the fragmentation
: problem that malloc() is going to be faced with (gctorture() is probably
: too extreme). Most of the problems that I am having are in the coercion
: routines, which do create temporary copies. So in code of the form
: x = as.vector(x), it would be nice if the old value of x were garbage
: collected immediately (i.e. if there were no other references to it).
:
: nawaaz
:
: Luke Tierney wrote:
: > On Thu, 24 Feb 2005, Berton Gunter wrote:
: >
: >> I was hoping that one of the R gurus would reply to this, but as they
: >> haven't (thus far) I'll try. Caveat emptor!
: >>
: >> First of all, R passes function arguments by value, so as soon as you
: >> call foo(val) you are already making (at least) one other copy of val
: >> for the call.
: >
: > Conceptually you have a copy, but internally R tries to use a
: > copy-on-modify strategy to avoid copying unless necessary. There are
: > conservative approximations involved, so there is more copying than
: > one might like, but definitely not as much as this.
: >
: >> Second, you seem to implicitly assume that assign(..., env=) uses a
: >> pointer to refer to the values in the environment. I do not know how
: >> R handles environments and assignments like this internally, but your
: >> data seem to indicate that it copies the value and does not merely
: >> point to it (this is where R Core folks can shed more authoritative
: >> light).
: >
: > This assignment does just store the pointer.
: >
: >> Finally, it makes perfect sense to me that, as a data structure, the
: >> environment itself may be small even if it effectively points to (one
: >> of several copies of) large objects, so that object.size(an.environment)
: >> could be small although the environment may "contain" huge arguments.
: >> Again, the details depend on the precise implementation and need
: >> clarification by someone who actually knows what's going on here,
: >> which ain't me.
: >>
: >> I think the important message is that you shouldn't treat R as C, and
: >> you shouldn't try to circumvent R's internal data structures and
: >> conventions. R is a language designed to implement Chambers's S model
: >> of "Programming with Data." Instead of trying to fool R into handling
: >> large data sets, maybe you should consider whether you really **need**
: >> all the data in R at one time, and whether sensible partitioning or
: >> sampling to analyze only a portion or portions of the data might not
: >> be a more effective strategy.
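
A rough sketch of the behaviour Nawaaz describes above for x = as.vector(x),
and of what gctorture() changes (the object and its size are only
illustrative, not taken from the thread):

> gc()                                   # baseline
> x <- double(5e6)                       # ~40 MB numeric vector
> x <- as.vector(x, mode = "character")  # old and new x are both live during this call
> gc()                                   # old numeric vector is reclaimed only now
> gctorture(TRUE)                        # GC at every allocation - see ?gctorture
> gctorture(FALSE)                       # and turn it back off; far too slow for real work

The point is only that the unreferenced old vector lingers until the next
collection, which is exactly the window gctorture() shrinks, at considerable
cost in speed.
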
: >
: > R can do quite a reasonable job with large data sets on a reasonable
: > platform. A 32 bit platform is not a reasonable one on which to use R
: > with 800 MB chunks of data. Automatic memory management combined with
: > the immutable vector semantics requires more elbow room than that. If
: > you really must use data of this size on a 32-bit platform you will
: > probably be much happier using a limited amount of C code and external
: > pointers.
: >
: > As to what is happening in this example: look at the default parent
: > used by new.env and combine that with the fact that the serialization
: > code does not preserve sharing of atomic objects. The two references
: > to the large object are shared in the original session but lead to two
: > large objects in the saved image and after the load. Using
: >
: >     ref <- list(env = new.env(parent = .GlobalEnv))
: >
: > in new.ref avoids the second copy both in the saved image and after
: > loading.
: >
: > luke
: >
: >>> -----Original Message-----
: >>> From: r-help-bounces <at> stat.math.ethz.ch
: >>> [mailto:r-help-bounces <at> stat.math.ethz.ch] On Behalf Of Nawaaz Ahmed
: >>> Sent: Thursday, February 24, 2005 10:36 AM
: >>> To: r-help <at> stat.math.ethz.ch
: >>> Subject: [R] Do environments make copies?
: >>>
: >>> I am using environments to avoid making copies (by keeping
: >>> references). But it seems like there is a hidden copy going on
: >>> somewhere - for example, in the code fragment below I am creating a
: >>> reference to "y" (of size 500MB) and storing the reference in object
: >>> "data". But when I save "data" and then restore it in another R
: >>> session, gc() claims it is using twice the amount of memory.
: >>> Where/How is this happening?
: >>>
: >>> Thanks for any help in working around this - my datasets are just not
: >>> fitting into my 4GB, 32 bit linux machine (even though my actual data
: >>> size is around 800MB).
: >>>
: >>> Nawaaz
: >>>
: >>> > new.ref <- function(value = NULL) {
: >>> +     ref <- list(env = new.env())
: >>> +     class(ref) <- "refObject"
: >>> +     assign("value", value, envir = ref$env)
: >>> +     ref
: >>> + }
: >>> > object.size(y)
: >>> [1] 587941404
: >>> > y.ref = new.ref(y)
: >>> > object.size(y.ref)
: >>> [1] 328
: >>> > data = list()
: >>> > data$y.ref = y.ref
: >>> > object.size(data)
: >>> [1] 492
: >>> > save(data, file = "data.RData")
: >>>
: >>> ...
: >>>
: >>> run R again
: >>> ===========
: >>>
: >>> > load("data.RData")
: >>> > gc()
: >>>              used   (Mb) gc trigger   (Mb)
: >>> Ncells    141051    3.8     350000    9.4
: >>> Vcells 147037925 1121.9  147390241 1124.5
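
Putting Luke's suggestion into the original function, a version of new.ref
that does not capture its own evaluation frame might look like this
(untested sketch; only the parent= argument differs from the code above):

new.ref <- function(value = NULL) {
    # Parent the reference environment at .GlobalEnv so that new.ref's own
    # evaluation frame (which also binds `value`) is not reachable from the
    # reference.  Since serialization does not preserve sharing of atomic
    # vectors, keeping that frame out of the saved image avoids a second
    # copy of the large object both in the file and after load().
    ref <- list(env = new.env(parent = .GlobalEnv))
    class(ref) <- "refObject"
    assign("value", value, envir = ref$env)
    ref
}

If Luke's analysis above is right, gc() after load("data.RData") should then
report roughly half the Vcells usage shown in the original transcript.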