On 3/28/2006 5:46 PM, Steven Lacey wrote:
> Duncan & Gabor,
>
> It works! When I no longer save the environment associated with the formulas
> (and no other environments), the sizes of my saved objects are all around
> 350KB, which is actually smaller than their size in R. What a relief! That
> was driving me nuts!
>
> In R, is the environment of an object a pointer? Is it only when the object
> is saved (and the environment may no longer exist when loaded again) that
> the objects in the environment are themselves saved, as opposed to a pointer
> to an environment?
Not quite pointers, but environments are stored as references. Not all
objects have environments. Functions do, and a few other objects where
symbols might need to be evaluated (such as formulas).

By the way, just for fun this afternoon I started writing a patch to the
serialize code that reports on file offsets of things as it reads them.
I won't commit this to the main build because

 - it's not that accurate
 - I can't be bothered making it perfect
 - it makes the code ugly

but I might post the patch somewhere.

Duncan Murdoch

>
> Thanks again!
> Steve
>
> -----Original Message-----
> From: Duncan Murdoch [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, March 28, 2006 3:02 PM
> To: Steven Lacey
> Cc: 'Gabor Grothendieck'; [email protected]
> Subject: Re: [R] object size vs. file size
>
>
> On 3/28/2006 2:54 PM, Steven Lacey wrote:
>> Duncan,
>>
>> I wrote an R package to process my data. The package was written in
>> such a way that I no longer stored functions themselves in my "sa"
>> objects, just their names (as strings) instead. I re-ran my analysis
>> and found that, indeed, the saved object sizes were smaller when I was
>> not saving attached environments. However, I still find the object
>> size discrepancy. That is, I have two objects tmp and tmp1 that are
>> the same size in R (when calling object.size both are 870116 bytes),
>> but vastly different sizes as saved objects (tmp = 1091KB,
>> tmp1 = 8436KB).
>>
>> While saving the environment is an issue in overall size, I am not
>> sure it accounts for the difference in size. I am beginning to think
>> it has to do with the code used to generate the objects.
>>
>> To do the fitting (which creates tmp and tmp1 objects):
>>
>> 1) d.rt <- split a dataframe
>> 2) define a list called arg, which defines all the parameters for the
>>    fitting
>>
>> My problem is that I need to call the function that does the fitting
>> (df2sa) once for each dataframe in the list d.rt with the parameters
>> specified in arg.
>> To do this I add two additional components to arg
>> list:
>>
>> arg$X <- d.rt
>> arg$FUN <- "df2sa.models" # This function manages the fitting for each dataframe in d.rt.
>>
>> Now I call:
>> do.call("lapply", arg)
>> I expect it to call df2sa for each dataframe in d.rt passing in the
>> remaining parameters in the arg list. The code "works" in the sense
>> that I get the returned objects, but when I save them the sizes are
>> strange, as described above.
>>
>> I obtain the "small" version of the same object when I call:
>> tmp <- do.call(df2sa, arg).
>>
>> In this case there is no lapply wrapper. Somehow lapply is adding
>> something more to what is returned, but I am not sure what or how.
>> What is also strange is that the object in question is not the last
>> element in d.rt, so it's not as if lapply is returning everything in that
>> one object.
>>
>> I attached the object files again and the class definitions required
>> to view them. However, note that the object names differ from the ones
>> used above.
>>
>> tmp = incompat
>> tmp1 = x0302.incompatible.RT.fits
>>
>> Please help!
>
> Sorry, I can't really help. I suspect it's still an issue of
> environments, but you'll need to find someone who knows the S4 internals
> better than me to figure out where the environments are hiding.
>
> Duncan Murdoch
>
>> Thanks,
>> Steve
>>
>> -----Original Message-----
>> From: Duncan Murdoch [mailto:[EMAIL PROTECTED]
>> Sent: Sunday, March 26, 2006 10:34 AM
>> To: Gabor Grothendieck
>> Cc: Steven Lacey; [email protected]
>> Subject: Re: [R] object size vs. file size
>>
>>
>> On 3/25/2006 10:16 PM, Gabor Grothendieck wrote:
>>> You can place functions in lists or environments and pass the
>>> environment to the function and have it look there first. That way you
>>> can have different versions of a function with the same name.
>>>
>>> 1. Here is an example using lists:
>>>
>>> A <- list(f = sin)
>>> B <- list(f = cos)
>>> f <- function(x) x+2
>>>
>>> myfun <- function(x, L = NULL) with(L, f)(x)
>>>
>>> myfun(0) # 2
>>> myfun(0, A) # 0
>>> myfun(0, B) # 1
>>>
>>> All three of the above make a call to f but the first uses the f in
>>> the global environment, the second uses the f in A and the third uses
>>> the f in B.
>>>
>>> 2. Above we illustrated this using lists but it can also be done
>>> using environments. In the following we use the proto package to
>>> facilitate this. proto objects are built on top of environments. For
>>> example, you could replace the first two lines in the prior example with:
>>>
>>> library(proto)
>>> A <- proto(f = sin)
>>> B <- proto(f = cos)
>>>
>>> Note that in #1 and #2 myfun did have to be programmed to handle
>>> this. Another way to do this which does not require myfun to be
>>> preprogrammed is the following:
>>>
>>>
>>> library(proto)
>>> A <- proto(f = sin)
>>> B <- proto(f = cos)
>>> myfun <- function(x) f(x)
>>>
>>> myfun(0) # 2
>>> with(A$proto(myfun = myfun), myfun)(0) # 0
>>> with(B$proto(myfun = myfun), myfun)(0) # 1
>>>
>>> The first with statement defines a child object of A which contains
>>> a single method myfun, A$proto(myfun = myfun). Then it calls the
>>> myfun in that new object. Since the new object is a child of A,
>>> myfun will look for f in the new object and, not finding it, will
>>> search the parent A and find it there. Similarly for B in the second
>>> with statement.
>>>
>>>
>>>
>>> Regarding removing environments, if f is a function you can do this:
>>>
>>> environment(f) <- NULL
>>>
>>> but you will likely need to restore the environment prior to using f.
>> That will get you a warning in 2.3.0 (and replace the NULL with
>> baseenv()), and an error in 2.4.0. In current and past versions, a NULL
>> wasn't interpreted as "no environment", it was interpreted as the base
>> environment.
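A minimal sketch of the point about function environments, assuming a current
R version. The names ser_size and make_g are hypothetical helpers introduced
only for this illustration; length(serialize(x, NULL)) is used as a rough
stand-in for the uncompressed size that save() would have to write.

```r
# Hypothetical helper: bytes needed to serialize x, captured environments included.
ser_size <- function(x) length(serialize(x, NULL))

# Hypothetical factory: the returned closure keeps the frame of make_g()
# alive, including the large unused vector.
make_g <- function() {
  notused <- numeric(1e5)   # about 800 KB of doubles
  function() 1
}

g <- make_g()
before <- ser_size(g)       # includes the captured frame and 'notused'

# Re-point the closure at the global environment. This is safe here only
# because g() uses nothing from the frame it was created in.
environment(g) <- globalenv()
after <- ser_size(g)        # the global environment is written as a reference

# With emptyenv() as its environment a function cannot even find "+".
h <- function() 1 + 1
environment(h) <- emptyenv()
res <- tryCatch(h(), error = function(e) "error")
```

Comparing before and after shows how much of the serialized size came from the
captured frame rather than from the function itself. Whether re-pointing the
environment is acceptable depends on what the function looks up at call time.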
>>
>> If you want something that is like "no environment", you can use
>> emptyenv() in 2.3.0, but this would rarely make sense for an R function:
>> even the most basic things involved in evaluation need to come from
>> somewhere. emptyenv() is mainly designed for situations where you want
>> an entirely separate namespace, not related to R functions at all, but
>> using the same syntax and rules for lookups.
>>
>> Duncan Murdoch
>>
>>> On 3/25/06, Steven Lacey <[EMAIL PROTECTED]> wrote:
>>>> Duncan,
>>>>
>>>> Thanks! This is progress! One solution might be to remove all
>>>> environments from the objects that I want to save in the "sa" object,
>>>> thereby avoiding the problem of saving environments altogether. But,
>>>> can I remove the environment from a function? Does that even make
>>>> sense given how R operates under the hood? Even if I could, would the
>>>> functions still work?
>>>>
>>>> Here is my more general problem. As I learn more about R and the
>>>> demands made on my code change, I sometimes change a function
>>>> referenced by a given name rather than explicitly defining a new
>>>> version of that function. This creates a problem when I want to
>>>> review how the model stored in the "sa" object was originally
>>>> created. If only the function name is stored in the "sa" object, I
>>>> won't necessarily know what version was actually called at the time
>>>> the model was constructed because I did not rename it. To deal with
>>>> this I decided to store the function itself.
>>>>
>>>> Sounds like this may not be a great idea, or at least comes with
>>>> serious trade-offs, particularly as some functions are generic like
>>>> the mean. Is there a better way to save a function than to save the
>>>> function itself or just its name? For instance, do args() and body()
>>>> return an associated environment? I assume I could recreate the
>>>> original function from these objects, correct? If so, is there some
>>>> easy way to do it?
>>>>
>>>> Alternatively, are there any version control tools built into R?
>>>> That is, is there a way R can keep track of the version for me (as
>>>> opposed to explicitly declaring different versions foo<-...,
>>>> foo.v1<-..., foo.v2<-...)? I am not sure exactly what I am asking
>>>> for here. The more I write the more this seems unreasonable. A new
>>>> function requires a new name, right? I just find myself writing lots
>>>> of new versions and keeping track of their names, which one does
>>>> what, and changing the names in other functions that call them a
>>>> little overwhelming. Maybe the way to deal with this is to write
>>>> different versions of the same package. That way the versions will
>>>> affect the naming of and the call to load the package, but not the
>>>> calls to individual functions. This way functions can have the same
>>>> name, but do different things depending on the package version, not
>>>> the function name. However, I have never created a package and would
>>>> prefer not to do so in the short-term (my dissertation is due in
>>>> August), unless it is fairly straightforward.
>>>>
>>>> The more I think about it, a package is more accurately what I want.
>>>> I want to be able to recreate the analysis of my data long after it
>>>> has been completed. If I had packages, then I would just need to know
>>>> what version of the package was used, load it, and re-run the
>>>> analysis. I wouldn't need to store the critical functions in the
>>>> object. Where might I find a good introduction to writing packages?
>>>>
>>>> In the short-term would the solution above (using body and args)
>>>> work?
>>>>
>>>> Thanks again,
>>>> Steve
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Duncan Murdoch [mailto:[EMAIL PROTECTED]
>>>> Sent: Saturday, March 25, 2006 5:31 PM
>>>> To: Steven Lacey
>>>> Cc: [email protected]
>>>> Subject: Re: [R] object size vs. file size
>>>>
>>>>
>>>> On 3/25/2006 7:32 AM, Steven Lacey wrote:
>>>>> Hi,
>>>>>
>>>>> There is a rather large discrepancy between the size of the object
>>>>> as it lives in R and the size of the object when it is written to
>>>>> the disk. The object in question is an S4 object of a homemade class
>>>>> "sa". I first call a function that writes a list of these objects to
>>>>> a file called "data.RData". The size of this file is 14,411 KB. I
>>>>> would assume on average then, that each list component--there are 32
>>>>> sa objects in data.RData--would be approximately 450 KB (14,411/32).
>>>>> However, when I load the data into R and call object.size on just
>>>>> one S4 object (call it tmp) it returns 77496 bytes (77 KB)! What is
>>>>> even stranger is that if I save this S4 object alone by calling
>>>>> save(tmp, file="tmp.RData"), tmp.RData is 13.3 MB! I understand from
>>>>> the help on object.size that the object size is only approximate and
>>>>> excludes the space required to store its name in the symbol table.
>>>>> But, this difference in object size and file size is huge! This
>>>>> phenomenon occurs no matter which S4 object I save from data.RData.
>>>>>
>>>>> Why is the object so big when it is in a file? What else is getting
>>>>> stored with it? I have examined the object in R to find additional
>>>>> information stored with it, but have not found anything that would
>>>>> account for the size of the object in the file system. For example,
>>>>> > environment(tmp)
>>>>> NULL
>>>> I'm not 100% sure where the problem is, but I think it probably does
>>>> involve environments. Your tmp object contains a number of
>>>> functions. I think when some function is saved, its environment is
>>>> being saved too, and the environment contains much more than you
>>>> thought.
>>>>
>>>> R doesn't normally save a new copy of a package or namespace
>>>> environment when it saves a function, nor does it save a complete
>>>> copy of .GlobalEnv with every function defined there, but it does
>>>> save the environment in some other circumstances. For example, look
>>>> at this code:
>>>>
>>>> > f <- function() {
>>>> +   notused <- 1:1000000
>>>> +   value <- function() 1
>>>> +   return(value)
>>>> + }
>>>> >
>>>> > g <- f()
>>>> > g
>>>> function() 1
>>>> <environment: 01B10D1C>
>>>> > save(g, file='g.RData')
>>>> > object.size(g)
>>>> [1] 200
>>>>
>>>> The g object is 200 bytes or so, but when it is saved, the defining
>>>> environment containing that huge "notused" variable is saved with it,
>>>> so g.RData ends up being about 4 Megabytes in size.
>>>>
>>>> I don't know any function that will help to diagnose where this
>>>> happens. Here's one that doesn't quite work:
>>>>
>>>> findenvironments <- function(x) {
>>>>   e <- environment(x)
>>>>   if (is.null(e)) result <- NULL
>>>>   else result <- list(e)
>>>>   x <- unclass(x)
>>>>   if (is.list(x)) {
>>>>     for (i in seq(along=x)) {
>>>>       contained <- findenvironments(x[[i]])
>>>>       if (length(contained)) result <- c(result, contained)
>>>>     }
>>>>   }
>>>>   if (length(result)) browser()
>>>>   result
>>>> }
>>>>
>>>> This won't recurse into the slots of an S4 object, so it doesn't
>>>> really help you, and I'm not sure how to do that. But maybe someone
>>>> else can fix it.
>>>>
>>>> Duncan Murdoch
>>>>
>>>> ______________________________________________
>>>> [email protected] mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide!
>>>> http://www.R-project.org/posting-guide.html
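One possible way to extend the findenvironments idea so that it also descends
into S4 slots is sketched below. This is not code from the thread: find_envs,
the "Fit" class and make_fun are hypothetical names made up for illustration,
and the sketch assumes a current R version with the methods package. It
returns a list of environments named by the path at which each was found, and
serializes each one to show which is large.

```r
library(methods)

# Hypothetical helper: collect the environments attached to x or to anything
# nested inside it (list elements and S4 slots), named by where they were found.
find_envs <- function(x, path = "x") {
  if (is.null(x)) return(list())   # environment(NULL) would return the caller's frame
  out <- list()
  e <- environment(x)              # set for closures and formulas, NULL otherwise
  if (is.environment(e)) out[[path]] <- e
  if (isS4(x)) {
    # .Data is skipped because the list branch below already covers it
    for (s in setdiff(slotNames(x), ".Data")) {
      out <- c(out, find_envs(slot(x, s), paste0(path, "@", s)))
    }
  }
  if (is.list(x)) {
    x <- unclass(x)
    labels <- names(x)
    if (is.null(labels)) labels <- rep("", length(x))
    for (i in seq_along(x)) {
      label <- if (nzchar(labels[i])) paste0("$", labels[i]) else paste0("[[", i, "]]")
      out <- c(out, find_envs(x[[i]], paste0(path, label)))
    }
  }
  out
}

# Hypothetical S4 class with a formula slot, a function slot and a list slot.
setClass("Fit", slots = c(model = "formula", fun = "function", data = "list"))

# Hypothetical factory whose closures capture a large unused vector.
make_fun <- function() {
  big <- numeric(1e5)
  function(x) x + 1
}

obj <- new("Fit", model = y ~ x, fun = make_fun(),
           data = list(a = 1, g = make_fun()))

envs <- find_envs(obj, "obj")
# Serialized size of each environment found, as a rough guide to what a
# save() of obj would have to carry along.
sizes <- vapply(envs, function(e) length(serialize(e, NULL)), integer(1))
```

Printing sizes gives one entry per path, so an unexpectedly large environment
can be traced to the slot or list element that holds it. The sketch does not
follow environments nested inside other environments, nor attributes other
than the formula environment, so it may still miss some cases.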
