On 05/07/2017 20:39, Martin Morgan wrote:
On 07/05/2017 12:59 PM, Robert Castelo wrote:
dear developers,

in the framework of a package i maintain, VariantFiltering, i'm using the 'FilterRules' class defined in the S4Vectors package and i'm interested in serializing (e.g., saving to disk via 'saveRDS()') 'FilterRules' objects where some rules may be defined using functions.

my problem is that the resulting RDS files take much more space than expected because apparently the environment of the functions is also serialized.

a toy example reproducing the situation could be the following:

library(S4Vectors)

## define a function that creates a ~7Mb numerical vector
## and returns a FilterRules object on a function that has
## nothing to do with this vector, except for sharing its
## environment. this tries to reproduce the situation in which
## a 'FilterRules' object is defined within the package
## 'VariantFiltering' where the environment is full of stuff
## unrelated to the 'FilterRules' object being created.

f <- function() {
   z <- rnorm(1000000)
   g <- function(x) 2*x

I guess

    g <- function(x) 2 * x > 10

or similar would satisfy the requirement of FilterRules that a rule return a logical vector of the same length as its input.


oops, yes of course.

   fr <- FilterRules(list(g=g))
   fr
}


## call the previous function to get the FilterRules object

fr <- f()


## while the 'FilterRules' object takes 3.3 Kb ...

print(object.size(fr), units="Kb")
3.3 Kb


## ... serializing it takes ~7Mb

print(object.size(serialize(fr, NULL)), units="Mb")
7.6 Mb


I added the test case

  testthat::expect_equal(eval(fr, 1:10), rep(c(FALSE, TRUE), each=5))

but then

g <- function(x) x > 5

which is good for simplicity

i guess this is the expected behavior behind functions and environments, but after reading about this subject (e.g., http://adv-r.had.co.nz/Environments.html) i still haven't been able to figure out how to serialize the 'FilterRules' object without the associated environment or with a minimal one without unnecessary objects around.
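The behaviour described here can be reproduced with base R alone, without S4Vectors. The following is a minimal sketch, where make_rule() is a hypothetical stand-in for f() above: it lists what the closure's environment holds and compares object.size(), which does not count the environment, with the length of the serialized form, which does.

```r
## base-R sketch; make_rule() is a hypothetical stand-in for f() above
make_rule <- function() {
  z <- rnorm(1000000)          # large object unrelated to the rule
  g <- function(x) x > 5
  g
}

g <- make_rule()

## the closure's environment is the call frame of make_rule(),
## which still holds 'z' (and 'g' itself)
print(ls(environment(g)))

## object.size() does not follow the environment ...
print(object.size(g), units = "Kb")

## ... but serialize() does: the byte count includes the
## one million doubles stored in 'z'
print(length(serialize(g, NULL)))
```

The serialized length should come out above 8e6 bytes, since 'z' alone needs 8 bytes per element.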

i'm sure many of you will have an easy workaround for this. any help will be highly appreciated.

One possibility is to set the environment of g() to something that resolves appropriate symbols, e.g.,

f <- function() {
    z <- rnorm(1000000)
    g <- function(x) 2 * x > 10
    environment(g) <- baseenv()
    FilterRules(list(g=g))
}

the serialized size is then 11 kb and the test continues to pass. The environment needs to be baseenv() to resolve `*` and `>`; emptyenv() is too restrictive. A package namespace might often be appropriate (though maybe large).

Maybe that's a hack, and Michael or others will chime in with something better...
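To illustrate why emptyenv() is too restrictive, here is a small base-R sketch. The names g_empty and g_base are only illustrative. With an empty environment the operators themselves cannot be looked up, so the call fails; with baseenv() the rule evaluates as expected.

```r
g <- function(x) 2 * x > 10

## with emptyenv() there is nowhere to look up `*` and `>`,
## so calling the function signals an error
g_empty <- g
environment(g_empty) <- emptyenv()
res <- tryCatch(g_empty(1:10), error = function(e) conditionMessage(e))
print(res)

## with baseenv() the operators resolve and the rule works
g_base <- g
environment(g_base) <- baseenv()
print(g_base(1:10))
```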

thanks!! indeed this reduces the size down to 1 kb:

f <- function() {
  z <- rnorm(1000000)
  g <- function(x) x > 5
  environment(g) <- baseenv()
  fr <- FilterRules(list(g=g))
  fr
}

fr <- f()
testthat::expect_equal(eval(fr, 1:10), rep(c(FALSE, TRUE), each=5))

print(object.size(fr), units="Kb")
1Kb
print(object.size(serialize(fr, NULL)), units="Kb")
1Kb

how would one set the environment of the function to a package namespace?

wouldn't it make more sense to leave it with baseenv() and call 'require(pkg)' within the function to load whatever the function needs from package 'pkg'?
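One way to approach both questions, sketched here with the 'stats' package purely as an example: asNamespace() returns a package's namespace environment, which can be assigned with environment<-. An alternative that avoids attaching anything to the search path (as require() would) is to keep baseenv() and qualify the calls with '::'. The names g_ns and g_qual are illustrative, and the expectation that a namespace is serialized as a reference rather than by content is an assumption worth checking on your own objects.

```r
## a rule that needs median() from the 'stats' package
g <- function(x) x > median(x)

## option 1: enclose the function in the package namespace;
## asNamespace() loads the namespace if it is not loaded yet
g_ns <- g
environment(g_ns) <- asNamespace("stats")

## option 2: keep baseenv() and qualify the call with '::'
g_qual <- function(x) x > stats::median(x)
environment(g_qual) <- baseenv()

print(g_ns(1:10))
print(g_qual(1:10))

## compare the serialized sizes of the two variants
print(length(serialize(g_ns, NULL)))
print(length(serialize(g_qual, NULL)))
```

Calling require(pkg) inside the function would also work with baseenv(), since lookup continues from the base environment to the global environment and the search path, but it changes the search path as a side effect each time the rule is evaluated.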

robert.

Martin



thanks!!

robert.

_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

