In one R session I did library(SummarizedExperiment) and then saved search(). 
In another R session I loaded the packages on the search path in reverse order, 
recording pryr::mem_used() after each. I ended up with

                      mem_used (bytes)
methods               25870312
datasets              30062016
utils                 30062136
grDevices             30062256
graphics              30062376
stats                 30062496
stats4                32262992
parallel              32495080
BiocGenerics          38903928
S4Vectors             59586928
IRanges              100171896
GenomeInfoDb         113791328
GenomicRanges        154729400
Biobase              163335336
matrixStats          163518520
BiocParallel         167373512
DelayedArray         280812736
SummarizedExperiment 317386656
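
A minimal sketch of how this kind of measurement can be scripted (the package 
vector is assumed to match the search() output above; the base packages are 
omitted because they are already attached at startup):

    ## Sketch: attach the packages bottom-up, recording pryr::mem_used()
    ## (in bytes) after each library() call.
    pkgs <- c("stats4", "parallel", "BiocGenerics", "S4Vectors", "IRanges",
              "GenomeInfoDb", "GenomicRanges", "Biobase", "matrixStats",
              "BiocParallel", "DelayedArray", "SummarizedExperiment")
    mem_used <- vapply(pkgs, function(pkg) {
        suppressPackageStartupMessages(library(pkg, character.only = TRUE))
        as.numeric(pryr::mem_used())
    }, numeric(1))
    cbind(mem_used)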

Each of the Bioconductor dependencies of SummarizedExperiment contributes to the 
overall size. Two dependencies (Biobase, DelayedArray) look a little 
unnecessary to me (they do not provide functionality that must be used by 
SummarizedExperiment), but removing them only reduces the total footprint to 
about 300MB. It makes sense that a package like SummarizedExperiment uses the 
data structures defined in other packages, and that it therefore has a complex 
dependency graph; what is surprising is how large the final footprint is.
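
One quick way to see where the footprint accumulates is to list the recursive 
strong dependencies, e.g. (a sketch, run against the installed library):

    ## Sketch: every package in this list is loaded when SummarizedExperiment
    ## is attached, and each contributes to the overall footprint.
    db <- installed.packages()
    deps <- tools::package_dependencies("SummarizedExperiment", db = db,
                                        which = c("Depends", "Imports"),
                                        recursive = TRUE)
    deps[["SummarizedExperiment"]]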

One possible way to avoid at least some of the cost is to list 
SummarizedExperiment in the Imports: field of the DESCRIPTION file but not 
mention it in the NAMESPACE, and to use SummarizedExperiment::assay() in the 
code. I think this has complicated side effects, e.g., adding methods to the 
imported methods table in your package (look for ".__T__" and ".__C__" entries 
(generic and class definitions) in ls(parent.env(getNamespace(<your 
package>)))), that indirectly increase the size of your package.
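
For example, a quick check along those lines (a sketch; "MyPkg" is a 
placeholder for your package name):

    ## Sketch: list generic (.__T__) and class (.__C__) entries in the imports
    ## environment of an installed package; all.names = TRUE is needed because
    ## these names start with a dot.
    imp <- ls(parent.env(getNamespace("MyPkg")), all.names = TRUE)
    grep("^\\.__[TC]__", imp, value = TRUE)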

I'm not exactly sure what you mean in your second paragraph; maybe a specific 
example (if necessary, create a small package on github) would help. It sounds 
like you're saying that even with doSNOW there are additional costs to loading 
your package on the worker compared to on the master...

Martin

On 1/5/19, 2:44 PM, "Lulu Chen" <luluc...@vt.edu> wrote:

    Hi Martin,
    
    
    Thanks for your explanation, which makes me understand BiocParallel much 
    better.
    
    
    I compared memory usage in my code before it was packaged (using doSNOW) and 
    after packaging (using BiocParallel), and found that the increase is caused 
    by the attached packages, especially 'SummarizedExperiment'. 
    As required to support the common Bioconductor class, I used 
    importFrom(SummarizedExperiment, assay). After deleting this, each thread 
    saves nearly 200 Mb of memory. I opened a new R session and found
    > pryr::mem_used()
    38.5 MB
    > library(SummarizedExperiment)
    
    > pryr::mem_used()
    314 MB
    
    (I am still using R 3.5.2, so I am not sure whether anything has changed in 
    the development version.) I think this should be treated as an issue: a lot 
    of packages import SummarizedExperiment just to support the class and never 
    realize it can cause such a problem.
    
    
    My package still imports other packages, e.g. limma and fdrtool. Checked 
    with pryr::mem_used() as above, each adds only 1~2 Mb. I also checked 
    my_package in a new session, which is around 5 Mb. However, each thread in 
    the parallel computation still grows by much more than 5 Mb. I did a 
    simulation: in my old code with doSNOW, I just inserted 
    require('my_package') into the foreach loop and kept the other code the 
    same. I used 20 cores and 1000 jobs. Each thread still grows by 20~30 Mb. I 
    don't know whether anything else causes extra cost for each thread. Thanks!
    
    
    Best,
    Lulu
    
    
    
    
    
    
    On Fri, Jan 4, 2019 at 2:38 PM Martin Morgan <mtmorgan.b...@gmail.com> 
wrote:
    
    
    Memory use can be complicated to understand.
    
        library(BiocParallel)
    
        v <- replicate(100, rnorm(10000), simplify=FALSE)
        bplapply(v, sum)
    
    by default, bplapply splits the 100 jobs (each element of the list) equally 
    between the number of cores available, and sends just the necessary data to 
    the cores. Again by default, the jobs are sent 'en masse' to the cores, so 
    if there were 10 cores (and hence 10 tasks), the first core would receive 
    the first 10 jobs and 10 x 10000 elements, and so on. The memory used to 
    store v on the workers would be approximately the size of v: # of workers * 
    jobs per worker * job size = 10 * 10 * 10000.
    
    If memory were particularly tight, or if computation time for each job were 
    highly variable, it might be advantageous to send jobs one at a time, by 
    setting the number of tasks equal to the number of jobs: 
    SnowParam(workers = 10, tasks = length(v)). Then the amount of memory used 
    to store v would only be # of workers * 1 * 10000; this is generally slower, 
    because there is much more communication between the manager and the 
    workers.
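
    For instance, a minimal sketch (assuming 10 workers are available):

        ## Sketch: one task per job, so each worker holds only one element of v
        ## at a time, at the cost of more manager / worker communication.
        library(BiocParallel)
        v <- replicate(100, rnorm(10000), simplify = FALSE)
        param <- SnowParam(workers = 10, tasks = length(v))
        res <- bplapply(v, sum, BPPARAM = param)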
    
        m <- matrix(rnorm(100 * 10000), 100, 10000)
        bplapply(seq_len(nrow(m)), function(i, m) sum(m[i, ]), m = m)
    
    Here bplapply doesn't know how to send just some rows to the workers, so 
each worker gets a complete copy of m. This would be expensive.
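
    One workaround, as a rough sketch, is to split the matrix into a list of 
    rows on the manager first, so that bplapply again sends each worker only 
    its slice:

        ## Sketch: build a list of rows up front; bplapply then ships only the
        ## relevant rows to each worker rather than a full copy of m.
        rows <- lapply(seq_len(nrow(m)), function(i) m[i, ])
        bplapply(rows, sum)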
    
        f <- function(x) sum(x)
    
        g <- function() {
            v <- replicate(100, rnorm(10000), simplify=FALSE)
            bplapply(v, f)
        }
    
    this has the same memory consequences as the first example: the function 
    `f()` is defined in the .GlobalEnv, so only the function definition (small) 
    is sent to the workers.
    
    
        h <- function() {
            f <- function(x) sum(x)
            v <- replicate(100, rnorm(10000), simplify=FALSE)
            bplapply(v, f)
        }
    
    This is expensive. The function `f()` is defined in the body of the 
    function `h()`, so the workers receive both the function f and the 
    environment in which it is defined. That environment includes v, so each 
    worker receives a slice of v (for f() to operate on) AND an entire copy of 
    v (because v is in the environment where `f()` was defined). A similar cost 
    would be paid in a package, if the package defined large data objects at 
    load time.
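
    A rough way to see the difference (a sketch; serialize() includes a 
    closure's enclosing environment unless that environment is the global 
    environment or a namespace):

        ## Sketch: compare the serialized size of a top-level function with
        ## that of a function defined next to a local copy of v, whose
        ## environment is shipped along with it.
        f_top <- function(x) sum(x)
        payload <- local({
            v <- replicate(100, rnorm(10000), simplify = FALSE)
            f_inner <- function(x) sum(x)
            c(top_level = length(serialize(f_top, NULL)),
              inner     = length(serialize(f_inner, NULL)))
        })
        payload  # 'inner' is roughly the size of v; 'top_level' is tiny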
    
    For more guidance, it might be helpful to provide a simplified example of 
what you did with doSNOW, and what you do with BiocParallel.
    
    Hope that helps,
    
    Martin
    
    On 1/3/19, 11:52 PM, "Bioc-devel on behalf of Lulu Chen" 
<bioc-devel-boun...@r-project.org on behalf of
    luluc...@vt.edu> wrote:
    
        Dear all,
    
        I met a memory issue with bplapply and SnowParam(). I need to calculate
        something from a large matrix many, many times. But from the discussion
        at https://support.bioconductor.org/p/92587, I learned that bplapply
        copies the current and parent environments to each worker thread. That
        means the large matrix in my package will be copied many times. Do you
        have better suggestions for the Windows platform?
    
        Before I tried to package my code, I used the doSNOW package with
        foreach %dopar%. It seemed to consume less memory in each core (almost
        the size of the matrix the task needs). But bplapply seems to copy
        more, including the objects in the current environment and in the
        environment one level above. I am very confused and just guess it is
        copying everything.
    
        Thanks for any help!
        Best,
        Lulu
    
    
    
    
    
    
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
