Memory use can be complicated to understand.

    library(BiocParallel)
    
    ## a list of 100 jobs, each a numeric vector of 10000 elements
    v <- replicate(100, rnorm(10000), simplify=FALSE)
    bplapply(v, sum)

By default, bplapply() splits the 100 jobs (each element of the list) equally between 
the number of cores available, and sends just the necessary data to the cores. 
Again by default, the jobs are sent 'en masse' to the cores: if there were 
10 cores (and hence 10 tasks), the first core would receive the first 10 jobs, 
i.e., 10 x 10000 elements, and so on. The memory used to store v across the workers 
would be approximately the size of v: # of workers * jobs per worker * job 
size = 10 * 10 * 10000 elements.
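
To make the defaults explicit (a sketch along the lines of the example above; the 
choice of 10 workers is just for illustration):

    ## with the default tasks = 0, the 100 jobs are divided evenly
    ## ('en masse') among the 10 workers
    param <- SnowParam(workers = 10)
    bplapply(v, sum, BPPARAM = param)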

If memory were particularly tight, or if computation time for each job were 
highly variable, it might be advantageous to send jobs one at a time, by 
setting the number of tasks equal to the number of jobs, e.g., SnowParam(workers = 10, 
tasks = length(v)). Then the amount of memory used to store v would only be # 
of workers * 1 * 10000 elements; this is generally slower, because there is much more 
communication between the manager and the workers.
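
Concretely, a sketch of the one-job-at-a-time variant (again with a hypothetical 
10 workers):

    ## one task per job: each worker holds only a single element of v at a time
    param <- SnowParam(workers = 10, tasks = length(v))
    bplapply(v, sum, BPPARAM = param)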
    
    m <- matrix(rnorm(100 * 10000), 100, 10000)
    ## sum each row; m is passed to the worker function through '...'
    bplapply(seq_len(nrow(m)), function(i, m) sum(m[i, ]), m = m)

Here bplapply doesn't know how to send just some rows to the workers, so each 
worker gets a complete copy of m. This would be expensive.
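
One way around this (a sketch, not from the original post) is to split the matrix 
into a list of rows up front, so that bplapply() again ships only each worker's 
share of the data:

    ## asplit() (base R >= 3.6) turns m into a list of rows; each worker
    ## then receives only the rows belonging to its own tasks
    rows <- asplit(m, 1)
    bplapply(rows, sum)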

    f <- function(x) sum(x)
        
    g <- function() {
        v <- replicate(100, rnorm(10000), simplify=FALSE)
        bplapply(v, f)
    }

This has the same memory consequences as the first example: the function `f()` is 
defined in the .GlobalEnv, so only the (small) function definition is sent to the 
workers, along with each worker's slice of v.
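
As a quick check (a sketch, not part of the original post), the enclosing 
environment of `f()` shows why the serialized payload is small:

    ## f() is enclosed by the global environment, so no large objects are
    ## dragged along when it is serialized to the workers
    environment(f)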

    h <- function() {
        f <- function(x) sum(x)
        v <- replicate(100, rnorm(10000), simplify=FALSE)
        bplapply(v, f)
    }
        
This is expensive. The function `f()` is defined in the body of the function 
`h()`, so the workers receive both the function `f()` and the environment in which 
it is defined. That environment includes v, so each worker receives a slice of v 
(for `f()` to operate on) AND an entire copy of v (because v is in the 
environment where `f()` was defined). A similar cost would be paid in a 
package, if the package defined large data objects at load time.
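
One way to avoid the capture (a rough sketch, not from the original post; `h2()` is 
a hypothetical variant): reset the function's environment to the global environment 
before calling bplapply(), so that only the small function definition is serialized:

    h2 <- function() {
        f <- function(x) sum(x)
        v <- replicate(100, rnorm(10000), simplify=FALSE)
        ## environment(f) contains both f and v and would be serialized with
        ## f; re-point it at the global environment, since f() needs only x
        environment(f) <- globalenv()
        bplapply(v, f)
    }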

For more guidance, it might be helpful to provide a simplified example of what 
you did with doSNOW, and what you do with BiocParallel.

Hope that helps,

Martin

On 1/3/19, 11:52 PM, "Bioc-devel on behalf of Lulu Chen" 
<bioc-devel-boun...@r-project.org on behalf of luluc...@vt.edu> wrote:

    Dear all,
    
    I ran into a memory issue with bplapply() and SnowParam(). I need to calculate
    something from a large matrix many, many times. From the discussion in
    https://support.bioconductor.org/p/92587, I learned that bplapply copies
    the current and parent environments to each worker. That means the
    large matrix in my package will be copied many times. Do you have better
    suggestions for the Windows platform?
    
    Before I packaged my code, I used the doSNOW package with foreach
    %dopar%. It seemed to consume less memory on each worker (roughly the size of
    the part of the matrix that the task needs). But bplapply seems to copy more
    than the objects in the current environment and the environment one level up.
    I am confused and can only guess that it is copying everything.
    
    Thanks for any help!
    Best,
    Lulu
    