Dear BioC developers,

I am trying to understand how to use mclapply() without blowing up the
memory usage and need some help.

My use case is splitting a large IRanges::DataFrame() into chunks and
feeding these chunks to mclapply(). Let's say that I am using n cores and
that the operation I perform on each chunk uses K memory units.
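To make the pattern concrete, here is a minimal sketch of what I mean (the
chunk count, the stand-in data.frame, and the colMeans() placeholder are all
made up for illustration; the real operation is more involved):

```r
library(parallel)

## Stand-in for the large DataFrame
df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

n <- 4  ## number of cores
idx <- split(seq_len(nrow(df)), cut(seq_len(nrow(df)), n))

res <- mclapply(idx, function(i) {
    chunk <- df[i, ]   ## each forked job materializes its own chunk
    colMeans(chunk)    ## placeholder for the real operation (~K memory units)
}, mc.cores = n)
```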

I understand that the individual jobs in mclapply() cannot detect how the
others are doing or whether they need to run gc(). While this, coupled with
the n * K baseline, could explain somewhat higher memory usage, I am
running into memory loads much higher than expected.

I have tried:
1) pre-splitting the data into a list (one element per chunk),
2) assigning the elements of the list as elements of an environment and
then using mclapply() over a set of indexes,
3) saving each chunk in its own Rdata file, then using mclapply() with a
function that loads the appropriate chunk and performs the operation of
interest.

Strategy 3 performs best in terms of maximum memory usage, but I am afraid
it is more error prone because it requires writing to disk.
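For reference, a rough sketch of strategy 3 as I have it (the stand-in
data.frame, file names, chunk count, and the colMeans() placeholder are
hypothetical; the full code is in the link below):

```r
library(parallel)

## Stand-in for the large DataFrame
df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

## Write each chunk to its own Rdata file
idx <- split(seq_len(nrow(df)), cut(seq_len(nrow(df)), 4))
files <- character(length(idx))
for (j in seq_along(idx)) {
    chunk <- df[idx[[j]], ]
    files[j] <- file.path(tempdir(), paste0("chunk", j, ".Rdata"))
    save(chunk, file = files[j])
}
rm(chunk)
gc()  ## drop loop temporaries before forking

## Each job loads only its own chunk from disk
res <- mclapply(files, function(f) {
    load(f)          ## brings 'chunk' into the job's environment
    colMeans(chunk)  ## placeholder for the real operation
}, mc.cores = 4)
```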

Do you have any other ideas/tips on how to reduce the memory load? In other
words, is there a strategy to reduce the number of copies as much as
possible when using mclapply()?


I have a full example (with data.frame instead of DataFrame) and code
comparing the three options described above at http://bit.ly/1ar71yA


Thank you,
Leonardo

Leonardo Collado Torres, PhD student
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://bit.ly/FellBit

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel