To minimize the additional memory used by mclapply, remember that mclapply works by forking processes. The advantage of forking is that as long as an object is not modified in either the parent or the child, the two processes share the memory for that object. In effect, a child process *only* uses a significant amount of memory when it modifies an existing object (triggering creation of a copy) or creates a new object.

In your case, there's no point in splitting the data (which results in creating copies). You only have to split the indices using parallel::splitIndices. I've tried to incorporate this into your gist: https://gist.github.com/DarwinAwardWinner/7463652

The key line is:

res4 <- mclapply(splitIndices(nrow(data), opt$mcores), function(i) rowMeans(data[i,]), mc.cores=opt$mcores)

Also, for concatenating the results, you can use "do.call(c, unname(res4))".
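
To make the approach concrete, here is a minimal self-contained sketch of the same idea. The toy data frame and the n_cores variable are stand-ins I made up for illustration (the gist uses opt$mcores), and drop=FALSE is added only as a precaution against single-row chunks; treat it as a sketch under those assumptions rather than the exact code in the gist.

```r
library(parallel)

# Hypothetical stand-in for the large data set in the original example.
set.seed(1)
data <- as.data.frame(matrix(rnorm(1000 * 5), nrow = 1000))

# Assumption: forking is not available on Windows, so use a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split only the row indices; `data` itself is not copied up front.
idx <- splitIndices(nrow(data), n_cores)

# Each child reads its rows from the `data` object inherited from the parent.
# Subsetting inside the child creates a new object for that chunk only.
res <- mclapply(idx, function(i) rowMeans(data[i, , drop = FALSE]),
                mc.cores = n_cores)

# Concatenate the per-chunk results into one vector, in row order.
means <- do.call(c, unname(res))
```

Because splitIndices returns contiguous, ordered index blocks, the concatenated vector should line up with the rows of the original data.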

On Thu Nov 14 00:13:41 2013, Leonardo Collado Torres wrote:
Dear BioC developers,

I am trying to understand how to use mclapply() without blowing up the
memory usage and need some help.

My use case is splitting a large IRanges::DataFrame() into chunks, and
feeding these chunks to mclapply(). Let's say that I am using n cores and
that the operation I am doing uses K memory units.

I understand that the individual jobs in mclapply() cannot detect how the
others are doing and if they need to run gc(). While this, coupled with
n * K, could explain a higher memory usage, I am running into higher than
expected memory loads.

I have tried
1) pre-splitting the data into a list (one element per chunk),
2) assigning the elements of the list as elements of an environment and then
using mclapply() over a set of indexes,
3) saving each chunk on its own Rdata file, then using mclapply with a
function that loads the appropriate chunk and then performs the operation
of interest.

Strategy 3 performs best in terms of max memory usage, but I am afraid that
it is more error prone due to having to write to disk.

Do you have any other ideas/tips on how to reduce the memory load? In other
words, is there a strategy to reduce the number of copies as much as
possible when using mclapply()?


I have a full example (with data.frame instead of DataFrame) and code
comparing the three options described above at http://bit.ly/1ar71yA


Thank you,
Leonardo

Leonardo Collado Torres, PhD student
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://bit.ly/FellBit

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
