Dear BioC developers,

I am trying to understand how to use mclapply() without blowing up the memory usage and need some help.
My use case is splitting a large IRanges::DataFrame() into chunks and feeding those chunks to mclapply(). Let's say that I am using n cores and that the operation I am performing uses K memory units. I understand that the individual jobs in mclapply() cannot detect how the others are doing, or whether they need to run gc(). While this, combined with the expected n * K footprint, could explain a somewhat higher memory usage, I am running into much higher memory loads than expected.

I have tried:

1) pre-splitting the data into a list (one element per chunk);
2) assigning the elements of the list as elements of an environment, then using mclapply() over a set of indexes;
3) saving each chunk in its own .Rdata file, then using mclapply() with a function that loads the appropriate chunk and performs the operation of interest.

Strategy 3 performs best in terms of maximum memory usage, but I am afraid it is more error prone because it has to write to disk. Do you have any other ideas/tips on how to reduce the memory load? In other words, is there a strategy that reduces the number of copies as much as possible when using mclapply()?

I have a full example (with data.frame instead of DataFrame) and code comparing the three options described above at http://bit.ly/1ar71yA

Thank you,
Leonardo

Leonardo Collado Torres, PhD student
Department of Biostatistics
Johns Hopkins University Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://bit.ly/FellBit

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
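For reference, strategy 3 from the message above can be sketched as follows. This is a minimal, self-contained illustration, not the poster's actual code (the real example is behind the bit.ly link): the data.frame `df`, the grouping column `chunk`, and the placeholder operation `myOp()` are all made up for demonstration. The idea is that the master process writes each chunk to disk and drops the in-memory copies before forking, so each worker loads only the one chunk it needs.

```r
library(parallel)

## Hypothetical example data: a data.frame split into 10 chunks by a
## grouping column. myOp() is a stand-in for the real operation.
set.seed(20130903)
df <- data.frame(x = rnorm(1e5), chunk = rep(1:10, each = 1e4))
myOp <- function(d) sum(d$x)  # placeholder for the expensive step

## Write each chunk to its own .Rdata file.
chunks <- split(df, df$chunk)
files <- vapply(names(chunks), function(i) {
    f <- file.path(tempdir(), paste0("chunk-", i, ".Rdata"))
    chunk <- chunks[[i]]
    save(chunk, file = f)
    f
}, character(1))

## Drop the in-memory copies so the forked children do not inherit them.
rm(chunks)
invisible(gc())

## Each worker loads just the chunk it was assigned.
res <- mclapply(files, function(f) {
    load(f)       # creates 'chunk' in the worker's environment
    myOp(chunk)
}, mc.cores = 2)
```

The trade-off, as noted above, is the extra disk I/O and the risk of stale or partially written files; the gain is that the forked workers never share (and therefore never copy-on-write duplicate) the full split list held by the master.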