Thanks for the detailed analysis, Simon. I figured out a workaround that seems to be working in my real application. By limiting the length of the first argument to mclapply (to the number of cores), I get speedups while limiting the memory overhead.
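To make the batching idea concrete, here is a minimal sketch in base R of how the indices 1..N can be cut into consecutive batches of at most maxjobs elements. The helper name chunk.indices is hypothetical and used only for illustration; it shows the general technique and is not the exact splitting used in the function in this message.

```r
# Hypothetical helper (illustration only): cut the indices 1..N into
# consecutive batches holding at most maxjobs elements each.
chunk.indices <- function(N, maxjobs) {
  idx <- seq_len(N)
  # ceiling(idx / maxjobs) labels elements 1..maxjobs as batch 1, the
  # next maxjobs elements as batch 2, and so on.
  unname(split(idx, ceiling(idx / maxjobs)))
}

batches <- chunk.indices(10, 4)
str(batches)
```

Each batch can then be handed to mclapply in turn, so that no single call receives more than maxjobs elements.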
### Run mclapply inside of a for loop, ensuring that it never receives
### a first argument with a length more than maxjobs. This avoids some
### memory problems (swapping, or getting jobs killed on the cluster)
### when using mclapply(1:N, FUN) where N is large.
library(parallel)
## maxjobs falls back to 2, the same default mclapply uses for mc.cores,
## when the mc.cores option is not set.
maxjobs.mclapply <- function(X, FUN, maxjobs=getOption("mc.cores", 2L)){
  N <- length(X)
  ## Split 1:N into ceiling(N/maxjobs) groups of indices.
  i.list <- splitIndices(N, ceiling(N/maxjobs))
  result.list <- list()
  for(i in seq_along(i.list)){
    i.vec <- i.list[[i]]
    result.list[i.vec] <- mclapply(X[i.vec], FUN)
  }
  result.list
}

On Thu, Sep 3, 2015 at 5:27 PM, Simon Urbanek <simon.urba...@r-project.org> wrote:

> Toby,
>
> > On Sep 2, 2015, at 1:12 PM, Toby Hocking <tdho...@gmail.com> wrote:
> >
> > Dear R-devel,
> >
> > I am running mclapply with many iterations over a function that modifies
> > nothing and makes no copies of anything. It is taking up a lot of memory,
> > so it seems to me like this is a bug. Should I post this to
> > bugs.r-project.org?
> >
> > A minimal reproducible example can be obtained by first starting a memory
> > monitoring program such as htop, and then executing the following code
> > while looking at how much memory is being used by the system
> >
> > library(parallel)
> > seconds <- 5
> > N <- 100000
> > result.list <- mclapply(1:N, function(i)Sys.sleep(1/N*seconds))
> >
> > On my system, memory usage goes up about 60MB on this example. But it does
> > not go up at all if I change mclapply to lapply. Is this a bug?
> >
> > For a more detailed discussion with a figure that shows that the memory
> > overhead is linear in N, please see
> > https://github.com/tdhock/mclapply-memory
> >
>
> I'm not quite sure what is supposed to be the issue here. One would expect
> the memory used will be linear in the number of elements you process - by
> definition of the task, since you'll be creating linearly many more objects.
>
> Also using top doesn't actually measure the memory used by R itself (see
> FAQ 7.42).
>
> That said, I re-ran your script and it didn't look anything like what you
> have on your webpage. For the NULL result you end up dealing with all the
> objects you create in your test that overshadow any memory usage and
> stabilizes after garbage-collection. As you would expect, any output of top
> is essentially bogus up to a gc. How much memory R will use is essentially
> governed by the level at which you set the gc trigger. In the real world you
> actually want that to be fairly high if you can afford it (in gigabytes,
> not megabytes), because you often get much higher performance by delaying
> gcs if you don't have low total memory (essentially using the memory as a
> buffer). Given that the usage is so negligible, it won't trigger any gc on
> its own, so you're just measuring accumulated objects - which will be
> always higher for mclapply because of the bookkeeping and serialization
> involved in the communication.
>
> The real difference is only in the df case. The reason for it is that your
> lapply() there is simply a no-op, because R is smart enough to realize that
> you are always returning the same object so it won't actually create
> anything and just return a reference back to df - thus using no memory at
> all. However, once you split the inputs, your main session can no longer
> perform this optimization because the processing is now in a separate
> process, so it has no way of knowing that you are returning the object
> unmodified. So what you are measuring is a special case that is arguably
> not really relevant in real applications.
>
> Cheers,
> Simon
>
> > > sessionInfo()
> > R version 3.2.2 (2015-08-14)
> > Platform: x86_64-pc-linux-gnu (64-bit)
> > Running under: Ubuntu precise (12.04.5 LTS)
> >
> > locale:
> >  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
> >  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_CA.UTF-8
> >  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_CA.UTF-8
> >  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] parallel graphics utils datasets stats grDevices methods
> > [8] base
> >
> > other attached packages:
> > [1] ggplot2_1.0.1 RColorBrewer_1.0-5 lattice_0.20-33
> >
> > loaded via a namespace (and not attached):
> >  [1] Rcpp_0.11.6     digest_0.6.4     MASS_7.3-43
> >  [4] grid_3.2.2      plyr_1.8.1       gtable_0.1.2
> >  [7] scales_0.2.3    reshape2_1.2.2   proto_1.0.0
> > [10] labeling_0.2    tools_3.2.2      stringr_0.6.2
> > [13] dichromat_2.0-0 munsell_0.4.2    PeakSegJoint_2015.08.06
> > [16] compiler_3.2.2  colorspace_1.2-4

	[[alternative HTML version deleted]]

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel