Thanks for the detailed analysis, Simon. I figured out a workaround that
seems to be working in my real application. By limiting the length of the
first argument to mclapply (to the number of cores), I get speedups while
limiting the memory overhead.

### Run mclapply inside of a for loop, ensuring that it never receives
### a first argument with a length more than maxjobs. This avoids some
### memory problems (swapping, or getting jobs killed on the cluster)
### when using mclapply(1:N, FUN) where N is large.
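library(parallel) # provides mclapply and splitIndices
maxjobs.mclapply <- function(X, FUN, maxjobs=getOption("mc.cores", 2L)){
  N <- length(X)
  ## Split 1:N into ceiling(N/maxjobs) chunks of about maxjobs indices each.
  i.list <- splitIndices(N, ceiling(N/maxjobs))
  result.list <- vector("list", N)
  for(i in seq_along(i.list)){
    i.vec <- i.list[[i]]
    ## Each call forks workers for one small chunk only.
    result.list[i.vec] <- mclapply(X[i.vec], FUN)
  }
  result.list
}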
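
To see what the chunking step does on its own, here is a small sketch. The values of N and maxjobs are made up for illustration, and I round the number of chunks up with ceiling() so that it is a whole number.

```r
library(parallel)

N <- 10       # hypothetical small input, for illustration only
maxjobs <- 3  # hypothetical target for the chunk size

## Split 1:N into ceiling(N/maxjobs) contiguous chunks.
i.list <- splitIndices(N, ceiling(N / maxjobs))
chunk.sizes <- lengths(i.list)
print(chunk.sizes)

## Every index appears exactly once and in order, so filling
## result.list[i.vec] chunk by chunk gives the same order as lapply.
stopifnot(identical(unlist(i.list), seq_len(N)))
```

Each element of i.list is what gets passed to a single mclapply call, so the master process only has to track one small batch of jobs at a time.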

> > I am running mclapply with many iterations over a function that modifies
> > nothing and makes no copies of anything. It is taking up a lot of memory,
> > so it seems to me like this is a bug. Should I post this to
> > A minimal reproducible example can be obtained by first starting a memory
> > monitoring program such as htop, and then executing the following code
> > while looking at how much memory is being used by the system
> >
> > library(parallel)
> > seconds <- 5
> > N <- 100000
> > result.list <- mclapply(1:N, function(i)Sys.sleep(1/N*seconds))
> >
> > On my system, memory usage goes up about 60MB on this example. But it does
> > not go up at all if I change mclapply to lapply. Is this a bug?
> >
> > For a more detailed discussion with a figure that shows that the memory
> I'm not quite sure what is supposed to be the issue here. One would expect
> the memory used will be linear in the number of elements you process - by
> definition of the task, since you'll be creating linearly many more objects.
> Also using top doesn't actually measure the memory used by R itself (see
> FAQ 7.42).
> That said, I re-ran your script and it didn't look anything like what you
> have on your webpage.  For the NULL result you end up dealing with all the
> objects you create in your test that overshadow any memory usage and
> stabilizes after garbage-collection. As you would expect, any output of top
> is essentially bogus up to a gc. How much memory R will use is essentially
> governed by the level at which you set the gc trigger. In the real world you
> actually want that to be fairly high if you can afford it (in gigabytes,
> not megabytes), because you often get much higher performance by delaying
> gcs if you don't have low total memory (essentially using the memory as a
> buffer). Given that the usage is so negligible, it won't trigger any gc on
> its own, so you're just measuring accumulated objects - which will be
> always higher for mclapply because of the bookkeeping and serialization
> involved in the communication.
> The real difference is only in the df case. The reason for it is that your
> lapply() there is simply a no-op, because R is smart enough to realize that
> you are always returning the same object so it won't actually create
> anything and just return a reference back to df - thus using no memory at
> all. However, once you split the inputs, your main session can no longer
> perform this optimization because the processing is now in a separate
> process, so it has no way of knowing that you are returning the object
> unmodified. So what you are measuring is a special case that is arguably
> not really relevant in real applications.
> >
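
Regarding the point above that the output of top is unreliable before a gc: one way to measure from inside R instead is to reset the gc() counters, run the code, and read the "max used" figures afterwards. This is only a rough sketch. The helper peak.mb is a name I made up, it assumes a Unix-alike where mclapply can fork, and it reports the master R process only, not the forked children.

```r
library(parallel)

## Hypothetical helper, not from the thread: peak memory (MB) that
## gc() reports for this R process while evaluating expr.
peak.mb <- function(expr){
  gc(reset=TRUE)       # collect garbage, then reset the "max used" stats
  force(expr)          # evaluate the code being measured
  stats <- gc()        # collect again and read the counters
  j <- which(colnames(stats) == "max used")
  sum(stats[, j + 1])  # the "(Mb)" column next to "max used"
}

N <- 10000
f <- function(i) NULL
lapply.mb <- peak.mb(lapply(1:N, f))
mclapply.mb <- peak.mb(mclapply(1:N, f))
print(c(lapply=lapply.mb, mclapply=mclapply.mb))
```

The numbers depend on the R version and on what is already loaded in the session, so only the difference between the two calls is of interest.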

