This is the sort of thing that should be measured rather than
speculated about, but if you're using multicore, all those subsets can
be made at the same time rather than sequentially, so together they
add up to a copy of the whole data.  Using data.table rather than a
data.frame would help, of course.
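
As a rough starting point for measuring, something like the following
compares the size of the original data with the size of the list that
split() produces.  The data frame `dat` and its columns are made up for
illustration, and object.size() only reports the size of the objects
themselves, not what forked child processes end up using, so treat it
as a sketch rather than a real benchmark.

```r
# Hypothetical example data: one grouping column, one numeric column
set.seed(1)
dat <- data.frame(group = sample(letters[1:4], 1e5, replace = TRUE),
                  x = rnorm(1e5))

# Pre-split the whole data frame: one list element per group
pieces <- split(dat, dat$group)

# Compare the footprint of the original with that of the split copy
whole <- object.size(dat)
parts <- object.size(pieces)
print(whole, units = "MB")
print(parts, units = "MB")
```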

I would guess that splitting, garbage collecting, and then forking
would be most efficient -- reducing the chance that all the separate
processes end up separately garbage collecting the results of the
split.
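
A minimal sketch of that order of operations, untested for speed or
memory: it assumes mclapply() from the parallel package (the same
function the multicore package provides), and the data, the choice of
mean() and the number of cores are all made up for illustration.

```r
library(parallel)  # provides mclapply(); assumed here in place of 'multicore'

# Hypothetical example data
set.seed(1)
dat <- data.frame(group = sample(letters[1:4], 1000, replace = TRUE),
                  x = rnorm(1000))

# 1. Split once in the parent: a single pass makes all the subsets
pieces <- split(dat$x, dat$group)

# 2. Garbage collect in the parent before any child process exists
invisible(gc())

# 3. Fork and work on the pieces; forking is unavailable on Windows,
#    where mc.cores must be 1
cores <- if (.Platform$OS.type == "windows") 1L else 2L
ans <- mclapply(pieces, mean, mc.cores = cores)
```

The result is a list with one element per group, in the same order as
the pieces that split() returned.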

It's a pity that forking messes up the profilers; it makes it harder
to measure these things.

    -thomas


On Tue, Oct 11, 2011 at 9:14 AM, Joshua Wiley <jwiley.ps...@gmail.com> wrote:
> I could be way off base here, but my concern about presplitting the data is
> that you will have your data, and a second copy of your data that is something
> like a list where each element contains the portion of the data for that
> split.  Good speed-wise, bad memory-wise.  My hope with the technique I
> showed (again, I may not have accomplished it) was to have, at any one
> time, only the original data and a copy of the particular elements being
> worked with.  Of course, this is not an issue if you have plenty of memory.
>
> On Oct 10, 2011, at 12:19, Thomas Lumley <tlum...@uw.edu> wrote:
>
>> On Tue, Oct 11, 2011 at 7:54 AM, ivo welch <ivo.we...@gmail.com> wrote:
>>> hi josh---thx.  I had a different version of this, and discarded it
>>> because I think it was very slow.  the reason is that on each
>>> application, your version has to scan my (very long) data vector.  (I
>>> have many thousand different cases, too.)  I presume that by() has one
>>> scan through the vector that makes all splits.
>>
>> by.data.frame() is basically a wrapper for tapply(), and the key line
>> in tapply() is
>>   ans <- lapply(split(X, group), FUN, ...)
>> which should be easy to adapt for mclapply.
>>
>> --
>> Thomas Lumley
>> Professor of Biostatistics
>> University of Auckland
>



-- 
Thomas Lumley
Professor of Biostatistics
University of Auckland
