On Wednesday, 13 May 2015 at 12:16:19 UTC, weaselcat wrote:
On Wednesday, 13 May 2015 at 09:01:05 UTC, Gerald Jansen wrote:
On Wednesday, 13 May 2015 at 03:19:17 UTC, thedeemon wrote:
In the case of Python's parallel.Pool(), separate processes do the
work without any synchronization issues. In the case of D's
std.parallelism it's just threads inside one process, and they
do fight for some locks, hence this result.
Okay, so to do something equivalent I would need to use
std.process. My next question is how to pass the common data
to the sub-processes. In the Python approach I guess this is
handled automatically by pickle serialization. Is there
something similar in D? Alternatively, would using
std.mmfile to temporarily store the common data be a
reasonable approach?
Assuming you're on a POSIX-compliant platform, you would just
take advantage of fork()'s shared memory model and pipes - i.e.,
read the data, then fork in a loop to process it, then use
pipes to communicate. It ran about 3x faster for me by doing
this, and obviously scales with the workloads you have (the
provided data only seems to have 2). If you could provide a
larger dataset and the Python implementation, that would be
great.
I'm actually surprised and disappointed that there isn't a
fork() backend to std.process OR std.parallelism. You have to use
stdc.
Okay, more studying...
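Just to check that I follow the suggestion, here is a minimal (untested)
sketch of the fork()-and-pipes approach in D. loadCommonData(), runJob()
and the job list are stand-in names for the real code; the rest is plain
POSIX via core.sys.posix:

import core.sys.posix.unistd : fork, pipe, close, _exit,
                               posixRead = read, posixWrite = write;
import core.sys.posix.sys.wait : waitpid;
import std.stdio : writeln;
import std.string : representation;

double[] loadCommonData() { return [1.0, 2.0, 3.0]; }   // stand-in
string runJob(string job, double[] common)              // stand-in
{
    return job ~ ": done\n";
}

void main()
{
    auto common = loadCommonData();  // read the common data once, in the parent
    auto jobs = ["job1", "job2"];    // stand-in job list

    int[2][] childPipes;

    // fork one child per job; each child inherits a copy-on-write
    // view of `common`, so nothing needs to be re-read or serialized
    foreach (job; jobs)
    {
        int[2] fds;
        pipe(fds);
        if (fork() == 0)
        {
            // child: do the work and send the result back through the pipe
            close(fds[0]);
            auto result = runJob(job, common).representation;
            posixWrite(fds[1], result.ptr, result.length);
            close(fds[1]);
            _exit(0);
        }
        close(fds[1]);       // parent keeps only the read end
        childPipes ~= fds;
    }

    // collect the results and reap the children
    foreach (fds; childPipes)
    {
        char[4096] buf;      // a real version would loop on read
        auto n = posixRead(fds[0], buf.ptr, buf.length);
        if (n > 0)
            writeln(buf[0 .. n]);
        close(fds[0]);
        int status;
        waitpid(-1, &status, 0);
    }
}

If that is the right idea, only the results travel back through the pipes
and the common data never has to be copied or pickled at all.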
The Python implementation is part of a larger package, so it would
be a fair bit of work to provide a working version. Anyway, the
salient bits look like this:
from parallel import Pool

def run_job(args):
    (job, arr1, arr2) = args
    # ... do the work for each dataset

def main():
    # ... read common data and store in numpy arrays ...
    pool = Pool()
    pool.map(run_job, [(job, arr1, arr2) for job in jobs])
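For comparison, a rough std.parallelism counterpart of that pattern in D
(again with stand-in names, and the data reading elided) would be a
parallel foreach over the jobs:

import std.parallelism : parallel;

void runJob(string job, double[] arr1, double[] arr2)   // stand-in
{
    // ... do the work for each dataset ...
}

void main()
{
    double[] arr1, arr2;             // ... read common data once, in this process ...
    auto jobs = ["job1", "job2"];    // stand-in job list

    // worker threads inside one process: they can share arr1/arr2
    // directly, but (as noted above) they can also end up fighting
    // over locks, which is apparently what I was seeing
    foreach (job; parallel(jobs))
        runJob(job, arr1, arr2);
}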