On Mon, Aug 4, 2014 at 2:38 AM, Christian Stump
<[email protected]> wrote:
> Hi there,
>
> I wonder how to parallelize the following scenario.
>
> I have a method that initializes a (not very simple) data structure and then
> runs a for-loop (of, say, length 1,000-20,000) to populate that data
> structure with data. The computation in each loop iteration is not trivial, but
> fairly optimized using Cython. All iteration steps done serially take a few
> seconds (about 2 or 3). Nevertheless, the computations are fairly independent,
> and I would like to do them in parallel.
>
> If I extract the content of the for-loop into an @parallel(2) decorated
> function, it still seems to be using only one cpu to do the computation
> (why?),

It absolutely will use two additional *processes*, as you can see by
watching with htop, top, or ps.
Whether the operating system actually runs those processes in parallel
depends on many things: do you have two processors?  Is your user
allowed to use both of them fully (not the case by default on
SageMathCloud, say)?  Etc.
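As a quick check (a plain Python 3 sketch, independent of @parallel), you can ask how many CPUs the machine reports, and -- on Linux -- how many this particular process is actually allowed to use:

```python
import multiprocessing
import os

print("logical CPUs on the machine:", multiprocessing.cpu_count())

# On Linux, the scheduler affinity mask can be smaller than the machine's
# CPU count (e.g. in a restricted container or on shared hosting).
if hasattr(os, "sched_getaffinity"):
    print("CPUs usable by this process:", len(os.sched_getaffinity(0)))
```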

> but all the forking takes tons of time (i.e., including 80secs for
> posix.wait and 15 for posix.fork).

20,000 forks taking about 80 seconds sounds right: the fork system
call costs a few milliseconds, and a few milliseconds times 20,000 is
on the order of 80 seconds.
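You can measure the per-fork overhead on your own machine with a small POSIX-only sketch (the count of 200 cycles is arbitrary, just enough to average over):

```python
import os
import time

N = 200  # number of fork/wait cycles to time
start = time.time()
for _ in range(N):
    pid = os.fork()
    if pid == 0:
        os._exit(0)   # child exits immediately
    os.wait()         # parent waits for the child to finish
elapsed = time.time() - start
print("per fork+wait cycle: %.2f ms" % (1000 * elapsed / N))
```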

> If I read the documentation right, this
> is due to the issue that every computation is done in a subprocess itself
> and the data structure is also forked and passed to the subprocess. Is that
> correct?

Yes-ish.  To be clear, each subprocess is created by a single fork,
which means that (almost) all state of the parent process is
inherited by the subprocess.
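Here is a minimal POSIX-only sketch in plain Python (not Sage-specific) of what that inheritance means: the child can read data the parent built, with no pickling or explicit passing:

```python
import os

data = {"answer": 42}  # built in the parent before forking

pid = os.fork()
if pid == 0:
    # Child process: fork gives it a copy-on-write snapshot of the
    # parent's entire state, so `data` is available with no transfer.
    os._exit(0 if data["answer"] == 42 else 1)

_, status = os.waitpid(pid, 0)
child_saw_data = (os.WEXITSTATUS(status) == 0)
print("child inherited the data:", child_saw_data)
```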

> If I use @parallel('reference',2) instead (without knowing what that
> actually does), it is again as quick as in the beginning but also uses only
> a single cpu.

That fakes @parallel -- providing the same API -- but actually runs
everything serially in the parent process.  No forking or anything
else happens; it exists for testing and development purposes.
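The idea can be sketched in plain Python (an illustrative stand-in, not Sage's actual implementation): a decorator with the same calling convention as @parallel -- pass an iterable of inputs, get back ((args, kwargs), value) pairs -- that simply runs everything serially:

```python
def parallel_reference(f):
    # Same shape of output as @parallel, but no subprocesses at all:
    # each input is evaluated in the current process, in order.
    def wrapper(inputs):
        for args in inputs:
            yield ((args,), {}), f(args)
    return wrapper

@parallel_reference
def square(n):
    return n * n

results = list(square([1, 2, 3]))
```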

> What am I doing wrong here? Does anyone know how I should handle such a (I
> suspect not very uncommon) situation?

Break up your computation into far fewer than 20,000 separate steps,
then use @parallel.  For example, if your 20,000 steps are "compute
f(n)" for n in range(20000), instead do "compute f(1) through f(1000)
as the first step, then compute f(1001) through f(2000) as the next
step", etc.  That way the fork overhead is paid once per block of
1,000 computations instead of once per computation.

For example, if you wanted to use @parallel to factor the integers
[1..20000], you would do this:

@parallel
def f(m):
    # factor one block of 1,000 integers per subprocess
    return [factor(k) for k in range(1000*(m-1) + 1, 1000*m + 1)]

t = []
for x in f([1..20]):   # blocks 1..20 cover the integers 1..20000
    print(x[0])
    t.append(x)
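Outside Sage, the same chunking idea can be sketched with the standard multiprocessing module; trial_factor here is a naive stand-in for Sage's factor(), purely for illustration:

```python
from multiprocessing import Pool

def trial_factor(k):
    # naive trial-division factorization; a stand-in for Sage's factor()
    factors, d = [], 2
    while d * d <= k:
        while k % d == 0:
            factors.append(d)
            k //= d
        d += 1
    if k > 1:
        factors.append(k)
    return factors

def factor_block(m):
    # one worker process factors a whole block of 1,000 integers,
    # so the process-startup cost is paid per block, not per integer
    return [trial_factor(k) for k in range(1000*(m-1) + 1, 1000*m + 1)]

if __name__ == "__main__":
    with Pool(2) as pool:                              # two workers
        blocks = pool.map(factor_block, range(1, 21))  # covers 1..20000
```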

-- 
You received this message because you are subscribed to the Google Groups 
"sage-support" group.