In our present system of equations, we sweep 8 loosely coupled PDEs defined in 
5 meshes, until their residues die down to an acceptable value.

Upon profiling the code, it's clear that the sweep function is a major 
computational bottleneck. We tried MPI parallelism, but that didn't work out. 
But to side-step this, recognising the fact that the sweep() functions are 
computed one-by-one in serial, we propose an interesting work-around for an 
alternative style of parallelism.

Instead of dividing the control volumes among multiple CPUs, we instead propose 
to split the 8 different sweep calls (in the sweep loop within a time-step) to 
worker cores. Each sweep can then occur independently (instead of the python 
interpreter waiting for one sweep to finish before beginning another).  I 
suspect that this would be beneficial for those problems with 6-8 sweep calls, 
wherein the sweeps are CPU intensive. Thus, the cost of communication overhead 
between the master process to the child workers is offset by the benefits 
gained by independent sweeping.

>From my fairly underwhelming numerical computing knowledge, this looks like 
>analogous to a Jacobi-style implementation, rather than a Gauss-Seidel style 
>iteration, wherein the updated cell-variable from the sweep is immediately 
>available for all subsequent PDEs of the coupled system within the sweep-loop. 
> In the case of multi-processor sweep, the values are updated using values 
>from last sweep iteration.  Thus, this might be slower (but more stable) to 
>converge, isn't it ?  There is also a penalty incurred by the communication 
>overhead of opening and killing processes.

The questions are : 1) Has anybody tried this before ? 2) Does this sound 
problematic from a technical or practical perspective. 3) Does this sound 
remotely useful.

I have successfully implemented a multiprocessor parallelisation for 
examples.diffusion.coupled (shown below) . The values of the simulation results 
(for v0 & v1) after 100 time-steps are very very close to that of the serial 
sweep. Although this code is several times slower than the serial code, we 
tried this more as a concept demonstrator before embarking to spend significant 
effort to convert our serial code to multiprocessor approach.  If you see a red 
flag, it would be much appreciated if you can help us by pointing it out.

from fipy import Grid1D, CellVariable, TransientTerm, DiffusionTerm, Viewer
from multiprocessing import Process, Queue, cpu_count

m = Grid1D(nx=100, Lx=1.)

v0 = CellVariable(mesh=m, hasOld=True, value=0.5)
v1 = CellVariable(mesh=m, hasOld=True, value=0.5)

def boundaryConditions(m,v0,v1):
    v0.constrain(0, m.facesLeft)
    v0.constrain(1, m.facesRight)
    v1.constrain(1, m.facesLeft)
    v1.constrain(0, m.facesRight)

###### Multiprocessor implementation 
def get_res0(res0_queue,v0,v1):
    boundaryConditions(m, v0, v1)
    eq0 = TransientTerm() == DiffusionTerm(coeff=0.01) - v1.faceGrad.divergence
    res0 = eq0.sweep(var=v0, dt=1e-5)
def get_res1(res1_queue,v0,v1):
    boundaryConditions(m, v0, v1)
    eq1 = TransientTerm() == v0.faceGrad.divergence + DiffusionTerm(coeff=0.01)
    res1 = eq1.sweep(var=v1, dt=1e-5)

if __name__ == '__main__':
    print "You have " + str(cpu_count()) + " CPU cores in your machine"
    print "Simulation end time is 99 seconds.\n"
    res0_queue = Queue()
    res1_queue = Queue()
    for t in range(100):
        print "Computing the solution at time, t = " + str(t) + " sec"
        res0 = res1 = 1e6
        while max(res0,res1) > 0.01:
            process_a = Process(target=get_res0, args=(res0_queue,v0,v1))
            process_b = Process(target=get_res1, args=(res1_queue,v0,v1))
            while not res0_queue.empty():
                res0, v0 =  res0_queue.get()
            while not res1_queue.empty():
                res1, v1 =  res1_queue.get()
    print v0, v1
    print res0, res1

fipy mailing list
  [ NIST internal ONLY: https://email.nist.gov/mailman/listinfo/fipy ]

Reply via email to