Mohammad,
Short term, here is what you can do NOW.
PETSc wants to have one MPI process per core, so start up the program that
way (say there are 16 cores per node). In the block of code that does your
"non-MPI stuff", do
if ((rank % 16) == 0) {
   /* this code only runs on one MPI process per node:
      build your mesh grid data structure and process it,
      use OpenMP pragmas or whatever to parallelize that computation,
      and end up with one big data structure per node */
}
Have the (rank % 16) == 0 MPI processes send each (rank % 16) == j MPI
process the part of the grid information it needs, then have the
(rank % 16) == 0 MPI processes delete the global big data structure they
built.
The rest of the program runs as a regular MPI PETSc program.
The advantages of this approach are 1) it will run today and 2) it doesn't
depend on any fragile OS features or software. The disadvantage is that you
need to figure out what part of the grid data each process needs and ship it
from the (rank % 16) == 0 MPI processes.
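The distribution step could look roughly like the sketch below. This is
only to show the shape of the idea; the 16 ranks per node laid out
contiguously, the fixed chunk size, and the toy array standing in for your
grid data are all assumptions you would replace with your real grid code.

#include <mpi.h>
#include <stdlib.h>

#define CORES_PER_NODE 16   /* assumption: 16 MPI ranks per node, contiguous */

int main(int argc, char **argv)
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int     leader = (rank / CORES_PER_NODE) * CORES_PER_NODE; /* first rank on my node */
  int     chunk  = 1000;    /* pretend each rank needs 1000 grid values */
  double *mypart = malloc(chunk * sizeof(double));

  if (rank == leader) {
    int nlocal = CORES_PER_NODE;
    if (leader + nlocal > size) nlocal = size - leader;

    /* node leader: build the whole grid data structure; this is where the
       real (possibly OpenMP-threaded) grid generation would go */
    double *global = malloc((size_t)nlocal * chunk * sizeof(double));
    for (int i = 0; i < nlocal * chunk; i++) global[i] = (double)i;

    /* ship each other rank on this node only the part it needs */
    for (int j = 1; j < nlocal; j++)
      MPI_Send(global + (size_t)j * chunk, chunk, MPI_DOUBLE,
               leader + j, 0, MPI_COMM_WORLD);

    /* keep the leader's own part, then delete the big structure */
    for (int i = 0; i < chunk; i++) mypart[i] = global[i];
    free(global);
  } else {
    /* every other rank receives just its piece from the node leader */
    MPI_Recv(mypart, chunk, MPI_DOUBLE, leader, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
  }

  /* ... from here on everything runs as an ordinary MPI/PETSc program ... */

  free(mypart);
  MPI_Finalize();
  return 0;
}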
Barry
On Apr 20, 2012, at 1:31 PM, Mohammad Mirzadeh wrote:
> Hi guys,
>
> I have seen multiple emails regarding this on the mailing list and I'm afraid
> you might have already answered this question, but I'm not quite sure!
>
> I have objects in my code that are hard(er) to parallelize using MPI, and so
> far my strategy has been to just handle them in serial such that each process
> has a copy of the whole thing. This object is related to my grid
> generation/information etc., so it only needs to be built once at the beginning
> (no moving mesh for NOW). As a result I do not care much about the speed
> since it's nothing compared to the overall solution time. However, I do care
> about the memory that this object consumes, since it can limit my problem size.
>
> So I had the following idea the other day. Is it possible/a good idea to
> parallelize the grid generation using OpenMP so that each node (as opposed to
> each core) would share the data structure? This could save me a lot since
> memory on a node is shared among its cores (e.g. 32 GB/node vs 2 GB/core on
> Ranger). What I'm not quite sure about is how the job is scheduled when
> running the code via mpirun -n Np. Should Np be the total number of cores or nodes?
>
> If I use, say, Np = 16 processes on one node, MPI is running 16 copies of
> the code on a single node (which has 16 cores). How does OpenMP figure out
> how to fork? Does it fork a total of 16 threads/MPI process = 256 threads, or
> is it smart enough to fork just 16 threads/node = 1 thread/core = 16
> threads? I'm a bit confused about how the job is scheduled when MPI and
> OpenMP are mixed.
>
> Do I make any sense at all?!
>
> Thanks