On Tuesday 16 June 2009 02:29:14 pm Matthew Knepley wrote:
> On Tue, Jun 16, 2009 at 1:13 PM, Alex Peyser <peyser.alex at gmail.com> wrote:
> > On Tuesday 16 June 2009 01:53:35 pm Matthew Knepley wrote:
> > > On Tue, Jun 16, 2009 at 12:38 PM, xiaoyin ji
> > > <sapphire.jxy at gmail.com> wrote:
> > >
> > > Hi there,
> > >
> > > I'm using PETSc MATMPIAIJ matrices and the KSP solvers. It seems that
> > > PETSc runs noticeably faster if I set the number of CPUs close to the
> > > number of compute nodes in the job file. By default an MPIAIJ matrix is
> > > distributed across processors, and the KSP solver communicates at every
> > > step; however, since several CPUs on each node share the same memory
> > > while KSP may still try to communicate through the network card, this
> > > may hurt performance. Is there any way to detect which CPUs are sharing
> > > the same memory? Thanks a lot.
> > >
> > > Best,
> > > Xiaoyin Ji
> > >
> > > The interface for this is mpirun or the job submission mechanism.
> > >
> > >   Matt
> > >
> > > --
> > > What most experimenters take for granted before they begin their
> > > experiments is infinitely more interesting than any results to which
> > > their experiments lead. -- Norbert Wiener
> >
> > I had a question on what is the best approach for this. Most of the time
> > is spent inside of BLAS, correct? So wouldn't you maximize your
> > operations by running one MPI/PETSc job per board (per shared memory),
> > and using a multi-threaded BLAS that matches your board? You should cut
> > down communication by a factor proportional to the number of threads per
> > board, and the BLAS itself should better optimize most of your
> > operations across the board, rather than relying on higher-order
> > parallelism.
> >
> > Regards,
> > Alex Peyser
>
> This is a common misconception. In fact, most time is spent in MatVec or
> BLAS1 operations, neither of which benefits from a multi-threaded BLAS.
>
>   Matt
Interesting. At least my misconception is common.

That makes things tricky with ATLAS, since the number of threads is a
compile-time constant. I can't imagine it would be a good idea to have an
8-thread BLAS running in eight MPI processes simultaneously -- unless the
MPI jobs were all unsynchronized. It may be only 10-20% of the time, but
that is still a large overlap of conflicting threads degrading performance.
I'll have to do some benchmarks. Is the 10-20% figure still true for fairly
dense matrices?

Ah, another layer of administration code may now be required to properly
allocate jobs.

Alex
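For concreteness, Matt's "the interface for this is mpirun" could look like
the following. The flags are Open MPI's (other MPI implementations and batch
schedulers have their own equivalents), and ./app stands in for your PETSc
application, so treat the exact names as assumptions rather than a
prescription:

    # one rank per node, leaving the remaining cores to a threaded BLAS
    mpirun -npernode 1 -np 4 ./app

    # one rank per core, paired with a single-threaded BLAS
    mpirun -npernode 8 -np 32 ./app

The first layout matches the one-job-per-board scheme discussed above; the
second is the conventional one-rank-per-core layout.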
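As for the original question of detecting which ranks share memory: MPI-3,
which postdates this thread, added MPI_Comm_split_type for exactly this
purpose. A minimal sketch in C, assuming an MPI-3 implementation is
available:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Split MPI_COMM_WORLD into one communicator per shared-memory
           node (MPI_COMM_TYPE_SHARED is an MPI-3 feature). */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);

        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);

        printf("world rank %d: rank %d of %d on this node\n",
               world_rank, node_rank, node_size);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }

Each rank then knows how many neighbors share its node, which is the
information needed to decide how many BLAS threads per process make sense.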
