On Tue, 16 Jun 2009, Alex Peyser wrote:

> I had a question on what is the best approach for this. Most of the time is
> spent inside of BLAS, correct?
Not really. PETSc uses a bit of blas1 operations - those should perhaps account for around 10-20% of runtime [depending upon the application. Check the Vec operations in -log_summary; they are usually blas calls].

> So wouldn't you maximize your operations by running one MPI/PETSc job per
> board (per shared memory), and use a multi-threaded BLAS that matches your
> board? You should cut down communications by some factor proportional to
> the number of threads per board, and the BLAS itself should better optimize
> most of your operations across the board, rather than relying on higher
> order parallelisms.

If the issue is memory bandwidth - then it affects threads and processes [MPI] equally. And if the algorithm needs some data sharing - there is a cost to explicit communication [MPI] vs implicit data sharing [shared memory] due to cache conflicts and the other synchronization that's required. There could be implementation inefficiencies between threads vs procs, MPI vs OpenMP, that might tilt things in favor of one approach or the other - but I don't think it would be by a big margin.

Satish
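A back-of-the-envelope sketch of why a memory-bandwidth-bound blas1 kernel gains nothing from extra threads or processes on the same board (the bandwidth figure below is an assumed example, not a measurement of any particular machine):

```python
def axpy_flop_limit(bandwidth_gbs, bytes_per_entry=8.0):
    """Peak GFLOP/s for y = a*x + y when limited by memory bandwidth.

    Each vector entry requires 2 reads + 1 write (3 * 8 bytes for
    doubles) while performing 2 flops (one multiply, one add), i.e.
    12 bytes moved per flop. That ceiling is set by the memory bus,
    so it is the same no matter how many threads or MPI processes
    share the board.
    """
    bytes_per_flop = 3.0 * bytes_per_entry / 2.0  # 12 bytes/flop
    return bandwidth_gbs / bytes_per_flop

# Example: a board with an assumed 10 GB/s of memory bandwidth tops
# out under 1 GFLOP/s on axpy, far below the cores' arithmetic peak.
print(axpy_flop_limit(10.0))
```

The same arithmetic applies to the other Vec (blas1) operations, which is why their cost shows up as a bandwidth problem rather than a parallelization problem.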
