Dear Jed,

For multithreaded support in PETSc, my question is whether KSP and/or PC work when Vec and Mat run in multithreaded mode.
Thanks,
Yujie

On Sun, Feb 26, 2012 at 6:39 PM, Jed Brown <jedbrown at mcs.anl.gov> wrote:

> On Sun, Feb 26, 2012 at 04:07, Gerard Gorman <g.gorman at imperial.ac.uk> wrote:
>
>>> Did you post a repository yet? I'd like to have a look at the code.
>>
>> It's on Barry's favourite collaborative software development site, of
>> course ;-)
>>
>> https://bitbucket.org/wence/petsc-dev-omp/overview
>
> I looked through the code and I'm concerned that all the OpenMP code is
> inlined into vec/impls/seq and mat/impls/aij/seq with, as far as I can
> tell, no way to use OpenMP for some objects in a simulation but not
> others. I think all the pragmas should have num_threads(x->nthreads)
> clauses. We can compute the correct number of threads based on sizes when
> memory is allocated (or have it specified through command-line options,
> inherited from related objects, etc.).
>
> I don't think we can avoid the OpenMP scheduling overhead (several hundred
> cycles) even for those objects we choose not to thread, but (at least in
> my gcc tests) that overhead is only about a third of the cost of actually
> starting a parallel region.
>
> It's really not acceptable to insert an unguarded
>
>   #pragma omp ...
>
> into the code, because it will generate tons of warnings or errors with
> compilers that don't know about OpenMP. It would be better to test for
> _Pragma support and use
>
>   #define PetscPragmatize(x) _Pragma(#x)
>   #if defined(PETSC_USE_OPENMP)
>   #  define PetscPragmaOMP(x) PetscPragmatize(omp x)
>   #else
>   #  define PetscPragmaOMP(x)
>   #endif
>
> and then write
>
>   PetscPragmaOMP(parallel for ...)
>
> We should probably use a variant for object-based threading:
>
>   #define PetscPragmaOMPObject(obj,x) PetscPragmaOMP(x num_threads((obj)->nthreads))
>
> When multiple objects are involved, I think you usually want the object
> being modified to control the number of threads.
>
> In many cases, I would prefer more control over the partition of the loop.
> For example, I'd often be willing to tolerate a slight computational
> imbalance between threads in exchange for working exclusively within my
> own page. Note that the arithmetic needed to compute such a partition is
> orders of magnitude cheaper than the schedule/distribution to threads. I
> don't know how to do that except with
>
>   PetscPragmaOMP(parallel)
>   {
>     int nthreads = omp_get_num_threads();
>     int tnum     = omp_get_thread_num();
>     int start, end;
>     /* compute start and end */
>     for (int i = start; i < end; i++) {
>       /* the work */
>     }
>   }
>
> We could perhaps capture some of this common logic in a macro:
>
>   #define VecOMPParallelBegin(X,args) do { \
>     PetscPragmaOMPObject(X,parallel args) { \
>     PetscInt _start,_end; \
>     VecOMPGetThreadLocalPart(X,&_start,&_end); \
>     { do {} while (0)
>
>   #define VecOMPParallelEnd() }}} while (0)
>
> used as
>
>   VecOMPParallelBegin(X, shared/private ...);
>   {
>     PetscInt i;
>     for (i = _start; i < _end; i++) {
>       /* the work */
>     }
>   }
>   VecOMPParallelEnd();
>
> That should give us complete run-time control of the number of parallel
> threads per object and of their distribution, within the constraint of a
> contiguous thread partition. It also leaves open the possibility of using
> libnuma to query and migrate pages. (For example, a short vector that
> needs to be accessed from multiple NUMA nodes might intentionally be
> faulted with its pages spread apart, even though other vectors of similar
> size might be accessed from within one NUMA node and thus not use threads
> at all. One 4 KiB page is only 512 doubles, but if the memory is local to
> a single NUMA node, we wouldn't use threads until the vector length was 4
> to 8 times larger.)
