Hi Michael,
> We have recently been trying to re-align our OpenMP fork
> (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master. Much of
> our early work has now been superseded by the threadcomm
> implementations. Nevertheless, there are still a few algorithmic
> differences between the two branches:
>
> 1) Enforcing MPI latency hiding by using task-based spMV:
> If the MPI implementation used does not actually provide truly
> asynchronous communication in hardware, performance can be increased by
> dedicating a single thread to overlapping MPI communication in PETSc.
> However, this is arguably a vendor-specific fix which requires
> significant code changes (i.e. the parallel section needs to be raised
> by one level). So perhaps the strategy should be to give guilty vendors
> a hard time rather than messing up the current abstraction.

When using good preconditioners, spMV is essentially never the
bottleneck and hence I don't think a separate communication thread
should be implemented in PETSc. Instead, such a fallback should be part
of a good MPI implementation.
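
Just to make sure we are talking about the same pattern: below is a rough
sketch (plain C with OpenMP, made-up names, not actual PETSc or threadcomm
API) of what I understand by the dedicated communication thread, where one
thread drives MPI progress while the others work on the local part of the
matrix:

#include <mpi.h>
#include <omp.h>

/* Sketch only: thread 0 forces progress on the ghost-value exchange while
 * the remaining threads multiply the purely local (diagonal) part.
 * local_spmv, offdiag_spmv and reqs are placeholders.  Requires at least
 * MPI_THREAD_FUNNELED, since only the main thread makes MPI calls. */
void spmv_with_comm_thread(int nreq, MPI_Request *reqs,
                           void (*local_spmv)(int worker, int nworkers),
                           void (*offdiag_spmv)(void))
{
  #pragma omp parallel
  {
    int tid      = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    if (tid == 0) {
      /* dedicated communication thread: wait on (and thereby progress)
       * the outstanding ghost-value receives */
      MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    } else {
      /* worker threads: y = A_diag * x on their slice of the rows */
      local_spmv(tid - 1, nthreads - 1);
    }
  } /* implicit barrier: ghost values have arrived */
  offdiag_spmv(); /* finish with y += A_offdiag * x_ghost */
}
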
> 2) Nonzero-based thread partitioning:
> Rather than evenly dividing the number of rows among threads, we can
> partition the thread ownership ranges according to the number of
> non-zeros in each row. This balances the workload between threads and
> thus increases strong scalability due to optimised bandwidth
> utilisation. In general, this optimisation should integrate well with
> threadcomms, since it only changes the thread ownership ranges, but it
> does require some structural changes, since nnz is currently not passed
> to PetscLayoutSetUp. Any thoughts on whether people regard such a scheme
> as useful would be greatly appreciated.

This is a reasonable optimization; I used a similar strategy for sparse
matrices on the GPU. Others should comment on whether the interface
change to PetscLayoutSetUp is acceptable.
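
For reference, the splitting itself is cheap once the row pointers are
available; something along the following lines (a plain sketch with made-up
names, not PETSc code) turns a CSR row pointer array into nnz-balanced
thread ownership ranges:

/* ranges has nthreads+1 entries; thread t owns rows [ranges[t], ranges[t+1]).
 * ai is the CSR row pointer array with ai[0] == 0 and ai[m] == total nnz. */
void nnz_balanced_ranges(int m, const int *ai, int nthreads, int *ranges)
{
  int nnz = ai[m];
  int row = 0;
  ranges[0] = 0;
  for (int t = 1; t < nthreads; t++) {
    /* first row whose nnz prefix reaches the t-th equal share of nnz */
    int target = (int)(((long long)nnz * t) / nthreads);
    while (row < m && ai[row] < target) row++;
    ranges[t] = row;
  }
  ranges[nthreads] = m;
}

The resulting ranges would then replace the current uniform row split,
which is presumably where the PetscLayoutSetUp change comes in.
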
> 3) MatMult_SeqBAIJ not threaded:
> Is there a reason why MatMult has not been threaded for BAIJ matrices,
> or is somebody already working on this? If not, I would like to prepare
> a pull request for this using the same approach as MatMult_SeqAIJ.

To my knowledge, it 'simply hasn't been implemented yet'. A pull request
would be nice; I'm happy to review it.
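
The kernel itself should be straightforward. Roughly something like the
sketch below, with OpenMP standing in for the threadcomm kernel launch and
a row-major bs x bs block layout assumed purely for illustration (the
actual BAIJ storage may differ):

#include <stddef.h>

/* Sketch of a thread-parallel block-CSR MatMult: mb block rows of size bs,
 * block row pointers ai, block column indices aj, dense bs*bs blocks in aa. */
void blockcsr_matmult(int mb, int bs, const int *ai, const int *aj,
                      const double *aa, const double *x, double *y)
{
  #pragma omp parallel for schedule(static)
  for (int ib = 0; ib < mb; ib++) {               /* block row ib */
    double *yb = &y[(size_t)ib * bs];
    for (int k = 0; k < bs; k++) yb[k] = 0.0;
    for (int j = ai[ib]; j < ai[ib + 1]; j++) {
      const double *blk = &aa[(size_t)j * bs * bs]; /* bs x bs block */
      const double *xb  = &x[(size_t)aj[j] * bs];
      for (int r = 0; r < bs; r++)
        for (int c = 0; c < bs; c++)
          yb[r] += blk[r * bs + c] * xb[c];
    }
  }
}

Splitting the block rows by nonzeros as in 2), instead of the static
schedule, would of course combine both points.
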
Best regards,
Karli