Re: [petsc-dev] Hybrid MPI/OpenMP reflections

Michael Lange Fri, 09 Aug 2013 06:32:52 -0700

Hi,

I just had a look at the threaded version of MatMult_SeqAIJ and I think the Flops logging might be incorrect, because the nonzerorows aren't counted in MatMult_SeqAIJ_Kernel. Fixing this in the thread kernel would require a reduction though, which could impact performance. Is this a known problem, or is there a better way to compute Flops that doesn't require the nonzerorows?

Alternatively, would it make sense to pre-compute the nonzerorows and store them in the threadcomm? This might require more of the AIJ data structure to be exposed to PetscLayoutSetUp / PetscThreadCommGetOwnershipRanges though.
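
To make the two options concrete, here is a minimal, PETSc-free sketch in plain C. It assumes the usual CSR flop convention of 2*nz - nonzerorows (a row with n > 0 entries costs n multiplies and n - 1 adds). All function names are hypothetical and are not the PETSc kernels; the per-range return value stands in for the per-thread partial count that a reduction would have to sum.

```c
#include <assert.h>

/* Hypothetical stand-in for a per-thread SpMV kernel: computes y = A*x for
   rows [start, end) of a CSR matrix (row pointers ai, column indices aj,
   values aa) and returns the number of non-empty rows in that range. */
int spmv_rows(const int *ai, const int *aj, const double *aa,
              const double *x, double *y, int start, int end)
{
  int nonzerorows = 0;
  assert(start <= end);
  for (int i = start; i < end; i++) {
    double sum = 0.0;
    for (int k = ai[i]; k < ai[i + 1]; k++) sum += aa[k] * x[aj[k]];
    y[i] = sum;
    nonzerorows += (ai[i + 1] > ai[i]); /* partial count, summed by the caller */
  }
  return nonzerorows;
}

/* Alternative: count non-empty rows from the row pointers alone. This only
   depends on the sparsity pattern, so it could be done once at setup time
   and cached per thread instead of being reduced in every MatMult. */
int count_nonzero_rows(const int *ai, int start, int end)
{
  int count = 0;
  assert(start <= end);
  for (int i = start; i < end; i++) count += (ai[i + 1] > ai[i]);
  return count;
}

/* Assumed flop convention: n multiplies and n - 1 adds per non-empty row. */
double spmv_flops(int nz, int nonzerorows)
{
  return 2.0 * nz - nonzerorows;
}
```

Calling spmv_rows once per thread range and adding up the return values gives the same count as count_nonzero_rows over the full row range, so the cached variant would avoid the reduction in the hot path as long as the sparsity pattern does not change.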


Regards,
Michael

On 08/08/13 12:08, Matthew Knepley wrote:

On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange wrote:


    Hi,

    We have recently been trying to re-align our OpenMP fork
    (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master.
    Much of our early work has now been superseded by the threadcomm
    implementations. Nevertheless, there are still a few algorithmic
    differences between the two branches:

    1) Enforcing MPI latency hiding by using task-based spMV:
    If the MPI implementation used does not actually provide truly
    asynchronous communication in hardware, performance can be
    increased by dedicating a single thread to overlapping MPI
    communication in PETSc. However, this is arguably a
    vendor-specific fix which requires significant code changes (i.e.
    the parallel section needs to be raised up by one level). So
    perhaps the strategy should be to give guilty vendors a hard time
    rather than messing up the current abstraction.

    2) Nonzero-based thread partitioning:
    Rather than evenly dividing the number of rows among threads, we
    can partition the thread ownership ranges according to the number
    of non-zeros in each row. This balances the work load between
    threads and thus increases strong scalability due to optimised
    bandwidth utilisation. In general, this optimisation should
    integrate well with threadcomms, since it only changes the thread
    ownership ranges, but it does require some structural changes
    since nnz is currently not passed to PetscLayoutSetUp. Any
    thoughts on whether people regard such a scheme as useful would be
    greatly appreciated.
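
A minimal sketch of the nonzero-based partitioning described in point 2, in plain C and independent of PETSc. The function name and the trange array are hypothetical; the only input assumed is the CSR row-pointer array ai with nrows + 1 entries. Each thread boundary is placed at the first row whose starting offset reaches that thread's share of the total nonzeros, which is one simple heuristic among several.

```c
#include <assert.h>

/* Fill trange[0..nthreads] so that thread t owns rows
   [trange[t], trange[t+1]) and each thread gets roughly nz/nthreads
   nonzeros. ai is the CSR row-pointer array (ai[nrows] == nz). */
void partition_rows_by_nnz(const int *ai, int nrows, int nthreads, int *trange)
{
  int row = 0;
  long long nz;
  assert(nrows >= 0 && nthreads > 0);
  nz = ai[nrows];
  trange[0] = 0;
  for (int t = 1; t < nthreads; t++) {
    /* Number of nonzeros that should lie before thread t's first row. */
    long long target = (nz * t) / nthreads;
    /* Advance to the first row boundary at or past the target. */
    while (row < nrows && ai[row] < target) row++;
    trange[t] = row;
  }
  trange[nthreads] = nrows;
}
```

For example, with row pointers {0, 6, 7, 8, 9, 10, 12} (6 rows, 12 nonzeros) and 2 threads, an even row split gives 8 and 4 nonzeros per thread, whereas this sketch yields the ranges [0, 1) and [1, 6) with 6 nonzeros each.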

I think this should be handled by changing the AIJ data structure. Going all the way to "2D" partitions would also allow us to handle power-law matrix graphs. This would keep the thread implementation simple, and at the same time be more flexible.

   Matt

    3) MatMult_SeqBAIJ not threaded:
    Is there a reason why MatMult has not been threaded for BAIJ
    matrices, or is somebody already working on this? If not, I would
    like to prepare a pull request for this using the same approach as
    MatMult_SeqAIJ.

    We would welcome any suggestions/feedback on this, in particular
    the second point. Up to date benchmarking results for the first
    two methods, including BlueGene/Q, can be found in:
    http://arxiv.org/abs/1307.4567

    Kind regards,

    Michael Lange




--

What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.

-- Norbert Wiener
