Michael,
Thanks for reporting this. I have refactored the code to compute the
nonzero row count in MatAssemblyEnd_SeqAIJ() and to use it when logging the
flops, so it now works with the threaded code. The same fix is applied for
SeqBAIJ matrices. It is undergoing code review at
undergoing code review at
https://bitbucket.org/petsc/petsc/pull-request/77/fix-matmult_seqaij-flop-count-and-simplify
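The gist of the change, as a sketch only (names simplified; the actual diff
is in the pull request above):

    /* counted once at assembly time instead of inside the mult kernel;
       ai is the CSR row-pointer array of the assembled matrix */
    static int CountNonzeroRows(int m, const int *ai)
    {
      int i, cnt = 0;
      for (i = 0; i < m; i++)
        if (ai[i+1] > ai[i]) cnt++;
      return cnt; /* MatMult then logs 2*nz - cnt flops, with no scan */
    }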
Barry
On Aug 9, 2013, at 8:31 AM, Michael Lange <[email protected]> wrote:
> Hi,
>
> I just had a look at the threaded version of MatMult_SeqAIJ and I think the
> Flops logging might be incorrect, because the nonzero rows aren't counted
> in MatMult_SeqAIJ_Kernel. Fixing this in the thread kernel would require a
> reduction, though, which could impact performance. Is this a known problem,
> or is there a better way to compute the Flops that doesn't require the
> nonzero row count?
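>
> For context, a self-contained sketch of the count in question (plain CSR
> with OpenMP, not the actual PETSc kernel): a row with k nonzeros costs k
> multiplies and k-1 adds, so the total is 2*nnz - nonzerorows, and a
> threaded kernel would need something like the reduction below to recover
> the row count:
>
>   static long SpMVCountNonzeroRows(int m, const int *ia, const int *ja,
>                                    const double *a, const double *x,
>                                    double *y)
>   {
>     long nonzerorows = 0;
>     #pragma omp parallel for reduction(+:nonzerorows)
>     for (int i = 0; i < m; i++) {
>       double sum = 0.0;
>       for (int j = ia[i]; j < ia[i+1]; j++) sum += a[j]*x[ja[j]];
>       if (ia[i+1] > ia[i]) nonzerorows++; /* the count in question */
>       y[i] = sum;
>     }
>     return nonzerorows; /* log 2*nnz - nonzerorows flops */
>   }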
>
> Alternatively, would it make sense to pre-compute the nonzero rows and store
> them in the threadcomm? This might require more of the AIJ data structure to
> be exposed to PetscLayoutSetUp / PetscThreadCommGetOwnershipRanges, though.
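>
> A sketch of what I mean, with made-up names (trstarts standing in for the
> threadcomm ownership ranges, tnzrows for the stored counts):
>
>   /* once per assembly: per-thread nonzero-row counts, so the mult
>      kernel for thread t can log its Flops without any reduction */
>   static void PrecomputeThreadNzRows(int nthreads, const int *trstarts,
>                                      const int *ia, int *tnzrows)
>   {
>     for (int t = 0; t < nthreads; t++) {
>       int cnt = 0;
>       for (int i = trstarts[t]; i < trstarts[t+1]; i++)
>         if (ia[i+1] > ia[i]) cnt++;
>       tnzrows[t] = cnt;
>     }
>   }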
>
> Regards,
> Michael
>
> On 08/08/13 12:08, Matthew Knepley wrote:
>> On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange <[email protected]>
>> wrote:
>> Hi,
>>
>> We have recently been trying to re-align our OpenMP fork
>> (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master. Much of our
>> early work has now been superseded by the threadcomm implementations.
>> Nevertheless, there are still a few algorithmic differences between the two
>> branches:
>>
>> 1) Enforcing MPI latency hiding by using task-based spMV:
>> If the MPI implementation in use does not actually provide truly
>> asynchronous communication in hardware, performance can be increased by
>> dedicating a single thread to overlapping MPI communication in PETSc.
>> However, this is arguably a vendor-specific fix that requires significant
>> code changes (i.e. the parallel section needs to be raised up by one
>> level). So perhaps the strategy should be to give guilty vendors a hard
>> time rather than messing up the current abstraction.
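>>
>> To make the pattern concrete, here is a minimal sketch (plain MPI/OpenMP,
>> not the actual fork code; it assumes the sends/receives were posted
>> before entry and that at least two threads are running):
>>
>>   #include <mpi.h>
>>   #include <omp.h>
>>
>>   static void SpMVLocalWithCommThread(int m, const int *ia, const int *ja,
>>                                       const double *a, const double *x,
>>                                       double *y, MPI_Request *reqs,
>>                                       int nreqs)
>>   {
>>     #pragma omp parallel
>>     {
>>       int tid = omp_get_thread_num();
>>       int nth = omp_get_num_threads();
>>       if (tid == 0) {
>>         /* dedicated communication thread */
>>         MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
>>       } else {
>>         /* partition rows by hand over threads 1..nth-1; an "omp for"
>>            here would be non-conforming, since thread 0 skips it */
>>         int w = tid - 1, nw = nth - 1;
>>         int lo = (int)((long)m * w / nw);
>>         int hi = (int)((long)m * (w + 1) / nw);
>>         for (int i = lo; i < hi; i++) {
>>           double sum = 0.0;
>>           for (int j = ia[i]; j < ia[i+1]; j++) sum += a[j]*x[ja[j]];
>>           y[i] = sum;
>>         }
>>       }
>>     } /* implicit barrier: communication and local rows both complete */
>>   }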
>>
>> 2) Nonzero-based thread partitioning:
>> Rather than dividing the rows evenly among threads, we can partition the
>> thread ownership ranges according to the number of non-zeros in each row.
>> This balances the workload between threads and thus improves strong
>> scalability through better bandwidth utilisation. In general, this
>> optimisation should integrate well with threadcomms, since it only changes
>> the thread ownership ranges, but it does require some structural changes,
>> since nnz is currently not passed to PetscLayoutSetUp. Any thoughts on
>> whether people regard such a scheme as useful would be greatly appreciated.
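>>
>> As an illustration, such ranges are cheap to compute from the CSR row
>> pointer (a sketch with made-up names; this is exactly the nnz information
>> that PetscLayoutSetUp never sees at the moment):
>>
>>   /* rowstart has length nthreads+1; thread t owns rows
>>      [rowstart[t], rowstart[t+1]), each holding ~nnz/nthreads nonzeros */
>>   static void NnzBalancedRanges(int m, const int *ia, int nthreads,
>>                                 int *rowstart)
>>   {
>>     int row = 0;
>>     rowstart[0] = 0;
>>     for (int t = 1; t < nthreads; t++) {
>>       long target = (long)ia[m] * t / nthreads; /* ideal nnz boundary */
>>       while (row < m && ia[row] < target) row++;
>>       rowstart[t] = row;
>>     }
>>     rowstart[nthreads] = m;
>>   }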
>>
>> I think this should be handled by changing the AIJ data structure. Going
>> all the way to "2D" partitions would also allow us to handle power-law
>> matrix graphs. This would keep the thread implementation simple, and at
>> the same time be more flexible.
>>
>> Matt
>>
>> 3) MatMult_SeqBAIJ not threaded:
>> Is there a reason why MatMult has not been threaded for BAIJ matrices, or is
>> somebody already working on this? If not, I would like to prepare a pull
>> request for this using the same approach as MatMult_SeqAIJ.
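>>
>> For reference, a threaded kernel for a generic block-CSR format with 2x2
>> blocks stored column-major within each block (a sketch only; the actual
>> SeqBAIJ layout should be double-checked before reusing this):
>>
>>   /* ia/ja index 2x2 blocks, not scalars; a holds 4 values per block */
>>   static void BCSR2Mult(int mb, const int *ia, const int *ja,
>>                         const double *a, const double *x, double *y)
>>   {
>>     #pragma omp parallel for schedule(static)
>>     for (int ib = 0; ib < mb; ib++) {
>>       double s0 = 0.0, s1 = 0.0;
>>       for (int j = ia[ib]; j < ia[ib+1]; j++) {
>>         const double *v  = a + 4*j;      /* column-major 2x2 block */
>>         const double *xb = x + 2*ja[j];
>>         s0 += v[0]*xb[0] + v[2]*xb[1];
>>         s1 += v[1]*xb[0] + v[3]*xb[1];
>>       }
>>       y[2*ib]   = s0;
>>       y[2*ib+1] = s1;
>>     }
>>   }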
>>
>> We would welcome any suggestions/feedback on this, in particular on the
>> second point. Up-to-date benchmarking results for the first two methods,
>> including BlueGene/Q, can be found at:
>> http://arxiv.org/abs/1307.4567
>>
>> Kind regards,
>>
>> Michael Lange
>>
>>
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener
>