On Aug 9, 2013, at 8:31 AM, Michael Lange <[email protected]> wrote:
> Hi,
>
> I just had a look at the threaded version of MatMult_SeqAIJ and I think the
> Flops logging might be incorrect, because the nonzerorows aren't counted in
> MatMult_SeqAIJ_Kernel. Fixing this in the thread kernel would require a
> reduction though, which could impact performance. Is this a known problem, or
> is there a better way to compute Flops that doesn't require the nonzero
> rows?

Generally there should be some nonzeros in each row, so we could probably just
use 2.0*a->nz - m.

   Barry

The reason for the nonzerorow counter goes back many years, to when we did not
handle the usecprow case. The "off-diagonal" part of MPI matrices has many
empty rows (most of them), and we wanted the count to take that into account.
Now only the nonzero rows are handled, so I think we can remove nonzerorow
from the code completely. (A sketch contrasting the two counts is appended as
the first example after the thread below.)

> Alternatively, would it make sense to pre-compute the nonzerorows and store
> them in the threadcomm? This might require more of the AIJ data structure to
> be exposed to PetscLayoutSetUp / PetscThreadCommGetOwnershipRanges, though.
>
> Regards,
> Michael
>
> On 08/08/13 12:08, Matthew Knepley wrote:
>> On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange <[email protected]>
>> wrote:
>> Hi,
>>
>> We have recently been trying to re-align our OpenMP fork
>> (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master. Much of our
>> early work has now been superseded by the threadcomm implementations.
>> Nevertheless, there are still a few algorithmic differences between the two
>> branches:
>>
>> 1) Enforcing MPI latency hiding by using a task-based spMV:
>> If the MPI implementation used does not actually provide truly asynchronous
>> communication in hardware, performance can be increased by dedicating a
>> single thread to overlapping MPI communication in PETSc. However, this is
>> arguably a vendor-specific fix which requires significant code changes (i.e.
>> the parallel section needs to be raised up by one level). So perhaps the
>> strategy should be to give guilty vendors a hard time rather than messing up
>> the current abstraction. (See the second sketch after the thread below.)
>>
>> 2) Nonzero-based thread partitioning:
>> Rather than dividing the number of rows evenly among threads, we can
>> partition the thread ownership ranges according to the number of non-zeros
>> in each row. This balances the work load between threads and thus improves
>> strong scalability through better bandwidth utilisation. In general, this
>> optimisation should integrate well with threadcomms, since it only changes
>> the thread ownership ranges, but it does require some structural changes
>> because nnz is currently not passed to PetscLayoutSetUp. Any thoughts on
>> whether people regard such a scheme as useful would be greatly appreciated.
>> (See the third sketch after the thread below.)
>>
>> I think this should be handled by changing the AIJ data structure. Going all
>> the way to "2D" partitions would also allow us to handle power-law matrix
>> graphs. This would keep the thread implementation simple, and at the same
>> time be more flexible.
>>
>> Matt
>>
>> 3) MatMult_SeqBAIJ not threaded:
>> Is there a reason why MatMult has not been threaded for BAIJ matrices, or is
>> somebody already working on this? If not, I would like to prepare a pull
>> request for this using the same approach as MatMult_SeqAIJ.
>>
>> We would welcome any suggestions/feedback on this, in particular the second
>> point.
>> Up to date benchmarking results for the first two methods, including
>> BlueGene/Q, can be found in:
>> http://arxiv.org/abs/1307.4567
>>
>> Kind regards,
>>
>> Michael Lange
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener
>
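To make the flop-counting question concrete, here is a minimal standalone sketch (plain C, not the actual PETSc kernel) of a CSR matrix-vector product that computes both counts: the exact 2*nz - nonzerorow and the 2.0*a->nz - m approximation suggested above. The matrix data are made up purely for illustration; the two counts differ only when some rows are empty.

/* Standalone illustration (not the PETSc source) of the two flop counts for
 * y = A*x with A stored in CSR format:
 *   exact:        2*nz - nonzerorow  (one multiply per entry, one add per
 *                                      entry except the first in each
 *                                      non-empty row)
 *   approximate:  2*nz - m           (assumes every row has at least one
 *                                      entry, so no reduction is needed to
 *                                      count the non-empty rows)
 * The matrix below is hypothetical test data chosen so the counts differ. */
#include <stdio.h>

int main(void)
{
  const int    m    = 4, nz = 5;                 /* rows, stored entries */
  const int    ai[] = {0, 2, 4, 4, 5};           /* row pointers         */
  const int    aj[] = {0, 1, 1, 3, 2};           /* column indices       */
  const double aa[] = {1.0, 2.0, 3.0, 4.0, 5.0}; /* values (row 2 empty) */
  const double x[]  = {1.0, 1.0, 1.0, 1.0};
  double       y[4];
  int          i, j, nonzerorow = 0;

  for (i = 0; i < m; i++) {
    double sum = 0.0;
    if (ai[i + 1] > ai[i]) nonzerorow++;         /* row has at least one entry */
    for (j = ai[i]; j < ai[i + 1]; j++) sum += aa[j] * x[aj[j]];
    y[i] = sum;
  }

  printf("y = [%g %g %g %g]\n", y[0], y[1], y[2], y[3]);
  printf("exact flops       (2*nz - nonzerorow): %g\n", 2.0 * nz - nonzerorow);
  printf("approximate flops (2*nz - m)         : %g\n", 2.0 * nz - m);
  return 0;
}

On a matrix with no empty rows the two counts coincide, which is why the simpler 2.0*a->nz - m avoids the per-thread reduction over nonzero rows.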
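For point 1, a self-contained sketch (hypothetical, not code from either branch) of what "raising the parallel section up by one level" buys: the OpenMP region is opened before the communication has completed, thread 0 drives an outstanding MPI exchange to completion while the other threads do purely local work. A ring exchange and a local reduction stand in for PETSc's VecScatter and the local part of MatMult; all names and sizes are invented.

/* Hypothetical sketch of dedicating one thread to MPI progress while the
 * remaining threads perform the purely local part of the computation.
 * Compile with e.g. mpicc -fopenmp; this is not PETSc code. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double x[N];

int main(int argc, char **argv)
{
  int         rank, size, provided, i;
  double      sendval, recvval = 0.0, localsum = 0.0;
  MPI_Request reqs[2];

  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  for (i = 0; i < N; i++) x[i] = 1.0;
  sendval = (double)rank;

  /* post a ring exchange (stand-in for the halo scatter) */
  MPI_Irecv(&recvval, 1, MPI_DOUBLE, (rank + size - 1) % size, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(&sendval, 1, MPI_DOUBLE, (rank + 1) % size, 0, MPI_COMM_WORLD, &reqs[1]);

  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0) {
      /* dedicated communication thread: drive the exchange to completion */
      MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    } else {
      /* the other threads do the purely local work in the meantime */
      int    t = omp_get_thread_num(), nt = omp_get_num_threads();
      int    lo = (int)((long)N * (t - 1) / (nt - 1));
      int    hi = (int)((long)N * t / (nt - 1));
      double mysum = 0.0;
      int    j;
      for (j = lo; j < hi; j++) mysum += x[j];
      #pragma omp atomic
      localsum += mysum;
    }
  }

  /* only after the exchange completes may the received halo value be used */
  printf("rank %d: local sum %g, halo value %g\n", rank, localsum, recvval);
  MPI_Finalize();
  return 0;
}

Under MPI_THREAD_FUNNELED only the master thread may make MPI calls, which is why the waiting is pinned to thread 0; whether this beats a well-implemented asynchronous MPI is exactly the vendor-specific question raised above.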
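For point 2, a sketch of how the ownership ranges could be balanced by nonzeros instead of rows, assuming only the CSR row-pointer array is available. The function name and the example row pointers are hypothetical; this is not an existing PETSc routine, and in PETSc it would presumably live wherever the ownership ranges are set up once nnz information is passed there.

/* Hypothetical helper: split m rows over nthreads threads so that each
 * thread owns roughly nz/nthreads stored entries rather than m/nthreads
 * rows.  ai is the CSR row-pointer array of length m+1, so ai[m] == nz,
 * and ranges has length nthreads+1 with thread t owning rows
 * [ranges[t], ranges[t+1]). */
#include <stdio.h>

static void ComputeNonzeroBalancedRanges(int m, const int *ai, int nthreads, int *ranges)
{
  int t, row = 0;

  ranges[0] = 0;
  for (t = 1; t < nthreads; t++) {
    /* first row at which threads 0..t-1 already own ~t/nthreads of the entries */
    double target = (double)ai[m] * t / nthreads;
    while (row < m && ai[row] < target) row++;
    ranges[t] = row;
  }
  ranges[nthreads] = m;
}

int main(void)
{
  /* made-up row pointers: 8 rows with the nonzeros skewed towards the top */
  const int m = 8, nthreads = 4;
  const int ai[] = {0, 10, 18, 24, 28, 30, 31, 32, 33};
  int       ranges[5], t;

  ComputeNonzeroBalancedRanges(m, ai, nthreads, ranges);
  for (t = 0; t < nthreads; t++)
    printf("thread %d: rows [%d, %d), %d nonzeros\n",
           t, ranges[t], ranges[t + 1], ai[ranges[t + 1]] - ai[ranges[t]]);
  return 0;
}

An even split by rows would give the four threads 18, 10, 3, and 2 of the 33 entries in this example; the nonzero-based split gives 10, 8, 10, and 5.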
