On Aug 9, 2013, at 8:31 AM, Michael Lange <[email protected]> wrote:

> Hi,
> 
> I just had a look at the threaded version of MatMult_SeqAIJ and I think the 
> Flops logging might be incorrect, because the nonzerorows aren't counted in 
> MatMult_SeqAIJ_Kernel. Fixing this in the thread kernel would require a 
> reduction though, which could impact performance. Is this a known problem, or 
> is there a better way to compute Flops that doesn't require the nonzero
> rows?

   Generally there should be some nonzeros in each row, so we could probably
just use 2.0*a->nz - m.

   Barry

   The reason for the nonzerorow counter goes back many years, to when we did
not handle the usecprow case. The "off-diagonal" part of MPI matrices has many
empty rows (most of them, in fact), and we wanted the flop count to take that
into account. Now only the nonzero rows are handled, so I think we can remove
nonzerorow from the code completely.
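
   To make the arithmetic explicit: a row with k nonzeros costs k multiplies
and k-1 additions, so the exact count is 2*nz minus the number of nonzero
rows, which reduces to 2.0*a->nz - m once empty rows no longer occur. A rough
standalone sketch (illustrative only, not the actual SeqAIJ source):

#include <stdio.h>

/* Illustrative helper: exact SpMV flop count for an m-row CSR matrix
   with row pointer ai (ai[m] == total number of stored nonzeros). */
static double SpMVFlops(const int *ai, int m)
{
  int nz = ai[m], nonzerorows = 0;
  for (int i = 0; i < m; i++)
    if (ai[i+1] > ai[i]) nonzerorows++;
  return 2.0*nz - nonzerorows;   /* == 2.0*nz - m when no row is empty */
}

int main(void)
{
  int ai[] = {0, 2, 2, 5};       /* 3 rows with 2, 0 and 3 nonzeros */
  printf("exact %g, approximate %g\n", SpMVFlops(ai, 3), 2.0*5 - 3);
  return 0;
}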

> 
> Alternatively, would it make sense to pre-compute the nonzerorows and store 
> them in the threadcomm? This might require more of the AIJ data structure to 
> be exposed to PetscLayoutSetUp / PetscThreadCommGetOwnershipRanges though. 
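> 
> Purely as a sketch of what that precomputation might look like (hypothetical
> helper, not PETSc code): count once, at assembly time, how many rows in each
> thread's ownership range have a nonzero, so the threaded kernel can log its
> flops without a reduction.
> 
> #include <stdio.h>
> 
> /* ai is the CSR row pointer; thread t owns rows trstarts[t] .. trstarts[t+1]-1 */
> static void CountNonzeroRowsPerThread(const int *ai, const int *trstarts,
>                                       int nthreads, int *nzrows)
> {
>   for (int t = 0; t < nthreads; t++) {
>     nzrows[t] = 0;
>     for (int i = trstarts[t]; i < trstarts[t+1]; i++)
>       if (ai[i+1] > ai[i]) nzrows[t]++;
>   }
> }
> 
> int main(void)
> {
>   int ai[] = {0, 2, 2, 5, 7};     /* 4 rows, one of them empty */
>   int trstarts[] = {0, 2, 4};     /* two threads, two rows each */
>   int nzrows[2];
>   CountNonzeroRowsPerThread(ai, trstarts, 2, nzrows);
>   printf("nonzero rows per thread: %d %d\n", nzrows[0], nzrows[1]);
>   return 0;
> }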
> 
> Regards,
> Michael 
> 
> On 08/08/13 12:08, Matthew Knepley wrote:
>> On Thu, Aug 8, 2013 at 5:37 AM, Michael Lange <[email protected]> 
>> wrote:
>> Hi,
>> 
>> We have recently been trying to re-align our OpenMP fork 
>> (https://bitbucket.org/ggorman/petsc-3.3-omp) with petsc/master. Much of our 
>> early work has now been superseded by the threadcomm implementations. 
>> Nevertheless, there are still a few algorithmic differences between the two 
>> branches:
>> 
>> 1) Enforcing MPI latency hiding by using task-based spMV:
>> If the MPI implementation used does not actually provide truly asynchronous
>> communication in hardware, performance can be increased by dedicating a
>> single thread to overlapping MPI communication in PETSc. However, this is
>> arguably a vendor-specific fix that requires significant code changes (i.e.
>> the parallel section needs to be raised by one level). So perhaps the
>> strategy should be to give guilty vendors a hard time rather than messing up
>> the current abstraction.
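>> 
>> Purely as an illustration of the pattern (a minimal sketch under simplifying
>> assumptions, not the actual petsc-3.3-omp code), the dedicated communication
>> thread looks roughly like this:
>> 
>> #include <mpi.h>
>> #include <omp.h>
>> #include <stdio.h>
>> 
>> int main(int argc, char **argv)
>> {
>>   int rank, size, provided;
>>   MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>> 
>>   /* post a nonblocking "halo" exchange with the neighbouring ranks */
>>   double sendval = (double)rank, recvval = -1.0;
>>   int right = (rank + 1) % size, left = (rank + size - 1) % size;
>>   MPI_Request reqs[2];
>>   MPI_Irecv(&recvval, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
>>   MPI_Isend(&sendval, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
>> 
>>   double local = 0.0;
>>   #pragma omp parallel reduction(+:local)
>>   {
>>     if (omp_get_thread_num() == 0) {
>>       /* dedicated thread: drive the MPI exchange to completion */
>>       MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
>>     } else {
>>       /* remaining threads: work that needs no halo data (stands in
>>          for the local part of the sparse matrix-vector product) */
>>       for (int i = 0; i < 1000000; i++) local += 1e-6;
>>     }
>>   }
>>   /* recvval is now valid; the off-process part would be applied here */
>>   printf("rank %d received %g, did local work %g\n", rank, recvval, local);
>>   MPI_Finalize();
>>   return 0;
>> }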
>> 
>> 2) Nonzero-based thread partitioning:
>> Rather than dividing the rows evenly among threads, we can partition the
>> thread ownership ranges according to the number of non-zeros in each row.
>> This balances the workload between threads and thus improves strong
>> scalability through better bandwidth utilisation. In general, this
>> optimisation should integrate well with threadcomms, since it only changes
>> the thread ownership ranges, but it does require some structural changes,
>> since nnz is currently not passed to PetscLayoutSetUp. Any thoughts on
>> whether people regard such a scheme as useful would be greatly appreciated.
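>> 
>> A very rough sketch of the idea (hypothetical helper, not the actual
>> petsc-3.3-omp code): choose the ownership boundaries from the CSR row
>> pointer so that each thread gets roughly the same number of nonzeros.
>> 
>> #include <stdio.h>
>> 
>> /* On return, thread t owns rows trstarts[t] .. trstarts[t+1]-1. */
>> static void PartitionRowsByNonzeros(const int *ai, int m, int nthreads,
>>                                     int *trstarts)
>> {
>>   int nz = ai[m], row = 0;
>>   trstarts[0] = 0;
>>   for (int t = 1; t < nthreads; t++) {
>>     long long target = ((long long)nz * t) / nthreads;
>>     while (row < m && ai[row] < target) row++;  /* first row with ai[row] >= target */
>>     trstarts[t] = row;
>>   }
>>   trstarts[nthreads] = m;
>> }
>> 
>> int main(void)
>> {
>>   int ai[] = {0, 4, 5, 6, 8};   /* rows with 4, 1, 1 and 2 nonzeros */
>>   int trstarts[3];
>>   PartitionRowsByNonzeros(ai, 4, 2, trstarts);
>>   printf("thread 0: rows [%d,%d), thread 1: rows [%d,%d)\n",
>>          trstarts[0], trstarts[1], trstarts[1], trstarts[2]);
>>   return 0;
>> }
>> 
>> In this toy example both threads end up with four nonzeros, whereas an even
>> row split ([0,2) and [2,4)) would give five and three.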
>> 
>> I think this should be handled by changing the AIJ data structure. Going
>> all the way to "2D" partitions would also allow us to handle power-law
>> matrix graphs. This would keep the thread implementation simple, and at the
>> same time be more flexible.
>> 
>>    Matt
>>  
>> 3) MatMult_SeqBAIJ not threaded:
>> Is there a reason why MatMult has not been threaded for BAIJ matrices, or is 
>> somebody already working on this? If not, I would like to prepare a pull 
>> request for this using the same approach as MatMult_SeqAIJ.
>> 
>> We would welcome any suggestions/feedback on this, in particular the second 
>> point. Up-to-date benchmarking results for the first two methods, including
>> results on BlueGene/Q, can be found at:
>> http://arxiv.org/abs/1307.4567
>> 
>> Kind regards,
>> 
>> Michael Lange
>> 
>> 
>> 
>> -- 
>> What most experimenters take for granted before they begin their experiments 
>> is infinitely more interesting than any results to which their experiments 
>> lead.
>> -- Norbert Wiener
> 
