> On May 19, 2016, at 7:21 AM, <[email protected]>
> <[email protected]> wrote:
>
> Il 18-05-2016 21:23 Barry Smith ha scritto:
>
>>
>> Are the many small matrices you are extracting consisting of parts coming
>> from different processes? Or does each sub matrix come from the same process
>> (and remain on the same process) or are most coming from the same process
>> but some shared across multiple processes?
>>
>> The parallel code was originally written assuming one wishes to get a
>> small number of large sub matrices. Thus it is a bit communication heavy,
>> for example it doesn't share any communications or all reduce across the
>> matrices. It is, of course, possible to write custom code for your case but
>> in order to help you to do that we need to know more about your particular
>> use case.
I suspect that by far the biggest cost currently is that since the sub
matrices may span multiple processes each "getting" of a sub matrix requires
parallel communication and synchronization. Saving time on reusing a buffer is
very very minor compared to this cost.
To make your implementation really efficient I think you need to work
directly with the underlying MPIAIJ data structure so you can access the matrix
entries in their native format and you have to think about how to manage
efficiently that most of the time you don't need communication to get the sub
matrices but sometimes you do. Calling the PETSc MatGetSubmatrices will never
be efficient for your algorithm
Barry
>>
>> Barry
>>
>>
>>> On May 18, 2016, at 11:41 AM, Jed Brown <[email protected]> wrote:
>>>
>>> Could you give some more information about your preconditioner?
>>> Sequential extraction of many small submatrices is not high performance,
>>> but perhaps we can restructure.
>>>
>>> [email protected] writes:
>>>
>>>>
>>>>
>>>> Dear PETSc developers,
>>>> I have a matrix of the type MATMPIAIJ from which I want to extract small
>>>> sub-matrices and sub-vectors (number of rows from 1 to 50 maximum)
>>>> several times. This is a very fundamental step of a preconditioner that
>>>> I'm writing. After that, I want to solve these sub-systems locally.
>>>>
>>>> In my current implementation I use MatGetSubMatrices with
>>>> MAT_INITIAL_MATRIX to extract the sub-matrices and MatGetRow plus some
>>>> other artifacts to extract the sub-vectors. However, I see that this
>>>> procedure is too costly, once it creates and destroys the output
>>>> variables each time they are called. I thought about using the MAT_REUSE
>>>> flag, but I can't since the nonzero pattern of the local sub-matrix may
>>>> change from call to call.
>>>>
>>>> Can you give me some hint on how to do this task efficiently? I
>>>> appreciate your help!
>>>>
>>>> --
>>>>
>>>> Victor A. P. Magri - PhD student
>>>> Dept. of Civil, Environmental and Architectural Eng.
>>>> University of Padova
>>>> Via Marzolo, 9 - 35131 Padova, Italy
> Thank you for the prompt reply, Barry and Jed.
>
> If a communication reducing algorithm is firstly used such as the ones
> provided by METIS or Scotch, I would say that these sub-matrices are mostly
> coming from the same process and just some of them are shared across multiple
> processes. The preconditioner that I'm implementing is the Factorized Sparse
> Approximate Inverse with adaptive nonzero pattern generation. You can find it
> in the journal paper http://dl.acm.org/citation.cfm?doid=2732672.2629475. It
> relies in an explicit approximation (called matrix F) for the inverse of the
> lower triangular Cholesky factor.
>
> Given a maximum number of nonzeros per row of the F matrix, we want to find
> the best column positions and their respective values which gives the largest
> reduction in the Kaporin condition number of the preconditioned matrix. This
> search can be done concurrently from row to row. Considering an arbitrary row
> of F, we have to gather a sub-matrix and a sub-vector with sizes equal to the
> current number of nonzero elements already found for F and solve this
> subsystem, this process is repeated up to when the desired number of nonzeros
> is reached. Since the maximum size of these subsystems is known, I would like
> to allocate two local buffer variables (one for the sub-matrix and the other
> for the sub-vector) just one time and use it along the whole process, instead
> of creating and destroying new variables each time the sub-systems are
> gathered. This should reduce a lot the time for building the preconditioner
> since it is a task which is done a number of times.
>
> Thank you for your help!
>
> --
> Victor A. P. Magri - PhD student
> Dept. of Civil, Environmental and Architectural Eng.
> University of Padova
> Via Marzolo, 9 - 35131 Padova, Italy