On Wed, 10 Oct 2007, Jaime Suarez wrote:
> Also, I would like to perform a matrix-vector multiplication in
> parallel, but this time the matrix would be symmetric, so the
> libraries for sparse matrices do not seem efficient for equally
> distributing the jobs among several processors. Can you give me some
> advice about what to do?
>
> Thanking you in advance,
>
> Jaime Suarez
>
>
This is a good point. Though the symmetric compressed sparse row (CSR)
matrix format is good for reducing memory usage, it makes load balancing
of the __floating point operations__ and __communication__ difficult. If
roughly the same number of rows is assigned to each process, the last
process has fewer nonzeros (and hence less floating point work to do);
in addition, the neighbor-to-neighbor communication is not symmetric.
The first process needs to receive messages from several other processes
but does not send ANY, while the last process receives no messages but
needs to send to several.

The other issue is when the messages are posted. There need to be TWO
"waves" of communication: each process needs to receive the remote
vector entries to do the "upper triangular" part of the multiply, and it
also needs to send the partial results it computed for the "lower
triangular" part of the multiply to the process that owns them. We found
that actually doing this as two separate VecScatters made the whole
multiply a good bit slower than the MPIAIJ format; hence we merged them
so there is only one VecScatter and hence one "wave" of communication.
BUT this introduces its own problem: the first process cannot start its
scatter until it has done ALL of the lower part of the multiply, so
later processes will be waiting a bit longer.
Ways to improve the situation:
1) Give the later processes more rows to balance the number of nonzeros
   (and hence floating point operations) per process. The drawbacks to
   this are
   a) calculating the rows that each process should get (which depends
      on the nonzero pattern),
   b) the synchronization of when sends and receives are posted by
      different processes is worse, and
   c) the rest of the code may now be unbalanced, since the later
      processes have many more entries in the vector.
2) Just use the MPIAIJ format and give up the memory savings.
3) Store the "block diagonal" portion of the matrix (the portion
   associated with the local rows AND columns of the matrix) in SBAIJ
   format, but redundantly store the off-diagonal entries on the
   appropriate processes. This has nice load balancing. If we assume
   the vast majority of the matrix entries are in the "block diagonal"
   portion, then this will still save most of the memory that the
   MPISBAIJ format saves.
4) ???
If there is a lot of demand for an efficient parallel symmetric format,
it may make sense for us to implement 3.
Barry