I just joined the mailing lists and have apparently dropped in the middle of the conversation, so excuse me if I'm repeating stuff that's already been said and/or isn't useful.
Sparse mat-vec is, in general, a bandwidth-limited operation. It has very low arithmetic density and is, in general, a waste of a good CPU. I'll assume dense operand vector, since that is the use case for most algorithms (and since it's what I'm most familiar with ;). The best options for improving the performance are to improve the memory access patterns through the operand vector (assuming SRM). One of the simplest way to do that is to turn your sparse matrix into a block matrix and then rely on efficient local/dense matvec to make up for the zero elements that you undoubtedly introduced; this generally requires reordering the matrix to expose these blocks. If the analysis required for this isn't itself parallel, then you've traded your embarrassingly parallel (albeit, computationally dense) for a largely serial operation; you'd better be using a lot of times if you want to amortize that cost. There is a whole class of communication avoiding algorithms that seek to avoid this off-die and off-node data transfer, for example, by applying the matrix in powers (exploit a little extra transfer to perform A^3, A^2 and A at the same time and perform three steps of, e.g., Lanczos, while paying for the latency of only one). However, the performance of these is still very dependent on the sparsity structure of the matrix, and the algorithms are significantly more difficult and numerically sensitive. Block algorithms (those that apply the matrix to multiple operand vectors at the same time) are also useful, because they allow for re-use of the sparsity structure. Course, that's a chicken-egg issue; there's limited incentive to write block algorithms without a block sparse mat-vec, and limited reason to write a blocked mat-vec without the indication that folks will write algorithms. On Mon, Jun 9, 2014 at 7:20 PM, Dmitriy Lyubimov <[email protected]> wrote: > no argument against it, except who's going to do e2e local math optimizer > and how long it will take. maybe, both is possible -- faster local gains we > know current algorithms need, and longer term overhaul that included e2e > in-core plans. > > > On Mon, Jun 9, 2014 at 4:16 PM, Ted Dunning <[email protected]> wrote: > > > On Mon, Jun 9, 2014 at 4:12 PM, Dmitriy Lyubimov <[email protected]> > > wrote: > > > > > One open question i had was,well, since the outlined approach avoids > > > logical plans still for in-core matrix, it would be difficult to > optimize > > > the expressions in this context (i.e. figure out best-working _output_ > > > structure). > > > > > > > This is the single biggest flaw that I see. To do higher level > > optimizations, the lower levels have to expose costs for possible plans. > > > > Taking a major whack at the local cases for element-wise and matrix > > multiply is still a good thing to do. > > >
