Chris, You did drop into the middle of a discussion, but it still sounds like you dropped into the right discussion.
The context here is this: - we actually often do have sparse-sparse operations. These are, as you say, channel limited rather than CPU limited - we have interesting combinations of operations such as A'A or A'B or elementwise operations of various sorts which can be computed in many different ways. Some of the matrices have strong row-major preference, some are transposed row-major stuff and thus are very column major preferent. Some of the matrices have good sequential access, but terrible random access, others have great random access, but terrible cache properties. - even dense matrices have very strong access pattern preferences for obvious reasons - our math system is becoming a two level system. On the upper level, we do lazy evaluation and some simple cost based heuristic rearrangement to improve performance. On the lower level, the large scale operations are set, but we get lots of mileage out of a cost based optimization that picks the looping pattern to use. This optimizer looks properties of the vectors in question as well as of the functions being executed and decides what it can do. Our current discussion is about how we can improve the large matrix operations at the lower level. This is prompted by the discovery that one of our sparse matrices was doing an abysmal job of optimizing certain products (i.e. it was devolving to full dense operation). I put a quick hack in place to avoid that, but it would be good to get something better. We would get bonus points if the physical operations that we expose at the lower level are expressive enough to allow the higher level to do a better job of large scale optimization. This connects pretty closely to your comments in a few areas. Where you are talking about communication minimization, read high level optimizer. Were you talk about cache friendliness, read our low level optimizer. The big difference is that we really do care about sparse-sparse cases and don't have a way to reduce to sparse-dense cases. In a few cases, we might have matrices that have skewed sparsity such that they could be row and column permuted into a very small dense kernel adjoined to the right and below with relatively sparse skinny matrices and diagonally down and to the right with a superbly sparse matrix. Whether or not this really helps is an open question and we have often depended on the near randomness of our native arrangement to argue that we won't have much imbalance in the parallel execution. Many times this is even true. So does that catch you up? On Mon, Jun 9, 2014 at 4:45 PM, Chris Baker <[email protected]> wrote: > I just joined the mailing lists and have apparently dropped in the middle > of the conversation, so excuse me if I'm repeating stuff that's already > been said and/or isn't useful. > > Sparse mat-vec is, in general, a bandwidth-limited operation. It has very > low arithmetic density and is, in general, a waste of a good CPU. I'll > assume dense operand vector, since that is the use case for most algorithms > (and since it's what I'm most familiar with ;). > > The best options for improving the performance are to improve the memory > access patterns through the operand vector (assuming SRM). One of the > simplest way to do that is to turn your sparse matrix into a block matrix > and then rely on efficient local/dense matvec to make up for the zero > elements that you undoubtedly introduced; this generally requires > reordering the matrix to expose these blocks. If the analysis required for > this isn't itself parallel, then you've traded your embarrassingly parallel > (albeit, computationally dense) for a largely serial operation; you'd > better be using a lot of times if you want to amortize that cost. > > There is a whole class of communication avoiding algorithms that seek to > avoid this off-die and off-node data transfer, for example, by applying the > matrix in powers (exploit a little extra transfer to perform A^3, A^2 and A > at the same time and perform three steps of, e.g., Lanczos, while paying > for the latency of only one). However, the performance of these is still > very dependent on the sparsity structure of the matrix, and the algorithms > are significantly more difficult and numerically sensitive. > > Block algorithms (those that apply the matrix to multiple operand vectors > at the same time) are also useful, because they allow for re-use of the > sparsity structure. Course, that's a chicken-egg issue; there's limited > incentive to write block algorithms without a block sparse mat-vec, and > limited reason to write a blocked mat-vec without the indication that folks > will write algorithms. > > > On Mon, Jun 9, 2014 at 7:20 PM, Dmitriy Lyubimov <[email protected]> > wrote: > > > no argument against it, except who's going to do e2e local math optimizer > > and how long it will take. maybe, both is possible -- faster local gains > we > > know current algorithms need, and longer term overhaul that included e2e > > in-core plans. > > > > > > On Mon, Jun 9, 2014 at 4:16 PM, Ted Dunning <[email protected]> > wrote: > > > > > On Mon, Jun 9, 2014 at 4:12 PM, Dmitriy Lyubimov <[email protected]> > > > wrote: > > > > > > > One open question i had was,well, since the outlined approach avoids > > > > logical plans still for in-core matrix, it would be difficult to > > optimize > > > > the expressions in this context (i.e. figure out best-working > _output_ > > > > structure). > > > > > > > > > > This is the single biggest flaw that I see. To do higher level > > > optimizations, the lower levels have to expose costs for possible > plans. > > > > > > Taking a major whack at the local cases for element-wise and matrix > > > multiply is still a good thing to do. > > > > > >
