On Tue, Feb 13, 2018 at 8:48 PM, Jed Brown <j...@jedbrown.org> wrote:
> Richard Tran Mills <rtmi...@anl.gov> writes:
> > I haven't experimented very thoroughly with it (hmm... should probably do
> > such experiments), but I believe that, once matrix rows become
> > long, then SELL doesn't provide an advantage over AIJ.
> What is the performance model that explains why SELL doesn't benefit
> from long rows? It's clear for large blocks where BAIJ can be compiled
> to vectorized loads and stores, but much less clear when AIJ is
> producing basically scalar code.
Hmm. I may be mistaken in my statements about long rows and AIJ -- I think
it might be the case that every matrix with long rows for which I've seen
decent performance with AIJ was one that uses and benefits from
the Inode versions of the routines.
Why do you say that AIJ is producing basically scalar code? With the Intel
18.0 compiler and "-O3 -xMIC-AVX512" flags, the compiler report indicates
that the main loop in MatMult_SeqAIJ() is vectorized and the estimated
speedup is 3.410 -- not great, but not terrible, either. I unfortunately
never got very good at interpreting the assembly files generated by the
compiler -- I generally used VTune to look at the assembly for profiled
loops, and I haven't tried that yet -- but my (possibly
confused) look through the assembler output for that loop shows vector
instructions like vfmadd231pd. Am I missing something? (Entirely possible.)
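
For concreteness, here is a minimal sketch of the kind of CSR (AIJ-style) matrix-vector loop under discussion. This is not the actual MatMult_SeqAIJ() source; the function name csr_matvec and its argument names are made up for illustration. The inner loop is a reduction over the nonzeros of one row: val and col_idx are read contiguously, while x is accessed indirectly through col_idx. A vectorizing compiler may turn this into gathers plus fused multiply-adds, which would be consistent with seeing vfmadd231pd in the assembly, though whether that pays off presumably depends on row length and on the cost of the gathers.

```c
#include <stddef.h>

/* Hypothetical illustration, not PETSc code.
 * Computes y = A*x for A stored in CSR format:
 *   row_ptr has nrows+1 entries; the nonzeros of row i occupy
 *   positions row_ptr[i] .. row_ptr[i+1]-1 of col_idx and val. */
void csr_matvec(size_t nrows,
                const int *restrict row_ptr,
                const int *restrict col_idx,
                const double *restrict val,
                const double *restrict x,
                double *restrict y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        /* Reduction over one row: contiguous loads of val and col_idx,
         * indirect (gather-style) loads of x. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            sum += val[k] * x[col_idx[k]];
        }
        y[i] = sum;
    }
}
```

With short rows the inner trip count is small, so much of any vector width would go to remainder handling and the horizontal sum; with long rows the same loop has more work to amortize that overhead.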