Richard Tran Mills <rtmi...@anl.gov> writes:

> On Tue, Feb 13, 2018 at 8:48 PM, Jed Brown <j...@jedbrown.org> wrote:
>
>> Richard Tran Mills <rtmi...@anl.gov> writes:
>>
>> > I haven't experimented very thoroughly with it (hmm... should probably
>> > do such experiments), but I believe that, once matrix rows become
>> > sufficiently long, then SELL doesn't provide an advantage over AIJ.
>>
>> What is the performance model that explains why SELL doesn't benefit
>> from long rows?  It's clear for large blocks where BAIJ can be compiled
>> to vectorized loads and stores, but much less clear when AIJ is
>> producing basically scalar code.
>
> Hmm. I may be mistaken in my statements about long rows and AIJ -- I think
> it might be the case that every matrix with long rows for which I've seen
> decent performance with AIJ was something that uses and benefits from
> the Inode versions of the routines.
Have you tested that Inodes are an improvement?  I've seen the opposite
(by about 5%) on some recent Intel.

> Why do you say that AIJ is producing basically scalar code? With the Intel
> 18.0 compiler and "-O3 -xMIC-AVX512" flags, the compiler report indicates
> that the main loop in MatMult_SeqAIJ() is vectorized and the estimated
> speedup is 3.410 -- not great, but not terrible, either. I unfortunately
> never got very good at interpreting the assembly files generated by the
> compiler -- I generally used VTune to look at the assembly for profiled
> loops, and I haven't tried to do this right now -- but my (possibly
> confused) look through the assembler output for that loop shows vector
> instructions like vfmadd231pd. Am I missing something? (Entirely possible.)

I haven't looked at that assembly recently, but that would work (for long
rows) if it reorders the accumulation, though there is a slow horizontal
reduction at the end (probably not very important).