Richard Tran Mills <rtmi...@anl.gov> writes:

> On Tue, Feb 13, 2018 at 8:48 PM, Jed Brown <j...@jedbrown.org> wrote:
>
>> Richard Tran Mills <rtmi...@anl.gov> writes:
>>
>> > I haven't experimented very thoroughly with it (hmm... should probably do
>> > such experiments), but I believe that, once matrix rows become sufficiently
>> > long, then SELL doesn't provide an advantage over AIJ.
>>
>> What is the performance model that explains why SELL doesn't benefit
>> from long rows?  It's clear for large blocks where BAIJ can be compiled
>> to vectorized loads and stores, but much less clear when AIJ is
>> producing basically scalar code.
>>
>
> Hmm. I may be mistaken in my statements about long rows and AIJ -- I think
> it might be the case that every matrix with long rows for which I've seen
> decent performance with AIJ was something that uses and benefits from
> the Inode versions of the routines.

Have you tested that Inodes are an improvement?  I've seen the opposite
(by about 5%) on some recent Intel.

> Why do you say that AIJ is producing basically scalar code? With the Intel
> 18.0 compiler and "-O3 -xMIC-AVX512" flags, the compiler report indicates
> that the main loop in MatMult_SeqAIJ() is vectorized and the estimated
> speedup is 3.410 -- not great, but not terrible, either. I unfortunately
> never got very good at interpreting the assembly files generated by the
> compiler -- I generally used VTune to look at the assembly for profiled
> loops, and I haven't tried to do this right now -- but my (possibly
> confused) look through the assembler output for that loop shows vector
> instructions like vfmadd231pd. Am I missing something? (Entirely possible.)

I haven't looked at that assembly recently, but vectorizing that loop
would work (for long rows) if the compiler reorders the accumulation,
though there is a slow horizontal reduction at the end (probably not
very important).
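
To make the reordering concrete, here is a minimal sketch in C of one
CSR row dot product written both ways. This is not PETSc's
MatMult_SeqAIJ() kernel; the names row_dot_scalar, row_dot_lanes, and
NLANES are made up for illustration, and NLANES = 8 assumes 8 doubles
per AVX-512 register.

```c
#include <assert.h>
#include <stddef.h>

#define NLANES 8 /* assumption: 8 doubles fit in one AVX-512 register */

/* Reference: strict left-to-right accumulation over one CSR row.
 * val/col hold the row's n nonzeros and column indices; x is the input vector. */
double row_dot_scalar(size_t n, const double *val, const int *col, const double *x)
{
  assert(n == 0 || (val && col && x));
  double sum = 0.0;
  for (size_t k = 0; k < n; k++) sum += val[k] * x[col[k]];
  return sum;
}

/* Reordered: NLANES independent partial sums (what a vectorized FMA loop
 * effectively computes), followed by a horizontal reduction across lanes. */
double row_dot_lanes(size_t n, const double *val, const int *col, const double *x)
{
  assert(n == 0 || (val && col && x));
  double acc[NLANES] = {0.0};
  size_t k = 0;

  /* Main loop: lane l only ever accumulates entries k + l, so the lanes
   * do not depend on each other; x[col[...]] is still a gather. */
  for (; k + NLANES <= n; k += NLANES)
    for (int l = 0; l < NLANES; l++)
      acc[l] += val[k + l] * x[col[k + l]];

  /* Horizontal reduction: paid once per row, so it matters less as rows grow. */
  double sum = 0.0;
  for (int l = 0; l < NLANES; l++) sum += acc[l];

  /* Remainder: the last n % NLANES entries are handled in scalar form. */
  for (; k < n; k++) sum += val[k] * x[col[k]];
  return sum;
}
```

The two functions add the same products in a different order, so in
floating point they can differ in the last bits. That is why a compiler
can only make this transformation on its own when its floating-point
model permits reassociation. For short rows the main loop barely runs
and the reduction plus remainder dominate, which is consistent with the
"for long rows" caveat.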
