Thanks Barry and Jed,

This makes sense.

On a slightly separate note: Barry, can we always guarantee (or at
least forbid users from breaking) no-aliasing between PETSc
vectors and matrices?  I know MatMult and MatMultAdd forbid aliased
vectors, but nothing in PETSc prevents you from doing something silly
like stuffing the same buffer address into multiple vectors.
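
To make the concern concrete, here is a minimal sketch in plain C of the
kind of check I have in mind. The helper name buffers_overlap is
hypothetical and not part of PETSc; it only assumes we can get at the raw
array pointer and byte length behind each vector.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PETSc function: returns 1 if the byte ranges
   [a, a+na) and [b, b+nb) share at least one byte, 0 otherwise.
   Empty ranges never overlap anything. */
static int buffers_overlap(const void *a, size_t na, const void *b, size_t nb)
{
  uintptr_t a0 = (uintptr_t)a;
  uintptr_t b0 = (uintptr_t)b;

  if (na == 0 || nb == 0) return 0;
  /* Two half-open intervals intersect iff each starts before the other ends. */
  return a0 < b0 + nb && b0 < a0 + na;
}
```

Two vectors wrapped around the same user buffer would make this return 1,
and so would two partially overlapping windows into one allocation, which
is the case a plain pointer-equality test would miss.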

A

On Wed, Oct 6, 2010 at 8:50 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>   Make a whole new subclass of SeqAIJ (parallel to the Inode) that does all
> this cool stuff and copies into the new aligned data structures (rather than
> keeping the data in the same data structure, as the current inode does).
> We'll just have to get the factorization stuff to work eventually once you
> show good performance gain for MatMult_SeqAIJ_AlignedInode().
>
>
>   Barry
>
> On Oct 6, 2010, at 7:04 PM, Jed Brown wrote:
>
>> Looking at assembly generated from the Inode kernels, I see that it does not
>> use packed instructions within the blocks.  I tried both gcc-4.5.1 and
>> icc-11.0.081 at -O3, the latter took 3 minutes 40 seconds to compile
>> inode.c, but neither generated packed instructions.  Aron and John (cc'd)
>> see similar effects on Blue Gene.  The reason for this is that the input
>> arrays may not be aligned, and most of the packed instructions (except
>> movups/d) require 16-byte alignment; the situation is similar on BG.  The
>> code size to check and dispatch to a kernel that makes only valid alignment
>> assumptions would be enormous, so the compiler does not do it.
>>
>> This is not a huge deal on x86-64 since the operation is mostly memory
>> limited anyway, but it would be nice to have the ability to specify an
>> alignment to be guaranteed at the beginning of each row.  The situation is
>> quite different on Blue Gene where peak bandwidth can only be obtained with
>> (aligned) 16-byte loads into the packed registers.  Also, Intel/AMD will add
>> AVX next year which has 32-byte packed registers.  So it would be good if
>> the matrix kernels could support alignment constraints on the row starts
>> (padding out odd row lengths).
>>
>> I think it should be a runtime option rather than compiled in because, e.g.
>> a 5-point stencil would need to be padded out to 8 with single precision or
>> with double+AVX, and a 9-point stencil would be padded to 16 with
>> single+AVX.  A simulation that solved a light 2D problem coupled to a heavy
>> 3D problem (maybe on a smaller domain, or with less stiff time scales) would
>> suffer from having the choice compiled in.
>>
>> The Inode kernels could then be specialized for aligned row starts and
>> regular row lengths.  I could outfit an aligned MatMult_SeqAIJ_Inode with
>> SSE kernels in under an hour, so I don't think that is a huge time
>> investment.  Aron and John are looking at sparse kernels on Blue Gene where
>> alignment is perhaps more important; it sounds like they would be able to
>> contribute a couple Blue Gene kernels.
>>
>> I think it's also straightforward on the allocation front, but I don't know
>> if it would be complicated to make the factorization kernels handle the
>> padding.  Are there deep assumptions about unpadded that would be difficult
>> to remove?
>>
>> Jed
>
>
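
For reference, the padding arithmetic in Jed's stencil examples above can
be sketched as below. This is only an illustration: padded_row_length and
its parameters are hypothetical names, not existing PETSc code, and it
assumes the packed register width is a whole multiple of the scalar size.

```c
#include <stddef.h>

/* Hypothetical helper: round a row's nonzero count up to a whole number of
   packed registers.
     nnz          - nonzeros in the row (e.g. 5 for a 5-point stencil)
     vector_bytes - packed register width (16 for SSE, 32 for AVX)
     scalar_bytes - 4 for single precision, 8 for double
   Assumes vector_bytes is a nonzero multiple of scalar_bytes. */
static size_t padded_row_length(size_t nnz, size_t vector_bytes, size_t scalar_bytes)
{
  size_t per_vector = vector_bytes / scalar_bytes; /* scalars per register */

  return (nnz + per_vector - 1) / per_vector * per_vector;
}
```

With 16-byte registers and single precision a 5-entry row pads to 8, with
32-byte registers and double precision it also pads to 8, and a 9-entry
row with 32-byte registers and single precision pads to 16, matching the
numbers in the message. Since the result depends on both the register
width and the scalar type, it fits the argument for choosing the padding
at runtime rather than compiling it in.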
