Thanks Barry and Jed. This makes sense.
On a slightly separate note: Barry, can we always guarantee (or at least forbid users from breaking) no-aliasing between PETSc vectors and matrices? I know MatMult and MatMultAdd forbid aliased vectors, but nothing in PETSc prevents you from doing something silly like stuffing the same buffer address into multiple vectors.

A

On Wed, Oct 6, 2010 at 8:50 PM, Barry Smith <bsmith at mcs.anl.gov> wrote:

>
>   Make a whole new subclass of SeqAIJ (parallel to the Inode) that does all
> this cool stuff and copies into the new aligned data structures (rather than
> keeping the data in the same data structure (as the current inode does)).
> We'll just have to get the factorization stuff to work eventually once you
> show good performance gain for MatMult_SeqAIJ_AlignedInode().
>
>
>   Barry
>
> On Oct 6, 2010, at 7:04 PM, Jed Brown wrote:
>
>> Looking at assembly generated from the Inode kernels, I see that it does not
>> use packed instructions within the blocks.  I tried both gcc-4.5.1 and
>> icc-11.0.081 at -O3; the latter took 3 minutes 40 seconds to compile
>> inode.c, but neither generated packed instructions.  Aron and John (cc'd)
>> see similar effects on Blue Gene.  The reason for this is that the input
>> arrays may not be aligned, and most of the packed instructions (except
>> movups/d) require 16-byte alignment; the situation is similar on BG.  The
>> code size to check and dispatch to a kernel that makes only valid alignment
>> assumptions would be enormous, so the compiler does not do it.
>>
>> This is not a huge deal on x86-64 since the operation is mostly memory
>> limited anyway, but it would be nice to have the ability to specify an
>> alignment to be guaranteed at the beginning of each row.  The situation is
>> quite different on Blue Gene where peak bandwidth can only be obtained with
>> (aligned) 16-byte loads into the packed registers.  Also, Intel/AMD will add
>> AVX next year which has 32-byte packed registers.  So it would be good if
>> the matrix kernels could support alignment constraints on the row starts
>> (padding out odd row lengths).
>>
>> I think it should be a runtime option rather than compiled in because, e.g.,
>> a 5-point stencil would need to be padded out to 8 with single precision or
>> with double+AVX, and a 9-point stencil would be padded to 16 with
>> single+AVX.  A simulation that solved a light 2D problem coupled to a heavy
>> 3D problem (maybe on a smaller domain, or with less stiff time scales) would
>> suffer from having the choice compiled in.
>>
>> The Inode kernels could then be specialized for aligned row starts and
>> regular row lengths.  I could outfit an aligned MatMult_SeqAIJ_Inode with
>> SSE kernels in under an hour, so I don't think that is a huge time
>> investment.  Aron and John are looking at sparse kernels on Blue Gene where
>> alignment is perhaps more important; it sounds like they would be able to
>> contribute a couple of Blue Gene kernels.
>>
>> I think it's also straightforward on the allocation front, but I don't know
>> if it would be complicated to make the factorization kernels handle the
>> padding.  Are there deep assumptions about unpadded rows that would be
>> difficult to remove?
>>
>> Jed
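
For reference, the padding arithmetic in Jed's stencil examples can be sketched in a few lines of C. This is only an illustration under stated assumptions: `padded_row_length` is a hypothetical helper (not a PETSc routine), and the plain "single precision" case is read here as 16-byte SSE registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of PETSc: round a row length (number of
 * stored entries) up to a whole number of packed registers.
 *   nnz          - entries actually stored in the row
 *   align_bytes  - required alignment, e.g. 16 for SSE, 32 for AVX
 *   scalar_bytes - sizeof(float) or sizeof(double)
 * If every row has a padded length and the first row starts on an aligned
 * address, every later row start is aligned as well. */
static size_t padded_row_length(size_t nnz, size_t align_bytes, size_t scalar_bytes)
{
  size_t per_reg;

  assert(scalar_bytes > 0 && align_bytes % scalar_bytes == 0);
  per_reg = align_bytes / scalar_bytes;           /* scalars per register   */
  return (nnz + per_reg - 1) / per_reg * per_reg; /* round up to a multiple */
}
```

Because the alignment and scalar size are ordinary arguments, the choice can be made at runtime, which is the point of the 2D/3D coupling example: `padded_row_length(5, 16, sizeof(float))` and `padded_row_length(5, 32, sizeof(double))` both pad a 5-point stencil row to 8 entries, while `padded_row_length(9, 32, sizeof(float))` pads a 9-point stencil row to 16.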
