Richard Tran Mills <rtmi...@anl.gov> writes:

> On Tue, Nov 14, 2017 at 12:13 PM, Zhang, Hong <hongzh...@anl.gov> wrote:
>
>> On Nov 13, 2017, at 10:49 PM, Xiangdong <epsco...@gmail.com> wrote:
>>
>> 1) How about the vectorization of BAIJ format?
>>
>> BAIJ kernels are optimized with manual unrolling, but not with AVX
>> intrinsics. So the vectorization relies on the compiler's ability.
>> It may or may not get vectorized depending on the compiler's
>> optimization decisions. But vectorization is not essential for the
>> performance of most BAIJ kernels.
>
> I know that this has come up in previous discussions, but I'm guessing
> that the manual unrolling actually impedes the ability of many modern
> compilers to optimize the BAIJ calculations. I suppose we ought to have
> a switch to enable or disable the use of the unrolled versions? (And,
> further down the road, some sort of performance model to tell us what
> the setting for the switch should be...)
I added a crude test for BAIJ(4); see branch 'jed/matbaij-loop'.
Clang-5.0 is a bit better than gcc-7.2 for this problem.  GCC produces
comparable code and performance with both versions, while Clang produces
tighter code (see below) for the current (fully unrolled) version, yet
that code actually executes slower than the loop code.  Testing as
below, which produces a matrix with 284160 nonzeros (a 2.4 MB matrix,
so it fits in my L3 cache).  I use BCGS instead of GMRES so that the
solve can be resident in cache.

$ mpich-clang-opt/tests/src/snes/examples/tutorials/ex19 -da_grid_x 60 -da_grid_y 60 -prandtl 1e4 -ksp_type bcgs -dm_mat_type baij -pc_type none -mat_baij_loop 0 -log_view |grep MatMult
MatMult            16269 1.0 1.8919e+00 1.0 9.01e+09 1.0 0.0e+00 0.0e+00 0.0e+00 78 77  0  0  0  78 77  0  0  0  4763

clang MatMult_SeqBAIJ_4

  0.73 │2f0:   movsxd rdi,DWORD PTR [rbp+0x0]
  2.44 │       add    rbp,0x4
  0.24 │       shl    rdi,0x5
  0.98 │       vbroad ymm1,QWORD PTR [rax+rdi*1]
  0.73 │       vbroad ymm2,QWORD PTR [rax+rdi*1+0x8]
  2.93 │       vbroad ymm3,QWORD PTR [rax+rdi*1+0x10]
  0.98 │       vbroad ymm4,QWORD PTR [rax+rdi*1+0x18]
  2.44 │       vfmadd ymm1,ymm0,YMMWORD PTR [rsi]
 23.47 │       vfmadd ymm1,ymm2,YMMWORD PTR [rsi+0x20]
  8.31 │       vfmadd ymm1,ymm3,YMMWORD PTR [rsi+0x40]
  0.98 │       vmovap ymm0,ymm1
 26.89 │       vfmadd ymm0,ymm4,YMMWORD PTR [rsi+0x60]
  0.49 │       sub    rsi,0xffffffffffffff80
       │       add    edx,0xffffffff
  0.24 │     ↑ jne    2f0

$ mpich-clang-opt/tests/src/snes/examples/tutorials/ex19 -da_grid_x 60 -da_grid_y 60 -prandtl 1e4 -ksp_type bcgs -dm_mat_type baij -pc_type none -mat_baij_loop 1 -log_view |grep MatMult
MatMult            16269 1.0 1.6305e+00 1.0 9.01e+09 1.0 0.0e+00 0.0e+00 0.0e+00 73 77  0  0  0  73 77  0  0  0  5527

  1.86 │130:   cdqe
       │       vmovup ymm2,YMMWORD PTR [rbx+rax*8]
 14.60 │       vmovup ymm3,YMMWORD PTR [rbx+rax*8+0x20]
  1.24 │       vmovup ymm4,YMMWORD PTR [rbx+rax*8+0x40]
 16.77 │       vmovup ymm5,YMMWORD PTR [rbx+rax*8+0x60]
  2.17 │       vmovap YMMWORD PTR [rsp+0xc0],ymm5
  0.93 │       vmovap YMMWORD PTR [rsp+0xa0],ymm4
  0.62 │       vmovap YMMWORD PTR [rsp+0x80],ymm3
  1.86 │       vmovap YMMWORD PTR [rsp+0x60],ymm2
  0.93 │       mov    esi,DWORD PTR [r13+rdi*4+0x0]
  0.62 │       shl    esi,0x2
  0.62 │       movsxd rsi,esi
  1.55 │       vbroad ymm2,QWORD PTR [rcx+rsi*8]
  2.17 │       vfmadd ymm2,ymm1,YMMWORD PTR [rsp+0x60]
  1.24 │       vbroad ymm1,QWORD PTR [rcx+rsi*8+0x8]
 10.56 │       vfmadd ymm1,ymm2,YMMWORD PTR [rsp+0x80]
  0.62 │       vbroad ymm2,QWORD PTR [rcx+rsi*8+0x10]
 13.35 │       vfmadd ymm2,ymm1,YMMWORD PTR [rsp+0xa0]
  1.86 │       vbroad ymm1,QWORD PTR [rcx+rsi*8+0x18]
 15.53 │       vfmadd ymm1,ymm2,YMMWORD PTR [rsp+0xc0]
       │       add    rdi,0x1
       │       add    eax,0x10
       │       cmp    rdi,rdx
       │     ↑ jl     130

The code with loops is faster with GCC as well, but the assembly is not
as clean in either case.  I don't have time to do more comprehensive
testing at the moment, but it would be really useful to test with other
block sizes, especially 3 (elasticity) and 5 (compressible flow), and
with other compilers (especially Intel).  If the performance advantage
of loops holds, we can eliminate tons of code from PETSc by judicious
use of inline functions.
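For anyone who wants to play with this without digging into the branch, the loop variant is essentially the following sketch (my own names and simplifications, not the actual MatMult_SeqBAIJ_4 source): one block row of a BAIJ(4) matrix, blocks stored contiguously column-major, with short fixed-trip-count loops that the compiler turns into the broadcast+FMA sequence shown in the annotation above.

```c
#include <stddef.h>

/* Hypothetical sketch of the loop-based BAIJ(4) kernel for one block
 * row; names are illustrative, not PETSc's.
 *   a  = the row's 4x4 blocks, contiguous, column-major within a block
 *   aj = block column index of each block
 *   x  = input vector; y receives the 4 output entries for this row
 * The j loop broadcasts one x entry; the i loop is one vector FMA. */
static void blockrow_mult4(size_t nblocks, const double *a, const int *aj,
                           const double *x, double y[4])
{
  double sum[4] = {0.0, 0.0, 0.0, 0.0};
  for (size_t b = 0; b < nblocks; b++) {
    const double *blk = a + 16 * b;    /* this 4x4 block */
    const double *xb  = x + 4 * aj[b]; /* x entries for its block column */
    for (int j = 0; j < 4; j++)        /* columns of the block */
      for (int i = 0; i < 4; i++)      /* rows of the block */
        sum[i] += blk[4 * j + i] * xb[j];
  }
  for (int i = 0; i < 4; i++) y[i] = sum[i];
}
```

The unrolled source spells out all 16 multiply-adds per block explicitly; the numbers above suggest the compiler vectorizes the short loops at least as well, which is what would let us replace the per-block-size unrolled kernels with one inlined loop version.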