On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi <[email protected]> wrote:
> I had weird issues where gcc (that I am using for my tests right now) > wasn't vectorising properly (even enabling all flags, from tree-vectorize, > to mavx). According to my tests, I know the Intel compiler was a bit better > at that. > We are definitely at the mercy of the compiler for this. Maybe Jed has an idea why its not vectorizing. > I actually did not know PETSc was doing some unrolling himself. On my > machine, PETSc aligns the memory to 16 bytes, that might also be a cause. > However, I have no idea on the ability of the different compilers to detect > and vectorize codes, even when those are manually unrolled by the user. I > see PETSc unrolls the loop for MAXPY up 4 right hand side vectors. Just by > coincidence, the numbers I reported are actually for this latter case. > > I pushed my code to a public git repository. However let me stress that > the code is extremely ugly and buggy. I actually feel shame in showing you > the code :) : > https://bitbucket.org/FilippoL/petscxx/commits/branch/master > > If you want to look at the API usage, you can find it in the file > "/src/benchmarks/bench_vector.cpp" after line 153. > > If you want the expressions generating the kernels you have to look at the > file "/src/base/vectorexpression.hpp", specifically the "evalat" members. > I looked. I guess it it a matter of taste whether this seems simpler, or unrolling and putting in an intrinsic call. The automation is undeniable. Matt > On Tue, 4 Apr 2017 at 21:39 Matthew Knepley <[email protected]> wrote: > > On Tue, Apr 4, 2017 at 1:19 PM, Filippo Leonardi <[email protected]> > wrote: > > You are in fact right, it is the same speedup of approximatively 2.5x > (with 2 ranks), my brain rounded up to 3. (This was just a test done in 10 > min on my Workstation, so no pretence to be definite, I just wanted to have > an indication). > > > Hmm, it seems like PetscKernelAXPY4() is just not vectorizing correctly > then. I would be interested to see your code. > > As you say, I am using OpenBLAS, so I wouldn't be surprised of those > results. If/when I use MKL (or something similar), I really do not expect > such an improvement). > > Since you seem interested (if you are interested, I can give you all the > details): the comparison I make, is with "petscxx" which is my template > code (which uses a single loop) using AVX (I force PETSc to align the > memory to 32 bit boundary and then I use packets of 4 doubles). Also notice > that I use vectors with nice lengths, so there is no need to "peel" the end > of the loop. The "PETSc" simulation is using PETSc's VecMAXPY. > > > Thanks, > > Matt > > > On Tue, 4 Apr 2017 at 19:12 Barry Smith <[email protected]> wrote: > > > MAXPY isn't really a BLAS 1 since it can reuse some data in certain > vectors. > > > > On Apr 4, 2017, at 10:25 AM, Filippo Leonardi <[email protected]> > wrote: > > > > I really appreciate the feedback. Thanks. > > > > That of deadlock, when the order of destruction is not preserved, is a > point I hadn't thought of. Maybe it can be cleverly addressed. > > > > PS: If you are interested, I ran some benchmark on BLAS1 stuff and, for > a single processor, I obtain: > > > > Example for MAXPY, with expression templates: > > BM_Vector_petscxx_MAXPY/8 38 ns 38 ns 18369805 > > BM_Vector_petscxx_MAXPY/64 622 ns 622 ns 1364335 > > BM_Vector_petscxx_MAXPY/512 281 ns 281 ns 2477718 > > BM_Vector_petscxx_MAXPY/4096 2046 ns 2046 ns 349954 > > BM_Vector_petscxx_MAXPY/32768 18012 ns 18012 ns 38788 > > BM_Vector_petscxx_MAXPY_BigO 0.55 N 0.55 N > > BM_Vector_petscxx_MAXPY_RMS 7 % 7 % > > Direct call to MAXPY: > > BM_Vector_PETSc_MAXPY/8 33 ns 33 ns 20973674 > > BM_Vector_PETSc_MAXPY/64 116 ns 116 ns 5992878 > > BM_Vector_PETSc_MAXPY/512 731 ns 731 ns 963340 > > BM_Vector_PETSc_MAXPY/4096 5739 ns 5739 ns 122414 > > BM_Vector_PETSc_MAXPY/32768 46346 ns 46346 ns 15312 > > BM_Vector_PETSc_MAXPY_BigO 1.41 N 1.41 N > > BM_Vector_PETSc_MAXPY_RMS 0 % 0 % > > > > And 3x speedup on 2 MPI ranks (not much communication here, anyway). I > am now convinced that this warrants some further investigation/testing. > > > > > > On Tue, 4 Apr 2017 at 01:08 Jed Brown <[email protected]> wrote: > > Matthew Knepley <[email protected]> writes: > > > > >> BLAS. (Here a interesting point opens: I assume an efficient BLAS > > >> > > >> implementation, but I am not so sure about how the different BLAS do > > >> things > > >> > > >> internally. I work from the assumption that we have a very well tuned > BLAS > > >> > > >> implementation at our disposal). > > >> > > > > > > The speed improvement comes from pulling vectors through memory fewer > > > times by merging operations (kernel fusion). > > > > Typical examples are VecMAXPY and VecMDot, but note that these are not > > xGEMV because the vectors are independent arrays rather than single > > arrays with a constant leading dimension. > > > > >> call VecGetArray. However I will inevitably foget to return the array > to > > >> > > >> PETSc. I could have my new VecArray returning an object that restores > the > > >> > > >> array > > >> > > >> when it goes out of scope. I can also flag the function with > [[nodiscard]] > > >> to > > >> > > >> prevent the user to destroy the returned object from the start. > > >> > > > > > > Jed claims that this pattern is no longer preferred, but I have > forgotten > > > his argument. > > > Jed? > > > > Destruction order matters and needs to be collective. If an error > > condition causes destruction to occur in a different order on different > > processes, you can get deadlock. I would much rather have an error > > leave some resources (for the OS to collect) than escalate into > > deadlock. > > > > > We have had this discussion for years on this list. Having separate > names > > > for each type > > > is really ugly and does not achieve what we want. We want smooth > > > interoperability between > > > objects with different backing types, but it is still not clear how to > do > > > this. > > > > Hide it internally and implicitly promote. Only the *GetArray functions > > need to be parametrized on numeric type. But it's a lot of work on the > > backend. > > > -- > What most experimenters take for granted before they begin their > experiments is infinitely more interesting than any results to which their > experiments lead. > -- Norbert Wiener > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
