Re: [petsc-users] PETSc with modern C++

Matthew Knepley Tue, 04 Apr 2017 18:18:29 -0700

On Tue, Apr 4, 2017 at 3:40 PM, Filippo Leonardi <[email protected]>
wrote:


> I had weird issues where gcc (that I am using for my tests right now)
> wasn't vectorising properly (even enabling all flags, from tree-vectorize,
> to mavx). According to my tests, I know the Intel compiler was a bit better
> at that.
>

We are definitely at the mercy of the compiler for this. Maybe Jed has an
idea why its not vectorizing.


> I actually did not know PETSc was doing some unrolling himself. On my
> machine, PETSc aligns the memory to 16 bytes, that might also be a cause.
> However, I have no idea on the ability of the different compilers to detect
> and vectorize codes, even when those are manually unrolled by the user. I
> see PETSc unrolls the loop for MAXPY up 4 right hand side vectors. Just by
> coincidence, the numbers I reported are actually for this latter case.
>
> I pushed my code to a public git repository. However let me stress that
> the code is extremely ugly and buggy. I actually feel shame in showing you
> the code :) :
> https://bitbucket.org/FilippoL/petscxx/commits/branch/master
>
> If you want to look at the API usage, you can find it in the file
> "/src/benchmarks/bench_vector.cpp" after line 153.
>
> If you want the expressions generating the kernels you have to look at the
> file "/src/base/vectorexpression.hpp", specifically the "evalat" members.
>

I looked. I guess it it a matter of taste whether this seems simpler, or
unrolling and putting in an intrinsic call. The automation is undeniable.

   Matt


> On Tue, 4 Apr 2017 at 21:39 Matthew Knepley <[email protected]> wrote:
>
> On Tue, Apr 4, 2017 at 1:19 PM, Filippo Leonardi <[email protected]>
> wrote:
>
> You are in fact right, it is the same speedup  of approximatively 2.5x
> (with 2 ranks), my brain rounded up to 3. (This was just a test done in 10
> min on my Workstation, so no pretence to be definite, I just wanted to have
> an indication).
>
>
> Hmm, it seems like PetscKernelAXPY4() is just not vectorizing correctly
> then. I would be interested to see your code.
>
> As you say, I am using OpenBLAS, so I wouldn't be surprised of those
> results. If/when I use MKL (or something similar), I really do not expect
> such an improvement).
>
> Since you seem interested (if you are interested, I can give you all the
> details): the comparison I make, is with "petscxx" which is my template
> code (which uses a single loop) using AVX (I force PETSc to align the
> memory to 32 bit boundary and then I use packets of 4 doubles). Also notice
> that I use vectors with nice lengths, so there is no need to "peel" the end
> of the loop. The "PETSc" simulation is using PETSc's VecMAXPY.
>
>
> Thanks,
>
>    Matt
>
>
> On Tue, 4 Apr 2017 at 19:12 Barry Smith <[email protected]> wrote:
>
>
>   MAXPY isn't really a BLAS 1 since it can reuse some data in certain
> vectors.
>
>
> > On Apr 4, 2017, at 10:25 AM, Filippo Leonardi <[email protected]>
> wrote:
> >
> > I really appreciate the feedback. Thanks.
> >
> > That of deadlock, when the order of destruction is not preserved, is a
> point I hadn't thought of. Maybe it can be cleverly addressed.
> >
> > PS: If you are interested, I ran some benchmark on BLAS1 stuff and, for
> a single processor, I obtain:
> >
> > Example for MAXPY, with expression templates:
> > BM_Vector_petscxx_MAXPY/8 38 ns 38 ns 18369805
> > BM_Vector_petscxx_MAXPY/64 622 ns 622 ns 1364335
> > BM_Vector_petscxx_MAXPY/512 281 ns 281 ns 2477718
> > BM_Vector_petscxx_MAXPY/4096 2046 ns 2046 ns 349954
> > BM_Vector_petscxx_MAXPY/32768 18012 ns 18012 ns 38788
> > BM_Vector_petscxx_MAXPY_BigO 0.55 N 0.55 N
> > BM_Vector_petscxx_MAXPY_RMS 7 % 7 %
> > Direct call to MAXPY:
> > BM_Vector_PETSc_MAXPY/8 33 ns 33 ns 20973674
> > BM_Vector_PETSc_MAXPY/64 116 ns 116 ns 5992878
> > BM_Vector_PETSc_MAXPY/512 731 ns 731 ns 963340
> > BM_Vector_PETSc_MAXPY/4096 5739 ns 5739 ns 122414
> > BM_Vector_PETSc_MAXPY/32768 46346 ns 46346 ns 15312
> > BM_Vector_PETSc_MAXPY_BigO 1.41 N 1.41 N
> > BM_Vector_PETSc_MAXPY_RMS 0 % 0 %
> >
> > And 3x speedup on 2 MPI ranks (not much communication here, anyway). I
> am now convinced that this warrants some further investigation/testing.
> >
> >
> > On Tue, 4 Apr 2017 at 01:08 Jed Brown <[email protected]> wrote:
> > Matthew Knepley <[email protected]> writes:
> >
> > >> BLAS. (Here a interesting point opens: I assume an efficient BLAS
> > >>
> > >> implementation, but I am not so sure about how the different BLAS do
> > >> things
> > >>
> > >> internally. I work from the assumption that we have a very well tuned
> BLAS
> > >>
> > >> implementation at our disposal).
> > >>
> > >
> > > The speed improvement comes from pulling vectors through memory fewer
> > > times by merging operations (kernel fusion).
> >
> > Typical examples are VecMAXPY and VecMDot, but note that these are not
> > xGEMV because the vectors are independent arrays rather than single
> > arrays with a constant leading dimension.
> >
> > >> call VecGetArray. However I will inevitably foget to return the array
> to
> > >>
> > >> PETSc. I could have my new VecArray returning an object that restores
> the
> > >>
> > >> array
> > >>
> > >> when it goes out of scope. I can also flag the function with
> [[nodiscard]]
> > >> to
> > >>
> > >> prevent the user to destroy the returned object from the start.
> > >>
> > >
> > > Jed claims that this pattern is no longer preferred, but I have
> forgotten
> > > his argument.
> > > Jed?
> >
> > Destruction order matters and needs to be collective.  If an error
> > condition causes destruction to occur in a different order on different
> > processes, you can get deadlock.  I would much rather have an error
> > leave some resources (for the OS to collect) than escalate into
> > deadlock.
> >
> > > We have had this discussion for years on this list. Having separate
> names
> > > for each type
> > > is really ugly and does not achieve what we want. We want smooth
> > > interoperability between
> > > objects with different backing types, but it is still not clear how to
> do
> > > this.
> >
> > Hide it internally and implicitly promote.  Only the *GetArray functions
> > need to be parametrized on numeric type.  But it's a lot of work on the
> > backend.
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>


-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener

Re: [petsc-users] PETSc with modern C++

Reply via email to