On Tue, Apr 4, 2017 at 10:25 AM, Filippo Leonardi <filippo.l...@gmail.com> wrote:
> I really appreciate the feedback. Thanks.
>
> The deadlock issue, when the order of destruction is not preserved, is a
> point I hadn't thought of. Maybe it can be cleverly addressed.
>
> PS: If you are interested, I ran some benchmarks on BLAS1 operations and,
> for a single processor, I obtain:
>
> Example for MAXPY, with expression templates:
>
>   Benchmark                             Time        CPU   Iterations
>   BM_Vector_petscxx_MAXPY/8            38 ns      38 ns     18369805
>   BM_Vector_petscxx_MAXPY/64          622 ns     622 ns      1364335
>   BM_Vector_petscxx_MAXPY/512         281 ns     281 ns      2477718
>   BM_Vector_petscxx_MAXPY/4096       2046 ns    2046 ns       349954
>   BM_Vector_petscxx_MAXPY/32768     18012 ns   18012 ns        38788
>   BM_Vector_petscxx_MAXPY_BigO       0.55 N     0.55 N
>   BM_Vector_petscxx_MAXPY_RMS            7 %        7 %
>
> Direct call to MAXPY:
>
>   BM_Vector_PETSc_MAXPY/8              33 ns      33 ns     20973674
>   BM_Vector_PETSc_MAXPY/64            116 ns     116 ns      5992878
>   BM_Vector_PETSc_MAXPY/512           731 ns     731 ns       963340
>   BM_Vector_PETSc_MAXPY/4096         5739 ns    5739 ns       122414
>   BM_Vector_PETSc_MAXPY/32768       46346 ns   46346 ns        15312
>   BM_Vector_PETSc_MAXPY_BigO         1.41 N     1.41 N
>   BM_Vector_PETSc_MAXPY_RMS              0 %        0 %
>
> And 3x speedup on 2 MPI ranks (not much communication here, anyway). I am
> now convinced that this warrants some further investigation/testing.

1) There is no communication here, so 3x for 2 ranks means the result is
not believable.

2) I do not understand what the two cases are. Do you mean that expression
templates vectorize the scalar work? I am surprised that BLAS is so bad,
but I guess not completely surprised. What BLAS?

  Matt

> On Tue, 4 Apr 2017 at 01:08 Jed Brown <j...@jedbrown.org> wrote:
>
>> Matthew Knepley <knep...@gmail.com> writes:
>>
>> >> BLAS. (Here an interesting point opens: I assume an efficient BLAS
>> >> implementation, but I am not so sure about how the different BLAS do
>> >> things internally. I work from the assumption that we have a very
>> >> well-tuned BLAS implementation at our disposal.)
>> >
>> > The speed improvement comes from pulling vectors through memory fewer
>> > times by merging operations (kernel fusion).
>>
>> Typical examples are VecMAXPY and VecMDot, but note that these are not
>> xGEMV because the vectors are independent arrays rather than single
>> arrays with a constant leading dimension.
>>
>> >> call VecGetArray. However, I will inevitably forget to return the
>> >> array to PETSc. I could have my new VecArray return an object that
>> >> restores the array when it goes out of scope. I can also flag the
>> >> function with [[nodiscard]] to prevent the user from discarding the
>> >> returned object right away.
>> >
>> > Jed claims that this pattern is no longer preferred, but I have
>> > forgotten his argument. Jed?
>>
>> Destruction order matters and needs to be collective. If an error
>> condition causes destruction to occur in a different order on different
>> processes, you can get deadlock. I would much rather have an error
>> leave some resources (for the OS to collect) than escalate into
>> deadlock.
>>
>> > We have had this discussion for years on this list. Having separate
>> > names for each type is really ugly and does not achieve what we want.
>> > We want smooth interoperability between objects with different backing
>> > types, but it is still not clear how to do this.
>>
>> Hide it internally and implicitly promote. Only the *GetArray functions
>> need to be parametrized on numeric type. But it's a lot of work on the
>> backend.
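
To make the memory-traffic point concrete, here is a rough model of the
difference on raw arrays (the names and loop structure are mine, not
petscxx or PETSc source): the unfused version makes one full pass over y
per term, while the fused version, which is what VecMAXPY or an
expression-template evaluator can generate, makes a single pass.

    #include <cstddef>

    // Unfused: n separate AXPY-style passes; y is read and written n times.
    void maxpy_unfused(double *y, std::size_t len, std::size_t n,
                       const double *a, const double *const *x)
    {
      for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < len; ++i)
          y[i] += a[k] * x[k][i];
    }

    // Fused: a single pass over y. The x[k] remain independent arrays,
    // which is why this is VecMAXPY territory rather than xGEMV.
    void maxpy_fused(double *y, std::size_t len, std::size_t n,
                     const double *a, const double *const *x)
    {
      for (std::size_t i = 0; i < len; ++i) {
        double acc = y[i];
        for (std::size_t k = 0; k < n; ++k)
          acc += a[k] * x[k][i];
        y[i] = acc;
      }
    }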
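
And a minimal sketch of the scope-guard idea being discussed, assuming
C++17 for the class-level [[nodiscard]]; only VecGetArray/VecRestoreArray
are real PETSc calls, the wrapper and its name are hypothetical and error
codes are ignored for brevity:

    #include <petscvec.h>

    class [[nodiscard]] VecArrayGuard {
    public:
      explicit VecArrayGuard(Vec v) : v_(v) { VecGetArray(v_, &a_); }
      ~VecArrayGuard() { VecRestoreArray(v_, &a_); }  // array handed back here
      VecArrayGuard(const VecArrayGuard &) = delete;
      VecArrayGuard &operator=(const VecArrayGuard &) = delete;
      PetscScalar *data() const { return a_; }

    private:
      Vec          v_;
      PetscScalar *a_ = nullptr;
    };

    // Usage:
    //   {
    //     VecArrayGuard ya(y);
    //     ya.data()[0] = 1.0;
    //   }  // VecRestoreArray runs automatically at end of scope

Jed's objection applies to any destructor that does collective work
(VecDestroy, MPI calls, ...): if an error unwinds scopes in a different
order on different ranks, those collective destructors can deadlock, which
is the argument for letting an error leak resources instead.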
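
On the numeric-type question, a toy picture of what "hide it internally
and parametrize only *GetArray" could look like; VecStorage and
get_array<T> are entirely made up, this is just the shape of the idea, not
a PETSc interface:

    #include <variant>
    #include <vector>

    // Storage precision is an internal detail of the object ...
    struct VecStorage {
      std::variant<std::vector<float>, std::vector<double>> data;
    };

    // ... and only the accessor is parametrized on the numeric type the
    // caller wants, promoting (converting on copy) when the backing type
    // differs.
    template <typename T>
    std::vector<T> get_array(const VecStorage &v)
    {
      return std::visit([](const auto &d) {
        return std::vector<T>(d.begin(), d.end());
      }, v.data);
    }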
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener