On Tue, Apr 4, 2017 at 10:25 AM, Filippo Leonardi <filippo.l...@gmail.com> wrote:
> I really appreciate the feedback. Thanks.
>
> The deadlock issue, when the order of destruction is not preserved, is a
> point I hadn't thought of. Maybe it can be cleverly addressed.
>
> PS: If you are interested, I ran some benchmarks on BLAS1 operations and,
> for a single processor, I obtain:
>
> Example for MAXPY, with expression templates:
>
>   Benchmark                             Time        CPU   Iterations
>   BM_Vector_petscxx_MAXPY/8            38 ns      38 ns     18369805
>   BM_Vector_petscxx_MAXPY/64          622 ns     622 ns      1364335
>   BM_Vector_petscxx_MAXPY/512         281 ns     281 ns      2477718
>   BM_Vector_petscxx_MAXPY/4096       2046 ns    2046 ns       349954
>   BM_Vector_petscxx_MAXPY/32768     18012 ns   18012 ns        38788
>   BM_Vector_petscxx_MAXPY_BigO       0.55 N     0.55 N
>   BM_Vector_petscxx_MAXPY_RMS            7 %        7 %
>
> Direct call to MAXPY:
>
>   BM_Vector_PETSc_MAXPY/8              33 ns      33 ns     20973674
>   BM_Vector_PETSc_MAXPY/64            116 ns     116 ns      5992878
>   BM_Vector_PETSc_MAXPY/512           731 ns     731 ns       963340
>   BM_Vector_PETSc_MAXPY/4096         5739 ns    5739 ns       122414
>   BM_Vector_PETSc_MAXPY/32768       46346 ns   46346 ns        15312
>   BM_Vector_PETSc_MAXPY_BigO         1.41 N     1.41 N
>   BM_Vector_PETSc_MAXPY_RMS              0 %        0 %
>
> And 3x speedup on 2 MPI ranks (not much communication here, anyway). I am
> now convinced that this warrants some further investigation/testing.

1) There is no communication here, so 3x for 2 ranks means the result is
not believable.

2) I do not understand what the two cases are. Do you mean that expression
templates vectorize the scalar work? I am surprised that BLAS is so bad,
but I guess not completely surprised. What BLAS?

  Matt

> On Tue, 4 Apr 2017 at 01:08 Jed Brown <j...@jedbrown.org> wrote:
>
>> Matthew Knepley <knep...@gmail.com> writes:
>>
>> >> BLAS. (Here an interesting point opens: I assume an efficient BLAS
>> >> implementation, but I am not so sure about how the different BLAS do
>> >> things internally. I work from the assumption that we have a very
>> >> well-tuned BLAS implementation at our disposal.)
>> >
>> > The speed improvement comes from pulling vectors through memory fewer
>> > times by merging operations (kernel fusion).
>>
>> Typical examples are VecMAXPY and VecMDot, but note that these are not
>> xGEMV because the vectors are independent arrays rather than single
>> arrays with a constant leading dimension.
>>
>> >> call VecGetArray. However, I will inevitably forget to return the
>> >> array to PETSc. I could have my new VecArray return an object that
>> >> restores the array when it goes out of scope. I can also flag the
>> >> function with [[nodiscard]] to prevent the user from discarding the
>> >> returned object right away.
>> >
>> > Jed claims that this pattern is no longer preferred, but I have
>> > forgotten his argument. Jed?
>>
>> Destruction order matters and needs to be collective. If an error
>> condition causes destruction to occur in a different order on different
>> processes, you can get deadlock. I would much rather have an error
>> leave some resources (for the OS to collect) than escalate into
>> deadlock.
>>
>> > We have had this discussion for years on this list. Having separate
>> > names for each type is really ugly and does not achieve what we want.
>> > We want smooth interoperability between objects with different backing
>> > types, but it is still not clear how to do this.
>>
>> Hide it internally and implicitly promote. Only the *GetArray functions
>> need to be parametrized on numeric type. But it's a lot of work on the
>> backend.
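
To make the memory-traffic point concrete, here is a rough model of the
difference on raw arrays (the names and loop structure are mine, not
petscxx or PETSc source): the unfused version makes one full pass over y
per term, while the fused version, which is what VecMAXPY or an
expression-template evaluator can generate, makes a single pass.

    #include <cstddef>

    // Unfused: n separate AXPY-style passes; y is read and written n times.
    void maxpy_unfused(double *y, std::size_t len, std::size_t n,
                       const double *a, const double *const *x)
    {
      for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < len; ++i)
          y[i] += a[k] * x[k][i];
    }

    // Fused: a single pass over y. The x[k] remain independent arrays,
    // which is why this is VecMAXPY territory rather than xGEMV.
    void maxpy_fused(double *y, std::size_t len, std::size_t n,
                     const double *a, const double *const *x)
    {
      for (std::size_t i = 0; i < len; ++i) {
        double acc = y[i];
        for (std::size_t k = 0; k < n; ++k)
          acc += a[k] * x[k][i];
        y[i] = acc;
      }
    }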
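
And a minimal sketch of the scope-guard idea being discussed, assuming
C++17 for the class-level [[nodiscard]]; only VecGetArray/VecRestoreArray
are real PETSc calls, the wrapper and its name are hypothetical and error
codes are ignored for brevity:

    #include <petscvec.h>

    class [[nodiscard]] VecArrayGuard {
    public:
      explicit VecArrayGuard(Vec v) : v_(v) { VecGetArray(v_, &a_); }
      ~VecArrayGuard() { VecRestoreArray(v_, &a_); }  // array handed back here
      VecArrayGuard(const VecArrayGuard &) = delete;
      VecArrayGuard &operator=(const VecArrayGuard &) = delete;
      PetscScalar *data() const { return a_; }

    private:
      Vec          v_;
      PetscScalar *a_ = nullptr;
    };

    // Usage:
    //   {
    //     VecArrayGuard ya(y);
    //     ya.data()[0] = 1.0;
    //   }  // VecRestoreArray runs automatically at end of scope

Jed's objection applies to any destructor that does collective work
(VecDestroy, MPI calls, ...): if an error unwinds scopes in a different
order on different ranks, those collective destructors can deadlock, which
is the argument for letting an error leak resources instead.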
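
On the numeric-type question, a toy picture of what "hide it internally
and parametrize only *GetArray" could look like; VecStorage and
get_array<T> are entirely made up, this is just the shape of the idea, not
a PETSc interface:

    #include <variant>
    #include <vector>

    // Storage precision is an internal detail of the object ...
    struct VecStorage {
      std::variant<std::vector<float>, std::vector<double>> data;
    };

    // ... and only the accessor is parametrized on the numeric type the
    // caller wants, promoting (converting on copy) when the backing type
    // differs.
    template <typename T>
    std::vector<T> get_array(const VecStorage &v)
    {
      return std::visit([](const auto &d) {
        return std::vector<T>(d.begin(), d.end());
      }, v.data);
    }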
--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener