Riccardo,

All of your explanations below make total sense; I concur.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu


> -----Original Message-----
> From: rouger...@gmail.com [mailto:rouger...@gmail.com] On Behalf Of
> Riccardo Rossi
> Sent: Saturday, September 17, 2016 4:00 PM
> To: Hartmut Kaiser <hartmut.kai...@gmail.com>
> Cc: hpx-users@stellar.cct.lsu.edu
> Subject: RE: [hpx-users] equivalent of firstprivate
> 
> Hi again
> On 16 Sep 2016 3:27 p.m., "Hartmut Kaiser" <hartmut.kai...@gmail.com>
> wrote:
> >
> > Riccardo,
> >
> > > I have been thinking about your proposal, and I believe it would
> > > work for me.
> > > I have a few comments:
> > > 1 - I would also leave the default constructor for the value (that
> > > would allow emulating "private" and not just "firstprivate")
> >
> > Makes sense. However, just 'private' can be done more easily by having
> > a local variable inside the lambda.
> Let me disagree on this. Passing by value should be avoided as much as
> possible unless the allocation of the variable is very cheap (I only have
> experience with OpenMP, of course, but the underlying allocator is the
> same). The typical use of private, at least for us, is to provide a
> thread-local scratchspace: we pass it as private and resize it the first
> time a thread (I mean a hardware thread, not one of the lightweight
> threads that call the lambda) needs it. One of the nice side effects of
> this approach is that it should be easy to do this in a NUMA-friendly way.
> Your proposal also allows achieving this in a very elegant way by having a
> default constructor for the TLS var. Note also that to have the right
> allocator one should have the policy at hand during construction.
> Indeed, I appreciate that this is a bit in contrast with your thread
> concept, which as I understand it is closer to that of a work item on a
> GPU. I hope that the usage I have in mind is not incompatible. However, I
> really don't know what happens with TLS vars when futures come into
> play... that might indeed lead to nasty troubles.
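> To make the scratchspace pattern concrete, this is roughly what we do
> today with OpenMP (just a sketch; 'Matrix', its size()/resize() members,
> 'required_size', 'nelements' and 'process_element' are placeholders):
>
>     Matrix scratch;                            // cheap default construction
>     #pragma omp parallel for private(scratch)  // one copy per OS thread
>     for (int i = 0; i < nelements; ++i)
>     {
>         if (scratch.size() != required_size)   // sized on first use only
>             scratch.resize(required_size);
>         process_element(i, scratch);           // reused across iterations
>     }
>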
> Regarding reductions, your proposal looks fine to me.
> Regarding thread_local, I did some reading and found this interesting
> thread:
> http://stackoverflow.com/questions/22794382/are-c11-thread-local-variables-automatically-static
> It says that initialization is guaranteed to be thread-safe, which at
> least is a good start.
> However, the post also says that a variable declared thread_local is
> static (local to the thread), with a lifespan tied to the lifespan of the
> hardware thread it lives on. (Apparently the standard speaks of thread
> storage duration.)
> Since I guess you keep a thread pool active for the whole length of the
> program, I understand this would imply that any thread_local var would
> stay alive for the whole run of the program, which is definitely an
> unwanted side effect.
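> Just to illustrate the point with plain std::thread (a minimal sketch,
> not HPX-specific):
>
>     #include <iostream>
>     #include <thread>
>
>     // 'counter' has thread storage duration: one instance per OS thread,
>     // initialized (thread-safely) on first use, alive until that thread exits
>     int bump()
>     {
>         thread_local int counter = 0;
>         return ++counter;
>     }
>
>     int main()
>     {
>         bump();
>         bump();
>         std::cout << "main thread: " << bump() << '\n';       // prints 3
>
>         std::thread t([] {
>             std::cout << "other thread: " << bump() << '\n';  // prints 1
>         });
>         t.join();
>     }
>
> On a pooled worker thread that never exits, 'counter' would keep
> accumulating across every task scheduled onto that thread.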
> I guess this proves that my third question was indeed stupid, so sorry for
> the noise
> Regards
> Riccardo
> >
> > Thanks!
> > Regards Hartmut
> > ---------------
> > http://boost-spirit.com
> > http://stellar.cct.lsu.edu
> >
> > > Riccardo
> > >
> > >
> > >
> > > On Mon, Sep 12, 2016 at 5:18 PM, Riccardo Rossi <rro...@cimne.upc.edu>
> > > wrote:
> > > Ok,
> > >        I think that with your proposal it should work (thanks).
> > > Regarding allocation, the point is that having the allocator know
> > > about the run policy (or the policy know about the allocator) could
> > > allow you to do smart things concerning where data gets allocated.
> > > The issue is that when you read data in a finite element program,
> > > even if you allocate first-touch, you have no way to ensure that the
> > > data will later be used in the same order.
> > > Having the allocator and the policy persist through the whole
> > > analysis gives a good way to solve this problem.
> > > Anyway... thank you very much for your time!
> > > regards
> > > Riccardo
> > >
> > > On Mon, Sep 12, 2016 at 2:17 PM, Hartmut Kaiser
> <hartmut.kai...@gmail.com>
> > > wrote:
> > >
> > > > To my understanding an OpenMP for loop is expanded to something like
> > > >
> > > >     vector<data_type> private_data(nthreads);
> > > >     // copy data to the private data array, once per thread
> > > >     for (int block_counter = 0; block_counter < blocks; ++block_counter)
> > > >     {
> > > >         for (begin, end, ...)
> > > >         {
> > > >             // capture private_data[my_thread_id]
> > > >             // ... do work using the captured data
> > > >         }
> > > >     }
> > > >
> > > > This way the copying is done once per thread, not once per call nor
> > > > once per block.
> > > > Of course I could emulate this if I had access to a function like
> > > > omp_get_thread_num(), giving me the id of the current worker (I
> > > > should also know the total number of workers to size the
> > > > private_data array). Is this data available?
> > > > Please do note that I am just a user, so my understanding of the
> > > > specs may be faulty. My apologies if that's the case.
> > >
> > > I think your understanding of OpenMP firstprivate is correct. Also,
> > > you're right, the solution I gave will create one copy of the lambda
> > > per iteration-partition.
> > >
> > > In order to have exactly one copy per kernel-thread you'd need to
> > > create a helper class which allocates the per-thread data, e.g.
> > > something like:
> > >
> > >     #include <hpx/hpx.hpp>
> > >     #include <vector>
> > >
> > >     template <typename T>
> > >     struct firstprivate_emulation
> > >     {
> > >         explicit firstprivate_emulation(T const& init)
> > >           : data_(hpx::get_os_thread_count(), init)
> > >         {
> > >         }
> > >
> > >         T& access()
> > >         {
> > >             std::size_t idx = hpx::get_worker_thread_num();
> > >             HPX_ASSERT(idx < hpx::get_os_thread_count());
> > >             return data_[idx];
> > >         }
> > >
> > >         T const& access() const
> > >         {
> > >             std::size_t idx = hpx::get_worker_thread_num();
> > >             HPX_ASSERT(idx < hpx::get_os_thread_count());
> > >             return data_[idx];
> > >         }
> > >
> > >     private:
> > >         std::vector<T> data_;
> > >     };
> > >
> > >     Matrix expensive_to_construct_scratchspace;
> > >     firstprivate_emulation<Matrix> data(expensive_to_construct_scratchspace);
> > >
> > >     for_each(par, 0, N,
> > >         [&](int i)
> > >         {
> > >             // access 'data' to get the thread-local copy of the
> > >             // outer Matrix
> > >             Matrix& m = data.access();
> > >             m[i][j] = ...
> > >         });
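> > > A variant of the same idea, closer to the threadprivate scratchspace
> > > you asked about (a sketch only; 'Matrix', 'required_size' and
> > > 'process_element' are placeholders, and Matrix is assumed to be
> > > default constructible and cheap to default construct):
> > >
> > >     // per-worker scratchspace, initialized from an empty Matrix
> > >     firstprivate_emulation<Matrix> scratch(Matrix{});
> > >
> > >     hpx::parallel::for_loop(hpx::parallel::par, 0, N,
> > >         [&](int i)
> > >         {
> > >             Matrix& m = scratch.access();     // this worker's copy
> > >             if (m.size() != required_size)    // sized on first use only
> > >                 m.resize(required_size);
> > >             process_element(i, m);            // reused across iterations
> > >         });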
> > >
> > > > Btw, I really like your idea of allocators for NUMA locality. That's
> > > > a vast improvement over first touch, where you never really know
> > > > who's the owner!!
> > >
> > > Heh, even if it uses first touch internally itself? :-)
> > >
> > > HTH
> > > Regards Hartmut
> > > ---------------
> > > http://boost-spirit.com
> > > http://stellar.cct.lsu.edu
> > >
> > >
> > > > Regards
> > > > Riccardo
> > > >
> > > > On 11 Sep 2016 7:23 p.m., "Hartmut Kaiser"
> <hartmut.kai...@gmail.com>
> > > > wrote:
> > > >
> > > > > First of all, thank you very much for your quick and detailed
> > > > > answer. Nevertheless, I think I did not explain my concern.
> > > > > Using your code snippet, imagine I have
> > > > >
> > > > >
> > > > >     int nelements = 42;
> > > > >     Matrix expensive_to_construct_scratchspace;
> > > > >
> > > > >     for_each(par, 0, N,
> > > > >         [nelements, expensive_to_construct_scratchspace](int i)
> > > > >         {
> > > > >             // the captured 'nelements' is initialized from the
> > > > >             // outer variable and each copy of the lambda has its
> > > > >             // own private copy
> > > > >
> > > > > HERE, as I understand it, the lambda would capture my
> > > > > "expensive_to_construct_scratchspace" by value, which as I
> > > > > understand it implies that I would have one allocation for every
> > > > > "i". --> Are you telling me that this is not the case? If so, that
> > > > > would be a problem, since constructing it would be very expensive.
> > > >
> > > > No, that would be the case, your analysis is correct.
> > > >
> > > > > On the contrary, if the lambda does not copy by value... what if I
> > > > > do need that behaviour?
> > > > >
> > > > > Note that I could definitely construct a blocked range of
> > > > > iterators and define a lambda acting on a given range of
> > > > > iterators, however that would be very, very verbose...
> > > >
> > > > Looks like I misunderstood what firstprivate actually does...
> > > >
> > > > OTOH, in the OpenMP spec I read:
> > > >
> > > >     firstprivate: Specifies that each thread should have its own
> > > >     instance of a variable, and that the variable should be
> > > >     initialized with the value of the variable, because it exists
> > > >     before the parallel construct.
> > > >
> > > > So each thread gets its own copy, which implies copying/allocation.
> > > > What am I missing?
> > > >
> > > > If, however, you want to share the variable between threads, just
> > > > capture it by reference:
> > > >
> > > >     Matrix expensive_to_construct_scratchspace;
> > > >     for_each(par, 0, N,
> > > >         [&expensive_to_construct_scratchspace](int i)
> > > >         {
> > > >         });
> > > >
> > > > In this case you'd be responsible for making any operations on the
> > > > shared variable thread-safe, however.
> > > >
> > > > Is that what you need?
> > > >
> > > > Regards Hartmut
> > > > ---------------
> > > > http://boost-spirit.com
> > > > http://stellar.cct.lsu.edu
> > > >
> > > >
> > > > >
> > > > >
> > > > > anyway,
> > > > > thanks again for your attention
> > > > > Riccardo
> > > > >
> > > > >
> > > > > On Sun, Sep 11, 2016 at 4:48 PM, Hartmut Kaiser
> > > > <hartmut.kai...@gmail.com>
> > > > > wrote:
> > > > > Riccardo,
> > > > >
> > > > > >         I am writing since I am an OpenMP user, but I am
> > > > > > actually quite curious to understand the future directions of
> > > > > > C++.
> > > > > >
> > > > > > My parallel usage is actually relatively trivial, and is covered
> > > > > > by OpenMP 2.5 (OpenMP 3.1 with support for iterators would be
> > > > > > better, but it is not available in MSVC).
> > > > > > 99% of my user needs are about parallel loops, and with C++11
> > > > > > lambdas I could do a lot.
> > > > >
> > > > > Right. It is a fairly simple transformation to turn an OpenMP
> > > > > parallel loop into the equivalent parallel algorithm. We
> > > > > specifically added parallel::for_loop() (not in the Parallelism
> > > > > TS/C++17) to support that migration:
> > > > >
> > > > >     #pragma omp parallel for
> > > > >     for(int i = 0; i != N; ++i)
> > > > >     {
> > > > >         // some iteration
> > > > >     }
> > > > >
> > > > > Would be equivalent to
> > > > >
> > > > >     hpx::parallel::for_loop(
> > > > >         hpx::parallel::par,
> > > > >         0, N, [](int i)
> > > > >         {
> > > > >             // some iteration
> > > > >         });
> > > > >
> > > > > (for more information about for_loop() see here:
> > > > > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0075r0.pdf)
> > > > >
> > > > > > However, I am really not clear on how I should equivalently
> > > > > > handle "private" and "firstprivate" of OpenMP, which allow
> > > > > > creating objects that persist in threadprivate memory for the
> > > > > > whole length of a for loop.
> > > > > > I now use OpenMP 2.5 and I have code that looks like the
> > > > > > following:
> > > > > >
> > > > > > https://kratos.cimne.upc.es/projects/kratos/repository/entry/kratos/kratos/solving_strategies/builder_and_solvers/residualbased_block_builder_and_solver.h
> > > > > >
> > > > > > which does an OpenMP-parallel finite element assembly.
> > > > > > The code I am thinking of is something like:
> > > > >
> > > > > [snipped code]
> > > > >
> > > > > > The big question is... how shall I handle the threadprivate
> > > > > > scratchspace in HPX?? Lambdas do not allow doing this...
> > > > > > That is, what is the equivalent of private & firstprivate??
> > > > > > Thank you in advance for any clarification or pointer to
> > > > > > examples.
> > > > >
> > > > > For 'firstprivate' you can simply use lambda captures:
> > > > >
> > > > >     int nelements = 42;
> > > > >
> > > > >     for_each(par, 0, N,
> > > > >         [nelements](int i)
> > > > >         {
> > > > >             // the captured 'nelements' is initialized from the
> > > > >             // outer variable and each copy of the lambda has its
> > > > >             // own private copy
> > > > >             //
> > > > >             // use private 'nelements' here:
> > > > >             cout << nelements << endl;
> > > > >         });
> > > > >
> > > > > Note that 'nelements' will be const by default. If you want to
> > > > > modify its value, the lambda has to be made mutable:
> > > > >
> > > > >     int nelements = 42;
> > > > >
> > > > >     for_each(par, 0, N,
> > > > >         [nelements](int i) mutable // makes captures non-const
> > > > >         {
> > > > >             ++nelements;
> > > > >         });
> > > > >
> > > > > Please don't be fooled, however, into thinking that this gives
> > > > > you one variable instance per iteration. HPX runs several
> > > > > iterations 'in one go' (depending on the partitioning, very much
> > > > > like OpenMP), so you will create one variable instance per created
> > > > > partition. As long as you don't modify the variable this shouldn't
> > > > > make a difference, however.
> > > > >
> > > > > Emulating 'private' is even simpler. All you need is a local
> > > > > variable for each iteration, after all. Thus simply creating it on
> > > > > the stack inside the lambda is the solution:
> > > > >
> > > > >     for_loop(par, 0, N, [](int i)
> > > > >     {
> > > > >         // create 'private' variable
> > > > >         int my_private = 0;
> > > > >         // ...
> > > > >     });
> > > > >
> > > > > This also gives you a hint on how you can have one instance of
> > > > > your variable per iteration and still initialize it as if it were
> > > > > firstprivate:
> > > > >
> > > > >     int nelements = 42;
> > > > >     for_loop(par, 0, N, [nelements](int i)
> > > > >     {
> > > > >         // create 'private' variable
> > > > >         int my_private = nelements;
> > > > >         // ...
> > > > >         ++my_private;   // modifies instance for this iteration only.
> > > > >     });
> > > > >
> > > > > Things become a bit more interesting if you need reductions.
> > > > > Please see the linked document above for more details, but here is
> > > > > a simple example (taken from that paper):
> > > > >
> > > > >     float dot_saxpy(int n, float a, float x[], float y[])
> > > > >     {
> > > > >         float s = 0;
> > > > >         for_loop(par, 0, n,
> > > > >             reduction(s, 0.0f, std::plus<float>()),
> > > > >             [&](int i, float& s_)
> > > > >             {
> > > > >                 y[i] += a*x[i];
> > > > >                 s_ += y[i]*y[i];
> > > > >             });
> > > > >         return s;
> > > > >     }
> > > > >
> > > > > Here 's' is the reduction variable, and s_ is the thread-local
> > > > > reference to it.
> > > > >
> > > > > HTH
> > > > > Regards Hartmut
> > > > > ---------------
> > > > > http://boost-spirit.com
> > > > > http://stellar.cct.lsu.edu
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Riccardo Rossi
> > > > > PhD, Civil Engineer
> > > > >
> > > > > member of the Kratos Team: www.cimne.com/kratos
> > > > > Tenure Track Lecturer at Universitat Politècnica de Catalunya,
> > > > > BarcelonaTech (UPC)
> > > > > Full Research Professor at International Center for Numerical
> > > > > Methods in Engineering (CIMNE)
> > > > >
> > > > > C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
> > > > > 08034 – Barcelona – Spain – www.cimne.com  -
> > > > > T.(+34) 93 401 56 96 skype: rougered4
> > > > >
> > > > >
> > > > >
> > >
> > >
> > >
> > > --
> > > Riccardo Rossi
> > > PhD, Civil Engineer
> > >
> > > member of the Kratos Team: www.cimne.com/kratos
> > > Tenure Track Lecturer at Universitat Politècnica de Catalunya,
> > > BarcelonaTech (UPC)
> > > Full Research Professor at International Center for Numerical
> > > Methods in Engineering (CIMNE)
> > >
> > > C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
> > > 08034 – Barcelona – Spain – www.cimne.com  -
> > > T.(+34) 93 401 56 96 skype: rougered4
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Riccardo Rossi
> > > PhD, Civil Engineer
> > >
> > > member of the Kratos Team: www.cimne.com/kratos
> > > Tenure Track Lecturer at Universitat Politècnica de Catalunya,
> > > BarcelonaTech (UPC)
> > > Full Research Professor at International Center for Numerical
> > > Methods in Engineering (CIMNE)
> > >
> > > C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
> > > 08034 – Barcelona – Spain – www.cimne.com  -
> > > T.(+34) 93 401 56 96 skype: rougered4
> > >
> > >
> > >
> >

_______________________________________________
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
