Riccardo,

> I have been thinking about your proposal, and I believe it would work for
> me. I have a few comments:
> 1 - I would also keep the default constructor for the value (that would
> allow emulating "private" and not just "firstprivate").

Makes sense. Note, however, that plain 'private' is easier to emulate with a 
local variable inside the lambda.

Well, except when you want to do a reduction afterwards, in which case the 
scheme I proposed is the way to go. See here for a possible implementation 
allowing for reductions: 
https://github.com/STEllAR-GROUP/hpx/blob/master/examples/quickstart/safe_object.cpp
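
The gist is something like the sketch below (the name 'reducible_private' and
the reduce() member are made up here for illustration; see the linked file for
the actual implementation): one instance per kernel-thread, combined once the
parallel work is done:

    #include <hpx/hpx.hpp>
    #include <numeric>
    #include <vector>

    // Sketch only: holds one instance of T per kernel-thread and combines
    // all per-thread instances after the parallel loop has finished.
    template <typename T>
    struct reducible_private
    {
        explicit reducible_private(T const& init = T())
          : data_(hpx::get_os_thread_count(), init)
        {
        }

        T& access()
        {
            // index of the worker thread executing this call
            return data_[hpx::get_worker_thread_num()];
        }

        template <typename BinaryOp>
        T reduce(T init, BinaryOp op) const
        {
            return std::accumulate(data_.begin(), data_.end(), init, op);
        }

    private:
        std::vector<T> data_;
    };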

> 2 - shouldn't the execution policy
> (sequential_execution_policy/parallel_execution_policy/...) be passed to
> this threadprivate emulator? I guess in some cases it may be convenient,
> for example because one could specialize the "sequential" case to have
> no overhead.

Yes, sure. I made the code up just to answer your question.

> 3 - a question more than a comment: isn't this doing the work of the
> "thread_local" keyword? I guess current compiler limitations do not allow
> that ... so take this as a forward-looking question.
> Thanks again for your time and patience.

Using thread_local implies creating a copy of the variable for _every_ 
(kernel-)thread. The scheme I proposed does so only for the relevant threads. 
Also, I'm simply not sure how the initialization of thread_local variables 
would work in your case. If you gain any insights, I'd love to hear about 
them, though.
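
Just for illustration, here is a sketch of what it might look like, assuming
full compiler support (untested on my end; note that the copy exists for every
kernel-thread in the process, not just the ones running the loop):

    #include <hpx/hpx.hpp>
    #include <vector>

    // one copy per kernel-thread in the whole process, whether or not
    // that thread ever participates in the parallel loop
    thread_local std::vector<double> scratch(1000);

    void run(int N)
    {
        hpx::parallel::for_loop(hpx::parallel::par, 0, N,
            [](int i)
            {
                scratch[0] = double(i);   // the calling thread's copy
            });
    }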

Thanks!
Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

> Riccardo
> 
> 
> 
> On Mon, Sep 12, 2016 at 5:18 PM, Riccardo Rossi <rro...@cimne.upc.edu>
> wrote:
> Ok,
>        I think that with your proposal it should work (thanks).
> Regarding allocation, the thing is that having the allocator know about
> the run policy (or the policy know about the allocator) could allow you to
> do smart things concerning what gets allocated.
> The point is that when you read data in a finite element program, even if
> you allocate first-touch, you have no way to ensure that the data will
> later be accessed in the same order.
> Having the allocator & the policy persist through the whole analysis
> gives a good way to solve this problem.
> Anyway... thank you very much for your time!
> regards
> Riccardo
> 
> On Mon, Sep 12, 2016 at 2:17 PM, Hartmut Kaiser <hartmut.kai...@gmail.com>
> wrote:
> 
> > To my understanding, an OpenMP for loop is expanded to something like:
> >
> >     std::vector<data_type> private_data(nthreads);
> >     // copy data into the private data array, once per thread
> >     for (int block = 0; block < nblocks; ++block)
> >     {
> >         for (/* begin, end, ... */)
> >         {
> >             // capture private_data[my_thread_id]
> >             // ... do work using the captured data
> >         }
> >     }
> >
> > This way the copying is done once per thread, not once per call nor once
> > per block.
> > Of course I could emulate this if I had access to a function like
> > omp_get_thread_num() giving me the id of the current worker (I would
> > also need to know the total number of workers to size the private_data
> > array). Is this information available?
> > Please do note that I am just a user, so my understanding of the spec may
> > be faulty. My apologies if that is the case.
> 
I think your understanding of OpenMP firstprivate is correct. Also, you're
right: the solution I gave will create one copy of the lambda per
iteration-partition.

In order to have exactly one copy per kernel-thread you'd need to create a
helper class which allocates the per-thread data, e.g. something like:
> 
>     #include <hpx/hpx.hpp>
>     #include <vector>
> 
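>     // holds one instance of T per kernel-thread, initialized from 'init';
>     // each worker thread accesses its own copy via access()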
>     template <typename T>
>     struct firstprivate_emulation
>     {
>         explicit firstprivate_emulation(T const& init)
>           : data_(hpx::get_os_thread_count(), init)
>         {
>         }
> 
>         T& access()
>         {
>             std::size_t idx = hpx::get_worker_thread_num();
>             HPX_ASSERT(idx < hpx::get_os_thread_count());
>             return data_[idx];
>         }
> 
>         T const& access() const
>         {
>             std::size_t idx = hpx::get_worker_thread_num();
>             HPX_ASSERT(idx < hpx::get_os_thread_count());
>             return data_[idx];
>         }
> 
>     private:
>         std::vector<T> data_;
>     };
> 
>     Matrix expensive_to_construct_scratchspace;
>     firstprivate_emulation<Matrix> data(expensive_to_construct_scratchspace);
> 
>     for_loop(par, 0, N,
>         [&](int i)
>         {
>             // use 'data' to access this thread's copy of the outer Matrix
>             Matrix& m = data.access();
>             // ... do the work for iteration 'i' using 'm'
>         });
> 
> > Btw, I very much like your idea of allocators for NUMA locality. That's
> > a vast improvement over first touch, where you never really know who
> > the owner is!!
> 
> Heh, even if it uses first touch internally itself? :-)
> 
> HTH
> Regards Hartmut
> ---------------
> http://boost-spirit.com
> http://stellar.cct.lsu.edu
> 
> 
> > Regards
> > Riccardo
> >
> > On 11 Sep 2016 7:23 p.m., "Hartmut Kaiser" <hartmut.kai...@gmail.com>
> > wrote:
> >
> > > First of all, thank you very much for your quick and detailed answer.
> > > Nevertheless, I think I did not explain my concern.
> > > Using your code snippet, imagine I have:
> > >
> > >
> > >     int nelements = 42;
> > >     Matrix expensive_to_construct_scratchspace;
> > >
> > >     for_loop(par, 0, N,
> > >         [nelements, expensive_to_construct_scratchspace](int i)
> > >         {
> > >             // the captured 'nelements' is initialized from the outer
> > >             // variable and each copy of the lambda has its own
> > >             // private copy
> > >
> > > HERE, as I understand it, the lambda would capture my
> > > "expensive_to_construct_scratchspace" by value, which implies that I
> > > would have one allocation for every "i". --> Are you telling me that
> > > this is not the case? If so, that would be a problem, since
> > > constructing it would be very expensive.
> >
> > No, that would be the case; your analysis is correct.
> >
> > > On the contrary, if the lambda does not copy by value ... what if I do
> > > need that behaviour?
> > >
> > > Note that I could definitely construct a blocked range of iterators and
> > > define a lambda acting on a given range of iterators; however, that
> > > would be very, very verbose...
> >
> > Looks like I misunderstood what firstprivate actually does...
> >
> > OTOH, in the OpenMP spec I read:
> >
> >     firstprivate: Specifies that each thread should have its own
> >     instance of a variable, and that the variable should be initialized
> >     with the value of the variable, because it exists before the
> >     parallel construct.
> >
> > So each thread gets its own copy, which implies copying/allocation. What
> > am I missing?
> >
> > If, however, you want to share the variable between threads, just
> > capture it by reference:
> >
> >     Matrix expensive_to_construct_scratchspace;
> >     for_loop(par, 0, N,
> >         [&expensive_to_construct_scratchspace](int i)
> >         {
> >             // operate on the shared variable here
> >         });
> >
> > In this case you'd be responsible for making any operations on the
> > shared variable thread-safe, however.
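> >
> > For instance (just a sketch, serializing accesses with a plain
> > std::mutex; HPX has its own synchronization primitives as well):
> >
> >     #include <mutex>
> >
> >     Matrix expensive_to_construct_scratchspace;
> >     std::mutex mtx;
> >
> >     for_loop(par, 0, N,
> >         [&](int i)
> >         {
> >             std::lock_guard<std::mutex> lock(mtx);
> >             // safely operate on the shared scratch space here
> >         });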
> >
> > Is that what you need?
> >
> > Regards Hartmut
> > ---------------
> > http://boost-spirit.com
> > http://stellar.cct.lsu.edu
> >
> >
> > >
> > >
> > > Anyway,
> > > thanks again for your attention.
> > > Riccardo
> > >
> > >
> > > On Sun, Sep 11, 2016 at 4:48 PM, Hartmut Kaiser
> > <hartmut.kai...@gmail.com>
> > > wrote:
> > > Riccardo,
> > >
> > > >         I am writing since I am an OpenMP user, but I am actually
> > > > quite curious about the future directions of C++.
> > > >
> > > > My parallel usage is actually relatively trivial and is covered by
> > > > OpenMP 2.5 (OpenMP 3.1, with support for iterators, would be better
> > > > but is not available in MSVC).
> > > > 99% of my needs are about parallel loops, and with C++11 lambdas I
> > > > could do a lot.
> > >
> > > Right. It is a fairly simple transformation to turn an OpenMP
> > > parallel loop into the equivalent parallel algorithm. We specifically
> > > added parallel::for_loop() (not in the Parallelism TS/C++17) to
> > > support that migration:
> > >
> > >     #pragma omp parallel for
> > >     for(int i = 0; i != N; ++i)
> > >     {
> > >         // some iteration
> > >     }
> > >
> > > Would be equivalent to
> > >
> > >     hpx::parallel::for_loop(
> > >         hpx::parallel::par,
> > >         0, N, [](int i)
> > >         {
> > >             // some iteration
> > >         });
> > >
> > > (for more information about for_loop() see here:
> > > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0075r0.pdf)
> > >
> > > > However, I am really not clear on how I should equivalently handle
> > > > "private" and "firstprivate" of OpenMP, which allow one to create
> > > > objects that persist in threadprivate memory for the whole length of
> > > > a for loop.
> > > > I now use OpenMP 2.5 and I have code that looks like the following:
> > > >
> > > > https://kratos.cimne.upc.es/projects/kratos/repository/entry/kratos/kratos/solving_strategies/builder_and_solvers/residualbased_block_builder_and_solver.h
> > > >
> > > > which does an OpenMP-parallel finite element assembly.
> > > > The code I am thinking of is something like:
> > >
> > > [snipped code]
> > >
> > > > The big question is ... how shall I handle the threadprivate
> > > > scratchspace in HPX?? Lambdas do not allow this ...
> > > > That is, what is the equivalent of private & firstprivate??
> > > > Thank you in advance for any clarification or pointers to examples.
> > >
> > > For 'firstprivate' you can simply use lambda captures:
> > >
> > >     int nelements = 42;
> > >
> > >     for_loop(par, 0, N,
> > >         [nelements](int i)
> > >         {
> > >             // the captured 'nelements' is initialized from the outer
> > >             // variable and each copy of the lambda has its own
> > >             // private copy
> > >             //
> > >             // use the private 'nelements' here:
> > >             std::cout << nelements << std::endl;
> > >         });
> > >
> > > Note that 'nelements' will be const by default. If you want to modify
> > > its value, the lambda has to be made mutable:
> > >
> > >     int nelements = 42;
> > >
> > >     for_loop(par, 0, N,
> > >         [nelements](int i) mutable // makes captures non-const
> > >         {
> > >             ++nelements;
> > >         });
> > >
> > > Please don't be fooled into thinking that this gives you one variable
> > > instance per iteration, however. HPX runs several iterations 'in one
> > > go' (depending on the partitioning, very much like OpenMP), so you
> > > will create one variable instance per created partition. As long as
> > > you don't modify the variable this shouldn't make a difference,
> > > though.
> > >
> > > Emulating 'private' is even simpler. All you need is a local variable
> > > for each iteration, after all. Thus simply creating it on the stack
> > > inside the lambda is the solution:
> > >
> > >     for_loop(par, 0, N, [](int i)
> > >     {
> > >         // create 'private' variable
> > >         int my_private = 0;
> > >         // ...
> > >     });
> > >
> > > This also gives you a hint on how you can have one instance of your
> > > variable per iteration and still initialize it as if it were
> > > firstprivate:
> > >
> > >     int nelements = 42;
> > >     for_loop(par, 0, N, [nelements](int i)
> > >     {
> > >         // create 'private' variable
> > >         int my_private = nelements;
> > >         // ...
> > >         ++my_private;   // modifies instance for this iteration only.
> > >     });
> > >
> > > Things become a bit more interesting if you need reductions. Please
> > > see the linked document above for more details, but here is a simple
> > > example (taken from that paper):
> > >
> > >     float dot_saxpy(int n, float a, float x[], float y[])
> > >     {
> > >         float s = 0;
> > >         for_loop(par, 0, n,
> > >             reduction(s, 0.0f, std::plus<float>()),
> > >             [&](int i, float& s_)
> > >             {
> > >                 y[i] += a*x[i];
> > >                 s_ += y[i]*y[i];
> > >             });
> > >         return s;
> > >     }
> > >
> > > Here 's' is the reduction variable, and 's_' is the thread-local
> > > reference to it.
> > >
> > > HTH
> > > Regards Hartmut
> > > ---------------
> > > http://boost-spirit.com
> > > http://stellar.cct.lsu.edu
> > >
> > >
> > >
> > >
> > > --
> > > Riccardo Rossi
> > > PhD, Civil Engineer
> > >
> > > member of the Kratos Team: www.cimne.com/kratos
> > > Tenure Track Lecturer at Universitat Politècnica de Catalunya,
> > > BarcelonaTech (UPC)
> > > Full Research Professor at International Center for Numerical Methods
> in
> > > Engineering (CIMNE)
> > >
> > > C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
> > > 08034 – Barcelona – Spain – www.cimne.com  -
> > > T.(+34) 93 401 56 96 skype: rougered4

_______________________________________________
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
