Hi again

On 16 Sep 2016 3:27 p.m., "Hartmut Kaiser" <hartmut.kai...@gmail.com> wrote:
>
> Riccardo,
>
> > I have been thinking about your proposal, and I believe it would work
> > for me.
> > I have a few comments:
> > 1 - I would also leave the default constructor for the value (that would
> > allow emulating "private" and not just "firstprivate")
>
> Makes sense. However, plain 'private' can be done more easily by having a
> local variable inside the lambda.

Let me disagree on this. Passing by value should be avoided as much as
possible unless the allocation of the variable is very cheap (I only have
experience with OpenMP, of course, but the underlying allocator is the
same). The typical use of private, at least for us, is to provide a
thread-local scratchspace: we pass it as private and resize it the first
time a thread (I mean a hardware thread, not one of the lightweight threads
that call the lambda) needs it. One of the nice side effects of this
approach is that it should be easy to do this in a NUMA-friendly way.
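
To make the idea concrete, here is a rough sketch of what I have in mind
(it leans on the hpx::get_worker_thread_num()/hpx::get_os_thread_count()
functions your helper class below uses; the scratch_pool name and the plain
double buffer are just made up for illustration):

    #include <hpx/hpx.hpp>
    #include <cstddef>
    #include <vector>

    // one scratch buffer per worker (hardware) thread, grown lazily
    struct scratch_pool
    {
        scratch_pool() : buffers_(hpx::get_os_thread_count()) {}

        std::vector<double>& get(std::size_t n)
        {
            // index by the worker thread, not by the lightweight task
            std::vector<double>& buf =
                buffers_[hpx::get_worker_thread_num()];
            if (buf.size() < n)
                buf.resize(n);  // first touch happens on this worker,
                                // so pages should end up NUMA-local to it
            return buf;
        }

    private:
        std::vector<std::vector<double>> buffers_;
    };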

Your proposal also allows achieving this in a very elegant way by having a
default constructor for the TLS var. Note also that to pick the right
allocator one should have the policy at hand during construction.
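
Purely as a sketch of the interface I mean (the 'policy' constructor
parameter and the tls_variable name are hypothetical, not an existing HPX
API; a NUMA-aware allocator could be selected where the comment indicates):

    #include <hpx/hpx.hpp>
    #include <vector>

    // TLS-style variable: one default-constructed slot per worker thread
    template <typename T, typename Policy>
    struct tls_variable
    {
        explicit tls_variable(Policy const& policy)
          : data_(hpx::get_os_thread_count())  // default-constructs each slot
        {
            // having 'policy' at hand here is the point: one could pick
            // an allocator matching the policy's placement of the work
            (void) policy;
        }

        T& get() { return data_[hpx::get_worker_thread_num()]; }

    private:
        std::vector<T> data_;
    };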

I do appreciate that this is a bit at odds with your thread concept, which
as I understand it is closer to that of a GPU work item. I hope the usage I
have in mind is not incompatible with it. However, I really don't know what
happens to TLS vars when futures come into play... that might indeed lead
to nasty trouble.

Regarding reductions, your proposal looks fine to me.

Regarding thread_local, I did some reading and found this interesting thread:

http://stackoverflow.com/questions/22794382/are-c11-thread-local-variables-automatically-static

It says there that initialization is guaranteed to be thread-safe, which is
at least a good start.

However, the post also says that a variable declared thread_local is STATIC
(local to the thread), with a lifespan tied to the lifespan of the hardware
threads it lives on. (Apparently the standard speaks of *thread storage
duration*.)
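
For what it's worth, a minimal standard-C++ illustration of this (nothing
HPX-specific here):

    #include <iostream>
    #include <thread>

    int& counter()
    {
        // initialized once per thread; the standard guarantees the
        // initialization is thread-safe, and the variable lives until
        // its thread exits (thread storage duration)
        thread_local int count = 0;
        return count;
    }

    int main()
    {
        std::thread t([] { ++counter(); });  // touches that thread's copy
        t.join();
        std::cout << counter() << "\n";      // prints 0: main's copy untouched
        return 0;
    }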

Since I guess you keep a thread pool active for the whole length of the
program, I understand this would imply that any thread_local var stays
alive for the program's entire run, which is definitely an unwanted side
effect.

I guess this proves that my third question was indeed stupid, so sorry for
the noise.

Regards
Riccardo

>
> Thanks!
> Regards Hartmut
> ---------------
> http://boost-spirit.com
> http://stellar.cct.lsu.edu
>
> > Riccardo
> >
> >
> >
> > On Mon, Sep 12, 2016 at 5:18 PM, Riccardo Rossi <rro...@cimne.upc.edu>
> > wrote:
> > Ok,
> >        I think that with your proposal it should work (thanks).
> > Regarding allocation, the point is that having the allocator know about
> > the run policy (or the policy know about the allocator) could allow you
> > to do smart things concerning what gets allocated where.
> > The thing is that when you read data in a finite element program, even if
> > you allocate with first touch, you have no way to ensure that at the
> > moment of using the data it will be accessed in the same order.
> > Having the allocator and the policy persist through the whole analysis
> > gives a good way to solve this problem.
> > Anyway... thank you very much for your time!
> > Regards
> > Riccardo
> >
> > On Mon, Sep 12, 2016 at 2:17 PM, Hartmut Kaiser <hartmut.kai...@gmail.com>
> > wrote:
> >
> > > To my understanding an OpenMP for loop is expanded to something like:
> > >
> > >     vector<data_type> private_data(nthreads);
> > >     // here: copy data into the private data array, once per thread
> > >     for (int block_counter = 0; block_counter < blocks; ++block_counter)
> > >     {
> > >         for (begin, end, ...)
> > >         {
> > >             // here: capture private_data[my_thread_id]
> > >             // ... do work using the captured data
> > >         }
> > >     }
> > >
> > > This way the copying is done once per thread, not once per call nor once
> > > per block.
> > > Of course I could emulate this if I had access to a function like
> > > omp_get_thread_num() giving me the id of the current worker (I should
> > > also know the total number of workers to define the private_data array).
> > > Is this data available?
> > > Please do note that I am just a user, so my understanding of the specs
> > > may be faulty. My apologies if that's the case.
> >
> > I think your understanding of OpenMP firstprivate is correct. Also you're
> > right, the solution I gave will create one copy of the lambda per
> > iteration-partition.
> >
> > In order to have exactly one copy per kernel-thread you'd need to
> > create a helper class which allocates the per-thread data. E.g. something
> > like:
> >
> >     #include <hpx/hpx.hpp>
> >     #include <vector>
> >
> >     template <typename T>
> >     struct firstprivate_emulation
> >     {
> >         explicit firstprivate_emulation(T const& init)
> >           : data_(hpx::get_os_thread_count(), init)
> >         {
> >         }
> >
> >         T& access()
> >         {
> >             std::size_t idx = hpx::get_worker_thread_num();
> >             HPX_ASSERT(idx < hpx::get_os_thread_count());
> >             return data_[idx];
> >         }
> >
> >         T const& access() const
> >         {
> >             std::size_t idx = hpx::get_worker_thread_num();
> >             HPX_ASSERT(idx < hpx::get_os_thread_count());
> >             return data_[idx];
> >         }
> >
> >     private:
> >         std::vector<T> data_;
> >     };
> >
> >     Matrix expensive_to_construct_scratchspace;
> >     firstprivate_emulation<Matrix> data(expensive_to_construct_scratchspace);
> >     for_each(par, 0, N,
> >         [&](int i)
> >         {
> >             // access 'data' to get the thread-local copy of the outer Matrix
> >             Matrix& m = data.access();
> >             m[i][j] = ...
> >         });
> >
> > > Btw, I much like your idea of allocators for NUMA locality. That's a
> > > vast improvement over first touch, where you never really know who's
> > > the owner!!
> >
> > Heh, even if it uses first touch internally itself? :-)
> >
> > HTH
> > Regards Hartmut
> > ---------------
> > http://boost-spirit.com
> > http://stellar.cct.lsu.edu
> >
> >
> > > Regards
> > > Riccardo
> > >
> > > On 11 Sep 2016 7:23 p.m., "Hartmut Kaiser" <hartmut.kai...@gmail.com>
> > > wrote:
> > >
> > > > First of all, thank you very much for your quick and detailed answer.
> > > > Nevertheless I think I did not explain my concern.
> > > > Using your code snippet, imagine I have
> > > >
> > > >
> > > >     int nelements = 42;
> > > >     Matrix expensive_to_construct_scratchspace;
> > > >
> > > >     for_each(par, 0, N,
> > > >         [nelements, expensive_to_construct_scratchspace](int i)
> > > >         {
> > > >             // the captured 'nelements' is initialized from the outer
> > > >             // variable and each copy of the lambda has its own private
> > > >             // copy
> > > >
> > > > HERE, as I understand it, the lambda would capture my
> > > > "expensive_to_construct_scratchspace" by value, which as I understand
> > > > it implies that I would have one allocation per every "i". --> Are
> > > > you telling me that this is not the case? If so, that would be a
> > > > problem, since constructing it would be very expensive.
> > >
> > > No, that would be the case; your analysis is correct.
> > >
> > > > On the contrary, if the lambda does not copy by value... what if I do
> > > > need that behaviour?
> > > >
> > > > Note that I could definitely construct a blocked range of iterators
> > > > and define a lambda acting on a given range of iterators, however
> > > > that would be very, very verbose...
> > >
> > > Looks like I misunderstood what firstprivate actually does...
> > >
> > > OTOH, in the OpenMP spec I read:
> > >
> > >     firstprivate: Specifies that each thread should have its own
> > >     instance of a variable, and that the variable should be initialized
> > >     with the value of the variable, because it exists before the
> > >     parallel construct.
> > >
> > > So each thread gets its own copy, which implies copying/allocation.
> > > What do I miss?
> > >
> > > If, however, you want to share the variable between threads, just
> > > capture it by reference:
> > >
> > >     Matrix expensive_to_construct_scratchspace;
> > >     for_each(par, 0, N,
> > >         [&expensive_to_construct_scratchspace](int i)
> > >         {
> > >         });
> > >
> > > In this case you'd be responsible for making any operations on the
> > > shared variable thread-safe, however.
> > >
> > > Is that what you need?
> > >
> > > Regards Hartmut
> > > ---------------
> > > http://boost-spirit.com
> > > http://stellar.cct.lsu.edu
> > >
> > >
> > > >
> > > >
> > > > anyway,
> > > > thanks again for your attention
> > > > Riccardo
> > > >
> > > >
> > > > On Sun, Sep 11, 2016 at 4:48 PM, Hartmut Kaiser
> > > > <hartmut.kai...@gmail.com> wrote:
> > > > Riccardo,
> > > >
> > > > >         I am writing since I am an OpenMP user, but I am actually
> > > > > quite curious to understand the future directions of C++.
> > > > >
> > > > > My parallel usage is actually relatively trivial, and is covered by
> > > > > OpenMP 2.5 (OpenMP 3.1, with support for iterators, would be better
> > > > > but is not available in MSVC).
> > > > > 99% of my user needs are about parallel loops, and with C++11
> > > > > lambdas I could do a lot.
> > > >
> > > > Right. It is a fairly simple transformation to turn an OpenMP
> > > > parallel loop into the equivalent parallel algorithm. We specifically
> > > > added parallel::for_loop() (not in the Parallelism TS/C++17) to
> > > > support that migration:
> > > >
> > > >     #pragma omp parallel for
> > > >     for(int i = 0; i != N; ++i)
> > > >     {
> > > >         // some iteration
> > > >     }
> > > >
> > > > Would be equivalent to
> > > >
> > > >     hpx::parallel::for_loop(
> > > >         hpx::parallel::par,
> > > >         0, N, [](int i)
> > > >         {
> > > >             // some iteration
> > > >         });
> > > >
> > > > (for more information about for_loop() see here:
> > > > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0075r0.pdf)
> > > >
> > > > > However I am really not clear on how I should equivalently handle
> > > > > "private" and "firstprivate" of OpenMP, which allow creating
> > > > > objects that persist in threadprivate memory during the whole
> > > > > length of a for loop.
> > > > > I now use OpenMP 2.5 and I have code that looks like the following:
> > > > >
> > > > > https://kratos.cimne.upc.es/projects/kratos/repository/entry/kratos/kratos/solving_strategies/builder_and_solvers/residualbased_block_builder_and_solver.h
> > > > >
> > > > > which does an OpenMP-parallel finite element assembly.
> > > > > The code I am thinking of is something like:
> > > >
> > > > [snipped code]
> > > >
> > > > > The big question is... how shall I handle the threadprivate
> > > > > scratchspace in HPX?? Lambdas do not allow doing this...
> > > > > That is, what is the equivalent of private and of firstprivate??
> > > > > Thank you in advance for any clarification or pointer to examples.
> > > >
> > > > For 'firstprivate' you can simply use lambda captures:
> > > >
> > > >     int nelements = 42;
> > > >
> > > >     for_each(par, 0, N,
> > > >         [nelements](int i)
> > > >         {
> > > >             // the captured 'nelements' is initialized from the outer
> > > >             // variable and each copy of the lambda has its own private
> > > >             // copy
> > > >             //
> > > >             // use private 'nelements' here:
> > > >             cout << nelements << endl;
> > > >         });
> > > >
> > > > Note that 'nelements' will be const by default. If you want to
> > > > modify its value, the lambda has to be made mutable:
> > > >
> > > >     int nelements = 42;
> > > >
> > > >     for_each(par, 0, N,
> > > >         [nelements](int i) mutable // makes captures non-const
> > > >         {
> > > >             ++nelements;
> > > >         });
> > > >
> > > > Please don't be fooled, however, into thinking this gives you one
> > > > variable instance per iteration. HPX runs several iterations 'in one
> > > > go' (depending on the partitioning, very much like OpenMP), so you
> > > > will create one variable instance per created partition. As long as
> > > > you don't modify the variable this shouldn't make a difference,
> > > > however.
> > > >
> > > > Emulating 'private' is even simpler. All you need is a local
> > > > variable for each iteration, after all. Thus simply creating it on
> > > > the stack inside the lambda is the solution:
> > > >
> > > >     for_loop(par, 0, N, [](int i)
> > > >     {
> > > >         // create 'private' variable
> > > >         int my_private = 0;
> > > >         // ...
> > > >     });
> > > >
> > > > This also gives you a hint on how you can have one instance of your
> > > > variable per iteration and still initialize it as if it were
> > > > firstprivate:
> > > >
> > > >     int nelements = 42;
> > > >     for_loop(par, 0, N, [nelements](int i)
> > > >     {
> > > >         // create 'private' variable
> > > >         int my_private = nelements;
> > > >         // ...
> > > >         ++my_private;   // modifies instance for this iteration only.
> > > >     });
> > > >
> > > > Things become a bit more interesting if you need reductions. Please
> > > > see the linked document above for more details, but here is a simple
> > > > example (taken from that paper):
> > > >
> > > >     float dot_saxpy(int n, float a, float x[], float y[])
> > > >     {
> > > >         float s = 0;
> > > >         for_loop(par, 0, n,
> > > >             reduction(s, 0.0f, std::plus<float>()),
> > > >             [&](int i, float& s_)
> > > >             {
> > > >                 y[i] += a*x[i];
> > > >                 s_ += y[i]*y[i];
> > > >             });
> > > >         return s;
> > > >     }
> > > >
> > > > Here 's' is the reduction variable, and 's_' is the thread-local
> > > > reference to it.
> > > >
> > > > HTH
> > > > Regards Hartmut
> > > > ---------------
> > > > http://boost-spirit.com
> > > > http://stellar.cct.lsu.edu
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Riccardo Rossi
> > > > PhD, Civil Engineer
> > > >
> > > > member of the Kratos Team: www.cimne.com/kratos
> > > > Tenure Track Lecturer at Universitat Politècnica de Catalunya,
> > > > BarcelonaTech (UPC)
> > > > Full Research Professor at International Center for Numerical Methods
> > > > in Engineering (CIMNE)
> > > >
> > > > C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
> > > > 08034 – Barcelona – Spain – www.cimne.com
> > > > T. (+34) 93 401 56 96 – skype: rougered4
> > > >
>
_______________________________________________
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
