Riccardo,

All of your explanations below make total sense, I concur.
Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Riccardo Rossi
> Sent: Saturday, September 17, 2016 4:00 PM
> To: Hartmut Kaiser <[email protected]>
> Cc: [email protected]
> Subject: RE: [hpx-users] equivalent of firstprivate
>
> Hi again
>
> On 16 Sep 2016 3:27 p.m., "Hartmut Kaiser" <[email protected]> wrote:
> >
> > Riccardo,
> >
> > > i have been thinking about your proposal, and I believe it would work
> > > for me. i have a few comments:
> > > 1 - i would also leave the default constructor for the value (that
> > > would allow emulating "private" and not just "firstprivate")
> >
> > Makes sense. However, just 'private' can be done more easily by having a
> > local variable inside the lambda.
>
> Let me disagree on this. Passing by value should be avoided as much as
> possible unless the allocation of the variable is very cheap (I only have
> experience with openmp, of course, but the underlying allocator is the
> same). The typical use of private, at least for us, is to provide a
> thread-local scratchspace: we pass it as private and resize it the first
> time a thread (I mean a hardware thread, not one of the lightweight
> threads that call the lambda) needs it. One of the nice side effects of
> this approach is that it should be easy to do this in a NUMA-friendly way.
>
> Your proposal also allows achieving this in a very elegant way, by having
> a default constructor for the TLS var. Note also that to have the right
> allocator one should have the policy at hand at construction.
>
> Indeed, I appreciate that this is a bit in contrast with your thread
> concept, which as I understand it is closer to that of a work item of a
> GPU. I hope that the usage I am thinking of is not incompatible. However,
> I really don't know what happens with TLS vars when future stuff comes
> into play... that might indeed lead to nasty troubles.
>
> Regarding reductions, your proposal looks fine to me.
>
> Regarding thread_local, I did some reading and found this interesting
> thread:
> http://stackoverflow.com/questions/22794382/are-c11-thread-local-variables-automatically-static
> It says that initialization is guaranteed to be thread-safe, which at
> least is a good start. However, the post also says that a variable
> defined as thread_local is STATIC (local to the thread), with a life span
> linked to that of the hardware thread it lives on (apparently the
> standard speaks of "thread storage duration"). Since I guess you maintain
> a thread pool that stays active for the whole length of the program, I
> understand this would imply that any thread_local var would be alive for
> the whole run of the program, which is definitely an unwanted side effect.
>
> I guess this proves that my third question was indeed stupid, so sorry
> for the noise.
>
> Regards
> Riccardo
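[A minimal standalone sketch of the "thread storage duration" behaviour Riccardo describes above; plain C++11, no HPX, and the names ('counter', 'work') are illustrative only. Each OS thread gets its own instance, initialized thread-safely on first use and destroyed only when that thread itself exits:]

    #include <iostream>
    #include <thread>

    // 'thread_local' implies static storage local to each OS thread:
    // one instance per thread, alive until the thread exits.
    thread_local int counter = 0;

    void work(char const* who)
    {
        ++counter;                  // touches this thread's own instance
        std::cout << who << ": " << counter << "\n";
    }

    int main()
    {
        std::thread t1([] { work("t1"); work("t1"); });  // prints t1: 1, t1: 2
        t1.join();
        std::thread t2([] { work("t2"); });              // prints t2: 1 (fresh instance)
        t2.join();
        return 0;
    }

[With a worker pool that lives as long as the runtime, as HPX's does, such an instance would indeed persist for the whole run of the program.]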
> >
> > Thanks!
> >
> > Regards Hartmut
> > ---------------
> > http://boost-spirit.com
> > http://stellar.cct.lsu.edu
> >
> > > Riccardo
> > >
> > > On Mon, Sep 12, 2016 at 5:18 PM, Riccardo Rossi <[email protected]> wrote:
> > >
> > > Ok,
> > > i think that with your proposal it must work (thanks)
> > >
> > > regarding allocation, the thing is that having the allocator know of
> > > the run policy (or the policy know of the allocator) could allow you
> > > to do smart things concerning what gets allocated. The thing is that
> > > when you read data in a finite element program, even if you allocate
> > > first touch, you have no way to ensure that at the moment of using the
> > > data it will be used in the same order. Having the allocator and the
> > > policy persist through the whole analysis gives a good way to solve
> > > this problem.
> > >
> > > anyway... thank you very much for your time!
> > > regards
> > > Riccardo
> > >
> > > On Mon, Sep 12, 2016 at 2:17 PM, Hartmut Kaiser <[email protected]> wrote:
> > >
> > > > To my understanding an openmp for loop is expanded to something like:
> > > >
> > > > Vector<data_type> private_data(nthreads);
> > > > // copy data to the private data array, once per thread
> > > > for (int block_counter = 0; block_counter < blocks; ++block_counter)
> > > > {
> > > >     for (begin, end, ...)
> > > >     {
> > > >         // capture private_data[my_thread_id]
> > > >         // ... do work using the captured data
> > > >     }
> > > > }
> > > >
> > > > This way the copying is done once per thread, not once per call nor
> > > > once per block. Of course I could emulate this, if I had access to a
> > > > function like omp_get_thread_num() giving me the id of the current
> > > > worker (I should also know the total number of workers to define the
> > > > private_data array). Is this data available?
> > > > Please do note that I am just a user, so my understanding of the
> > > > specs may be faulty. My apologies if that's the case.
> > >
> > > I think your understanding of OpenMP firstprivate is correct. Also
> > > you're right, the solution I gave will create one copy of the lambda
> > > per iteration-partition.
> > >
> > > In order to have exactly one copy per kernel thread, you'd need to
> > > create a helper class which allocates the per-thread data, e.g.
> > > something like:
> > >
> > > #include <hpx/hpx.hpp>
> > > #include <vector>
> > >
> > > template <typename T>
> > > struct firstprivate_emulation
> > > {
> > >     explicit firstprivate_emulation(T const& init)
> > >       : data_(hpx::get_os_thread_count(), init)
> > >     {
> > >     }
> > >
> > >     T& access()
> > >     {
> > >         std::size_t idx = hpx::get_worker_thread_num();
> > >         HPX_ASSERT(idx < hpx::get_os_thread_count());
> > >         return data_[idx];
> > >     }
> > >
> > >     T const& access() const
> > >     {
> > >         std::size_t idx = hpx::get_worker_thread_num();
> > >         HPX_ASSERT(idx < hpx::get_os_thread_count());
> > >         return data_[idx];
> > >     }
> > >
> > > private:
> > >     std::vector<T> data_;
> > > };
> > >
> > > Matrix expensive_to_construct_scratchspace;
> > > firstprivate_emulation<Matrix> data(expensive_to_construct_scratchspace);
> > >
> > > for_loop(par, 0, N,
> > >     [&](int i)
> > >     {
> > >         // access 'data' to get the thread-local copy of the outer Matrix
> > >         Matrix& m = data.access();
> > >         m[i][j] = ...
> > >     });
> > >
> > > > Btw, I much like your idea of allocators for Numa locality. That's a
> > > > vast improvement over first touch, where you never really know who's
> > > > the owner!!
> > >
> > > Heh, even if it uses first touch internally itself? :-)
> > >
> > > HTH
> > > Regards Hartmut
> > > ---------------
> > > http://boost-spirit.com
> > > http://stellar.cct.lsu.edu
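[Riccardo's first comment in the newest message above asks to also keep a default constructor, so the same helper can emulate 'private': each worker default-constructs an empty scratchspace and sizes it on first use, which also performs the first touch from that worker's NUMA domain. A sketch of that variant, reusing the HPX calls from the helper above; 'private_emulation', 'assemble' and 'required_size' are made-up names for illustration, not HPX API:]

    #include <hpx/hpx.hpp>
    #include <cstddef>
    #include <vector>

    template <typename T>
    struct private_emulation
    {
        // default-construct one (empty) instance per kernel thread;
        // nothing expensive happens here yet
        private_emulation() : data_(hpx::get_os_thread_count()) {}

        T& access()
        {
            std::size_t idx = hpx::get_worker_thread_num();
            HPX_ASSERT(idx < data_.size());
            return data_[idx];
        }

    private:
        std::vector<T> data_;
    };

    void assemble(std::size_t N, std::size_t required_size)
    {
        private_emulation<std::vector<double>> scratch;

        hpx::parallel::for_loop(hpx::parallel::par, std::size_t(0), N,
            [&](std::size_t i)
            {
                std::vector<double>& s = scratch.access();
                if (s.size() < required_size)   // first use on this worker
                    s.resize(required_size);    // pages are first-touched here,
                                                // i.e. on this worker's NUMA domain
                // ... use 's' as scratchspace for element i ...
            });
    }

[Each worker only ever touches its own slot of the vector, so no synchronization is needed beyond constructing the helper itself.]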
> > > >
> > > > Regards
> > > > Riccardo
> > > >
> > > > On 11 Sep 2016 7:23 p.m., "Hartmut Kaiser" <[email protected]> wrote:
> > > >
> > > > > first of all thank you very much for your quick and detailed
> > > > > answer. Nevertheless i think i did not explain my concern.
> > > > > using your code snippet, imagine i have:
> > > > >
> > > > > int nelements = 42;
> > > > > Matrix expensive_to_construct_scratchspace;
> > > > >
> > > > > for_loop(par, 0, N,
> > > > >     [nelements, expensive_to_construct_scratchspace](int i)
> > > > >     {
> > > > >         // the captured 'nelements' is initialized from the outer
> > > > >         // variable and each copy of the lambda has its own private
> > > > >         // copy
> > > > >
> > > > > HERE, as i understand it, the lambda would capture my
> > > > > "expensive_to_construct_scratchspace" by value, which as i
> > > > > understand implies that i would have one allocation for every "i".
> > > > > --> are you telling me that this is not the case? If so that would
> > > > > be a problem, since constructing it would be very expensive.
> > > >
> > > > No, that would be the case, your analysis is correct.
> > > >
> > > > > On the contrary, if the lambda does not copy by value... what if i
> > > > > do need that behaviour?
> > > > >
> > > > > note that i could definitely construct a blocked range of iterators
> > > > > and define a lambda acting on a given range of iterators, however
> > > > > that would be very very verbose...
> > > >
> > > > Looks like I misunderstood what firstprivate actually does...
> > > >
> > > > OTOH, in the openmp spec I read:
> > > >
> > > >     firstprivate: Specifies that each thread should have its own
> > > >     instance of a variable, and that the variable should be
> > > >     initialized with the value of the variable, because it exists
> > > >     before the parallel construct.
> > > >
> > > > So each thread gets its own copy, which implies copying/allocation.
> > > > What am I missing?
> > > >
> > > > If however you want to share the variable in between threads, just
> > > > capture it by reference:
> > > >
> > > > Matrix expensive_to_construct_scratchspace;
> > > > for_loop(par, 0, N,
> > > >     [&expensive_to_construct_scratchspace](int i)
> > > >     {
> > > >         // ... operate on the shared instance ...
> > > >     });
> > > >
> > > > In this case you'd be responsible for making any operations on the
> > > > shared variable thread safe, however.
> > > >
> > > > Is that what you need?
> > > >
> > > > Regards Hartmut
> > > > ---------------
> > > > http://boost-spirit.com
> > > > http://stellar.cct.lsu.edu
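[A minimal sketch of the thread-safety obligation that comes with the capture-by-reference route just shown; the 'Matrix' alias, the 'update_shared' function, and the accumulated 'total' are illustrative assumptions, not code from this thread:]

    #include <hpx/hpx.hpp>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    using Matrix = std::vector<double>;   // placeholder for the real type

    void update_shared(std::size_t N)
    {
        Matrix shared(N, 0.0);            // captured by reference below
        double total = 0.0;
        std::mutex mtx;                   // an HPX-provided lock (e.g.
                                          // hpx::lcos::local::spinlock) may be
                                          // preferable, since blocking in a
                                          // lightweight task stalls the whole
                                          // worker thread

        hpx::parallel::for_loop(hpx::parallel::par, std::size_t(0), N,
            [&](std::size_t i)
            {
                double contribution = double(i);  // compute without the lock

                shared[i] = contribution;  // each iteration writes a distinct
                                           // slot: safe without a lock

                std::lock_guard<std::mutex> lock(mtx);  // serialize the truly
                total += contribution;                   // shared update
            });
    }

[For a plain accumulation like 'total', the reduction support shown further down the thread is the better tool.]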
> > > > >
> > > > > anyway,
> > > > > thanks again for your attention
> > > > > Riccardo
> > > > >
> > > > > On Sun, Sep 11, 2016 at 4:48 PM, Hartmut Kaiser <[email protected]> wrote:
> > > > >
> > > > > Riccardo,
> > > > >
> > > > > > i am writing since i am an OpenMP user, but i am actually quite
> > > > > > curious to understand the future directions of c++.
> > > > > >
> > > > > > my parallel usage is actually relatively trivial, and is covered
> > > > > > by OpenMP 2.5 (openmp 3.1 with support for iterators would be
> > > > > > better, but it is not available in msvc). 99% of my user needs
> > > > > > are about parallel loops, and with c++11 lambdas i could do a
> > > > > > lot.
> > > > >
> > > > > Right. It is a fairly simple transformation to turn an OpenMP
> > > > > parallel loop into the equivalent parallel algorithm. We
> > > > > specifically added parallel::for_loop() (not in the Parallelism
> > > > > TS/C++17) to support that migration:
> > > > >
> > > > > #pragma omp parallel for
> > > > > for (int i = 0; i != N; ++i)
> > > > > {
> > > > >     // some iteration
> > > > > }
> > > > >
> > > > > would be equivalent to
> > > > >
> > > > > hpx::parallel::for_loop(
> > > > >     hpx::parallel::par,
> > > > >     0, N, [](int i)
> > > > >     {
> > > > >         // some iteration
> > > > >     });
> > > > >
> > > > > (for more information about for_loop() see here:
> > > > > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0075r0.pdf)
> > > > >
> > > > > > However i am really not clear on how i should equivalently handle
> > > > > > "private" and "firstprivate" of OpenMP, which allow one to create
> > > > > > objects that persist in threadprivate memory during the whole
> > > > > > length of a for loop. I now use OpenMP 2.5 and i have a code that
> > > > > > looks like the following:
> > > > > >
> > > > > > https://kratos.cimne.upc.es/projects/kratos/repository/entry/kratos/kratos/solving_strategies/builder_and_solvers/residualbased_block_builder_and_solver.h
> > > > > >
> > > > > > which does an openmp parallel Finite Element assembly. The code i
> > > > > > am thinking of is something like:
> > > > >
> > > > > [snipped code]
> > > > >
> > > > > > the big question is... how shall i handle the threadprivate
> > > > > > scratchspace in HPX?? Lambdas do not allow one to do this...
> > > > > > that is, what is the equivalent of private & of firstprivate??
> > > > > > thank you in advance for any clarification or pointer to examples
> > > > >
> > > > > For 'firstprivate' you can simply use lambda captures:
> > > > >
> > > > > int nelements = 42;
> > > > >
> > > > > for_loop(par, 0, N,
> > > > >     [nelements](int i)
> > > > >     {
> > > > >         // the captured 'nelements' is initialized from the outer
> > > > >         // variable and each copy of the lambda has its own private
> > > > >         // copy
> > > > >         //
> > > > >         // use private 'nelements' here:
> > > > >         std::cout << nelements << std::endl;
> > > > >     });
> > > > >
> > > > > Note that 'nelements' will be const by default. If you want to
> > > > > modify its value, the lambda has to be made mutable:
> > > > >
> > > > > int nelements = 42;
> > > > >
> > > > > for_loop(par, 0, N,
> > > > >     [nelements](int i) mutable   // makes captures non-const
> > > > >     {
> > > > >         ++nelements;
> > > > >     });
> > > > >
> > > > > Don't be fooled, however, into thinking that this gives you one
> > > > > variable instance per iteration. HPX runs several iterations 'in
> > > > > one go' (depending on the partitioning, very much like openmp), so
> > > > > you will create one variable instance per created partition. As
> > > > > long as you don't modify the variable this shouldn't make a
> > > > > difference, however.
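[One way to see the "one instance per partition" point empirically is to capture an object that counts its own copies. A sketch under stated assumptions: 'copy_probe' is a made-up name, hpx_main.hpp is assumed for starting the runtime, and the printed count also includes a few bookkeeping copies made while the algorithm is set up, so read it as "grows with the number of partitions, not with N":]

    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>
    #include <atomic>
    #include <iostream>

    std::atomic<int> copies(0);

    struct copy_probe
    {
        copy_probe() {}
        copy_probe(copy_probe const&) { ++copies; }  // count every copy made
    };

    int main()
    {
        copy_probe probe;
        hpx::parallel::for_loop(hpx::parallel::par, 0, 1000000,
            [probe](int) {});   // the probe travels with every copy of the lambda

        std::cout << "lambda copied " << copies.load()
                  << " times for 1000000 iterations\n";
        return 0;
    }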
> > > > >
> > > > > Emulating 'private' is even simpler. All you need is a local
> > > > > variable for each iteration, after all. Thus simply creating it on
> > > > > the stack inside the lambda is the solution:
> > > > >
> > > > > for_loop(par, 0, N, [](int i)
> > > > > {
> > > > >     // create 'private' variable
> > > > >     int my_private = 0;
> > > > >     // ...
> > > > > });
> > > > >
> > > > > This also gives you a hint on how you can have one instance of your
> > > > > variable per iteration and still initialize it like it was
> > > > > firstprivate:
> > > > >
> > > > > int nelements = 42;
> > > > > for_loop(par, 0, N, [nelements](int i)
> > > > > {
> > > > >     // create 'private' variable
> > > > >     int my_private = nelements;
> > > > >     // ...
> > > > >     ++my_private;   // modifies the instance for this iteration only
> > > > > });
> > > > >
> > > > > Things become a bit more interesting if you need reductions. Please
> > > > > see the linked document above for more details, but here is a
> > > > > simple example (taken from that paper):
> > > > >
> > > > > float dot_saxpy(int n, float a, float x[], float y[])
> > > > > {
> > > > >     float s = 0;
> > > > >     for_loop(par, 0, n,
> > > > >         reduction(s, 0.0f, std::plus<float>()),
> > > > >         [&](int i, float& s_)
> > > > >         {
> > > > >             y[i] += a*x[i];
> > > > >             s_ += y[i]*y[i];
> > > > >         });
> > > > >     return s;
> > > > > }
> > > > >
> > > > > Here 's' is the reduction variable, and 's_' is the thread-local
> > > > > reference to it.
> > > > >
> > > > > HTH
> > > > > Regards Hartmut
> > > > > ---------------
> > > > > http://boost-spirit.com
> > > > > http://stellar.cct.lsu.edu
> > > > >
> > > > > --
> > > > > Riccardo Rossi
> > > > > PhD, Civil Engineer
> > > > >
> > > > > member of the Kratos Team: www.cimne.com/kratos
> > > > > Tenure Track Lecturer at Universitat Politècnica de Catalunya, BarcelonaTech (UPC)
> > > > > Full Research Professor at International Center for Numerical Methods in Engineering (CIMNE)
> > > > >
> > > > > C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9
> > > > > 08034 – Barcelona – Spain – www.cimne.com
> > > > > T. (+34) 93 401 56 96   skype: rougered4
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
