Shortly after posting the previous message, I discovered that although there do not seem to be any performance differences between execution::seq, par, and par_unseq (at least in my own simple example code), there is indeed a clear speed-up when I use sequenced_task_policy, parallel_task_policy, and datapar_task_policy. For the random-number generation I was using before, the speed-up between 1 and 8 threads is nearly linear (~40 s with 1 thread, ~20 s with 2 threads, ~11 s with 4 threads).
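For reference, the shape of the loop I'm timing looks roughly like the
sketch below. This is a simplified stand-in rather than my actual test
code -- the fill expression is a placeholder, and the header and policy
names are what I understand HPX 1.0 to provide:

    // Sketch: with a task policy, for_loop_n returns a future instead
    // of blocking, so the call site decides where to synchronize.
    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_for_loop.hpp>

    #include <cstddef>
    #include <vector>

    int main()
    {
        namespace execution = hpx::parallel::execution;

        std::vector<int> data(150000);

        // execution::par(execution::task) yields a parallel_task_policy;
        // the iterations run as HPX tasks on the worker threads.
        auto f = hpx::parallel::for_loop_n(
            execution::par(execution::task), std::size_t(0), data.size(),
            [&](std::size_t i) { data[i] = static_cast<int>(i); });

        f.get();    // wait for completion before touching 'data'
        return 0;
    }

Replacing execution::par(execution::task) with plain execution::par gives
the blocking variant I was comparing against.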
There are obvious cache effects, and I fully expect that I will need to experiment to find the ideal grain size and data-access patterns, and to decide which loops should use the sequenced, parallel, and vectorized executors. Surprisingly, while playing with different execution policies, I discovered that in this particular example, using sequenced_task_policy for the outer loop and parallel_task_policy for the inner loop was faster by more than three orders of magnitude: filling the vectors with hpx::async calls took approximately 2 s, while the nested high-level algorithm with the above execution policies took approximately 1.1 ms. I am still not completely sure I understand why the code performs this way -- I would have expected the execution::par policy to make the code run in parallel as well. Having said that, I am feeling extremely optimistic about the potential of the HPX runtime system for obtaining fantastic performance.
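To make the nesting concrete, the structure I'm describing is roughly
the following -- again a sketch, not the code from my repository (the
sizes are the ones from my tests; the per-element RNG seeding is my own
simplification, only there to keep the parallel inner loop race-free):

    // Sketch: outer loop over the vectors, inner loop over the elements
    // of each vector. The two policies can be varied independently
    // (seq/par/par_unseq or their *_task_policy counterparts).
    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_for_loop.hpp>

    #include <cstddef>
    #include <random>
    #include <vector>

    int main()
    {
        namespace execution = hpx::parallel::execution;

        std::vector<std::vector<int>> vs(1500, std::vector<int>(150000));

        hpx::parallel::for_loop_n(
            execution::seq, std::size_t(0), vs.size(),    // outer policy
            [&](std::size_t i) {
                hpx::parallel::for_loop_n(
                    execution::par, std::size_t(0), vs[i].size(),    // inner policy
                    [&](std::size_t j) {
                        // One cheap engine per element: wasteful, but it
                        // avoids sharing one engine across concurrent
                        // iterations of the inner loop.
                        std::minstd_rand rng(
                            static_cast<unsigned>(i * 150000 + j + 1));
                        std::uniform_int_distribution<int> dist(0, 99);
                        vs[i][j] = dist(rng);
                    });
            });
        return 0;
    }

(One thing I still need to rule out for the 1.1 ms result: with the
*_task_policy variants the algorithm returns a future immediately, so a
timing that never waits on that future would measure only the task
spawning, not the actual fill.)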
Thanks,
Shmuel

On Tue, May 2, 2017 at 3:52 AM, Michael Levine <[email protected]> wrote:

> On Mon, May 1, 2017 at 7:28 PM, Hartmut Kaiser <[email protected]> wrote:
>
>> > GitHub link in my original email
>>
>> Sure, I saw that. This code is not really minimal, however... Could you
>> try to reduce it down to not more than a couple of dozen lines, please?
>
> Point is well taken. I've put up a different test case here:
> https://github.com/ShmuelLevine/hpx_matrix/blob/master/parallel/parallel_basic.cc
>
> This code attempts to use a nested hpx::parallel::for_loop_n to fill a
> std::vector<std::vector<int>> with random numbers. Running this code
> produces the following output on my system (Debian testing, GCC 6.3,
> 8 vCPUs on VMware ESXi 6.5):
>
> Case 1: Outer: execution::par
>         Inner: execution::par
>
> Generate 1500 vectors of 150000 ints
>
> Results:
>
> Case 1: 3284719 μs
> ./random_basic_test  3.36s user 0.73s system 99% cpu 4.113 total
>
> As a matter of comparison, I also wrote a not-so-minimal version of
> these test cases, to see whether changing the executors in the inner
> and outer loops makes any difference. Output from that file is as
> follows:
>
> shmuel@ssh01:~/src/hpx_test/parallel/build/gcc (*)
> > ./random_extended_test
>
> Generate random int vectors using hpx::async and wait_all
> Finished generating random int vectors with hpx::async
> Case 2: Outer: execution::seq
>         Inner: execution::seq
>
> Case 3: Outer: execution::seq
>         Inner: execution::par
>
> Case 4: Outer: execution::par
>         Inner: execution::par
>
> Case 5: Outer: execution::seq
>         Inner: execution::par_unseq
>
> Case 6: Outer: execution::par
>         Inner: execution::par_unseq
>
> Case 7: Outer: execution::par_unseq
>         Inner: execution::par_unseq
>
> Generate 1500 vectors of 150000 ints
>
> Results:
>
> Case 1: 7532637 μs
> Case 2: 8727933 μs
> Case 3: 9065286 μs
> Case 4: 9061710 μs
> Case 5: 9040860 μs
> Case 6: 9014955 μs
> Case 7: 8985571 μs
> ./random_extended_test  60.94s user 1.51s system 99% cpu 1:02.54 total
>
> Different choices of --hpx:threads actually improved the performance
> when using hpx::async, but did not have a significant impact on the
> execution speed of hpx::parallel::for_each_n (actual output edited for
> brevity):
>
> > ./random_test --hpx:threads 2
> Case 1: 3758929 μs
> Case 2: 2736323 μs
> Case 3: 3256966 μs
> ./random_test --hpx:threads 2  19.97s user 1.19s system 196% cpu 10.761 total
>
> > ./random_test --hpx:threads 4
> Case 1: 1897274 μs
> Case 2: 2737472 μs
> Case 3: 3265762 μs
> ./random_test --hpx:threads 4  32.28s user 1.27s system 377% cpu 8.893 total
>
> > ./random_test --hpx:threads 8
> Case 1: 1108285 μs
> Case 2: 2729034 μs
> Case 3: 3314790 μs
> ./random_test --hpx:threads 8  60.65s user 1.31s system 764% cpu 8.102 total
>
> Thanks for your help,
> Shmuel
>
>> Thanks!
>>
>> Regards Hartmut
>> ---------------
>> http://boost-spirit.com
>> http://stellar.cct.lsu.edu
>>
>>> On Mon, May 1, 2017 at 12:04 PM -0400, "Hartmut Kaiser"
>>> <[email protected]> wrote:
>>>
>>> Shmuel,
>>>
>>>> Thanks for the quick reply. It appears that I was not completely
>>>> clear in my original question. Specifically, I seem to have the same
>>>> problems regardless of whether or not I'm using MKL. The separate
>>>> matrix multiplication test code that I wrote was for the purpose of
>>>> determining whether or not MKL was the cause of these issues.
>>>> Based on CPU usage and on the timing of each of the three cases, I'm
>>>> still finding that:
>>>> 1) CPU usage is not more than 100%
>>>> 2) the sequential version of the multiplication function runs faster
>>>>    than the parallel and vectorized versions.
>>>> As mentioned, changing the --hpx:threads argument only adds overhead
>>>> and makes the code run much slower.
>>>
>>> Could you give me a small test code which reproduces the problem,
>>> please?
>>>
>>> Regards Hartmut
>>> ---------------
>>> http://boost-spirit.com
>>> http://stellar.cct.lsu.edu
>>>
>>>> Thanks
>>>>
>>>> From: Hartmut Kaiser
>>>> Sent: Monday, May 1, 7:40 AM
>>>> Subject: Re: [hpx-users] Troubleshooting (lack of) parallel execution
>>>> To: [email protected]
>>>>
>>>> Shmuel,
>>>>
>>>>> I'm looking for some help in understanding why my code does not
>>>>> appear to be executing in parallel with the HPX system.
>>>>
>>>> The only reason I could think of for the strange behavior you're
>>>> seeing would be that you're using the parallel version of MKL. MKL is
>>>> parallelized using OpenMP and there is no way (AFAIK) to tell it to
>>>> just use part of the machine.
>>>> So it will try to use all of the cores of the node you're running
>>>> on. That, in turn, interferes with HPX's way of binding its worker
>>>> threads to the cores itself. We have had good results when using MKL
>>>> with HPX, but only if you link with the sequential (non-parallel)
>>>> version of MKL and leave all the parallelization to HPX (by
>>>> scheduling more than one MKL task at the same time, if necessary).
>>>> I have no experience with VML, but I'd assume it's the same issue.
>>>>
>>>> HTH
>>>> Regards Hartmut
>>>> ---------------
>>>> http://boost-spirit.com
>>>> http://stellar.cct.lsu.edu
>>>>
>>>>> I first noticed the issue while working on my main codebase, in
>>>>> which I've been trying to implement a genetic-algorithm-based
>>>>> optimizer for non-linear systems. Since that code (at the present
>>>>> time) uses Intel MKL (BLAS level 3 library functions) and VML
>>>>> (vector math library), in conjunction with HPX futures, dataflow,
>>>>> etc., I wasn't sure if there was some problem caused by OpenMP or
>>>>> something similar which might have prevented the code from running
>>>>> in parallel.
>>>>>
>>>>> I then wrote a simpler test program using only HPX parallel
>>>>> algorithms to implement basic matrix-matrix multiplication. I found
>>>>> the exact same result in both cases -- my program does not appear
>>>>> to be running any of the concurrent code, neither in my original
>>>>> program using futures, continuations, and dataflow LCOs, nor in the
>>>>> simplified matrix code.
>>>>>
>>>>> I've tried using different options for --hpx:threads, but when this
>>>>> number is greater than 1, I've found that the overhead of thread
>>>>> creation and scheduling is exceedingly high and slows down the
>>>>> entire program execution. I'm not sure if that is typical behaviour
>>>>> -- I have tried to ensure that the amount of computation within a
>>>>> given asynchronous function call is fairly substantial, so that the
>>>>> real work far exceeds any overhead (although I may have
>>>>> under-estimated). Typically, in the case of my code, the
>>>>> concurrency is at the genetic-algorithm 'population' level -- for
>>>>> example, the following code snippet is where I generate random
>>>>> numbers for the crossover step of differential evolution.
>>>>> fitter_state_ is a boost::shared_ptr. (The random number generator
>>>>> engines are set up elsewhere in the code, one per trial vector, to
>>>>> ensure that the code is thread-safe.) I realize that the code below
>>>>> does not need to use dataflow, although I'm skeptical that this
>>>>> would be the cause of the code not running in parallel.
>>>>> size_t trial_idx = 0;
>>>>> CR_population_type &CR_vector_current =
>>>>>     fitter_state_->crossover_vector_set_[fitter_state_->Current_Index()];
>>>>>
>>>>> for (future_type &crossover_vector : CR_vector_current)
>>>>> {
>>>>>     crossover_vector = hpx::dataflow(hpx::launch::async, [=]() {
>>>>>         auto &rng = fitter_state_->cr_RNGs[trial_idx];
>>>>>         // cr_vector_ is of type std::vector
>>>>>         modeling::model_fitter_aliases::CR_vector_type cr_vector_;
>>>>>         cr_vector_.reserve(total_number_of_parameters_);
>>>>>
>>>>>         std::uniform_int_distribution<int> CR_dist(
>>>>>             0, fitter_state_->crossover_range);
>>>>>
>>>>>         for (int param_idx = 0; param_idx < total_number_of_parameters_;
>>>>>              ++param_idx) {
>>>>>             cr_vector_.push_back(CR_dist(rng));
>>>>>         }
>>>>>         return cr_vector_;
>>>>>     });
>>>>>
>>>>>     trial_idx++;
>>>>> }
>>>>>
>>>>> From what I can tell, the above code never runs in parallel (among
>>>>> other things, the CPU usage drops from 500% while running MKL
>>>>> functions down to 100%). Likewise, the simplistic matrix
>>>>> multiplication code using parallel algorithms also only uses 100%
>>>>> CPU.
>>>>>
>>>>> core::Matrix times_parunseq(core::Matrix &lhs, core::Matrix &rhs) {
>>>>>     if (lhs.Cols() != rhs.Rows())
>>>>>         throw std::runtime_error("Incompatible Matrix dimensions");
>>>>>
>>>>>     core::Matrix m{lhs.Rows(), rhs.Cols()};
>>>>>     Col_Iterator out_iter(&m);
>>>>>
>>>>>     // Outermost loop -- columns of lhs and output
>>>>>     hpx::parallel::for_loop_n_strided(
>>>>>         hpx::parallel::seq, 0, rhs.Cols(), rhs.Rows(),
>>>>>         [&](auto out_col_idx) {
>>>>>             hpx::parallel::for_loop_n(
>>>>>                 hpx::parallel::seq, 0, lhs.Rows(),
>>>>>                 [&](auto out_row_idx) {
>>>>>                     m(out_row_idx, out_col_idx) =
>>>>>                         hpx::parallel::transform_reduce(
>>>>>                             hpx::parallel::par_vec,
>>>>>                             Row_Iterator(&lhs, {out_row_idx, 0}),
>>>>>                             Row_Iterator(&lhs, {out_row_idx, lhs.Cols()}),
>>>>>                             Col_Iterator(&rhs, {0, out_col_idx}), 0.0f,
>>>>>                             std::plus<>(),
>>>>>                             [&](const float &a, const float &b) {
>>>>>                                 return a * b;
>>>>>                             });
>>>>>                 });
>>>>>         });
>>>>>
>>>>>     return m;
>>>>> }
>>>>>
>>>>> I've tried using seq, par, and par_unseq for the two outer loops,
>>>>> but that did not make any difference in performance. I understand
>>>>> that using parallel::execution::par and
>>>>> parallel::execution::par_unseq just means that the code *can* be
>>>>> run in parallel and/or vectorized. However, I cannot understand why
>>>>> the code does not actually run in parallel or use vectorization.
>>>>>
>>>>> The complete code I've been using is at the link below:
>>>>> https://github.com/ShmuelLevine/hpx_matrix/blob/master/matrix/matrix.cc
>>>>>
>>>>> Some insights would be greatly appreciated... this is a matter of
>>>>> considerable frustration to me...
>>>>>
>>>>> Thanks and best regards,
>>>>> Shmuel
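P.S. To make sure I understand the earlier advice about linking the
sequential MKL and leaving the parallelization to HPX: I take it the
pattern would be something like the sketch below. This is my own
minimal example, not code from this thread -- cblas_sgemm, the sizes,
and the number of tasks are placeholders, and it assumes the program is
linked against the sequential MKL libraries:

    // Sketch: several independent GEMMs, each calling single-threaded
    // (sequential) MKL, scheduled concurrently as HPX tasks so that HPX
    // owns all the parallelism.
    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>

    #include <mkl_cblas.h>

    #include <vector>

    int main()
    {
        int const n = 512;       // placeholder matrix dimension
        int const batches = 8;   // one HPX task per multiplication

        std::vector<std::vector<float>> A(batches, std::vector<float>(n * n, 1.0f));
        std::vector<std::vector<float>> B(batches, std::vector<float>(n * n, 1.0f));
        std::vector<std::vector<float>> C(batches, std::vector<float>(n * n, 0.0f));

        std::vector<hpx::future<void>> tasks;
        for (int b = 0; b != batches; ++b)
        {
            tasks.push_back(hpx::async([&, b]() {
                // Each call runs single-threaded, so the HPX worker
                // threads do not compete with MKL's OpenMP threads.
                cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0f, A[b].data(), n, B[b].data(), n,
                    0.0f, C[b].data(), n);
            }));
        }
        hpx::wait_all(tasks);
        return 0;
    }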
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
