Re: [hpx-users] Troubleshooting (lack of) parallel execution

Michael Levine Tue, 02 May 2017 00:53:42 -0700

On Mon, May 1, 2017 at 7:28 PM, Hartmut Kaiser <[email protected]>
wrote:


>
> > GitHub link in my original email
>
> Sure, I saw that. This code is not really minimal, however... Could you
> try to reduce it down to not more than a couple of dozen lines, please?
>
>
Point is well taken.  I've put up a different test case here:
https://github.com/ShmuelLevine/hpx_matrix/blob/master/parallel/parallel_basic.cc

This code attempts to use a nested hpx::parallel::for_loop_n to fill a
std::vector<std::vector<int>> with random numbers.  Running this code
provides the following output on my system (running Debian testing, GCC
6.3, 8 vCPUs on VMware ESXi 6.5):

Case 1:       Outer: execution::par
              Inner: execution::par

Generate 1500 vectors of 150000 ints


Results:

Case 1: 3284719 μs
./random_basic_test  3.36s user 0.73s system 99% cpu 4.113 total
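
For readers without the repository at hand, here is a rough sketch of the kind of test described above. It is not the code from the linked file: std::async from the standard library stands in for the HPX facilities, and the function name fill_rows, the value range 0..99, and the per-row seeding scheme are assumptions made purely for illustration. Each row gets its own engine so that tasks share no mutable RNG state.

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <random>
#include <vector>

// Hypothetical helper (not from the repository): fill `rows` vectors of
// `cols` random ints, launching one task per row.
// std::async is a standard-library stand-in for hpx::async.
std::vector<std::vector<int>> fill_rows(std::size_t rows, std::size_t cols,
                                        std::uint32_t base_seed) {
  std::vector<std::future<std::vector<int>>> tasks;
  tasks.reserve(rows);

  for (std::size_t r = 0; r < rows; ++r) {
    tasks.push_back(std::async(std::launch::async, [r, cols, base_seed] {
      // One engine per row, so tasks never touch shared RNG state.
      std::mt19937 rng(base_seed + static_cast<std::uint32_t>(r));
      std::uniform_int_distribution<int> dist(0, 99);
      std::vector<int> row;
      row.reserve(cols);
      for (std::size_t c = 0; c < cols; ++c) {
        row.push_back(dist(rng));
      }
      return row;
    }));
  }

  // Collect the rows in order; get() blocks until each task has finished.
  std::vector<std::vector<int>> out;
  out.reserve(rows);
  for (auto &task : tasks) {
    out.push_back(task.get());
  }
  return out;
}
```

Note that std::launch::async typically starts one OS thread per task, so with 1500 rows a plain standard-library version would want to group rows into chunks; a lightweight-task runtime such as HPX is meant to avoid that cost.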

For comparison, I also wrote a not-so-minimal version of these test cases,
to see whether changing the execution policies of the inner and outer loops
makes any difference.  Output from that file is as follows:

shmuel@ssh01:~/src/hpx_test/parallel/build/gcc (*)
> ./random_extended_test


Generate random int vectors using hpx::async and wait_all
Finished generating random int vectors with hpx::async
Case 2:       Outer: execution::seq
              Inner: execution::seq

Case 3:       Outer: execution::seq
              Inner: execution::par

Case 4:       Outer: execution::par
              Inner: execution::par

Case 5:       Outer: execution::seq
              Inner: execution::par_unseq

Case 6:       Outer: execution::par
              Inner: execution::par_unseq

Case 7:       Outer: execution::par_unseq
              Inner: execution::par_unseq

Generate 1500 vectors of 150000 ints


Results:

Case 1: 7532637 μs
Case 2: 8727933 μs
Case 3: 9065286 μs
Case 4: 9061710 μs
Case 5: 9040860 μs
Case 6: 9014955 μs
Case 7: 8985571 μs
./random_extended_test  60.94s user 1.51s system 99% cpu 1:02.54 total
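
As a quick check on what the timing summaries above imply, the average number of busy cores can be estimated as (user + system) CPU time divided by wall-clock time. The helper name busy_cores is made up for this sketch; the figures fed into it are the ones printed above.

```cpp
#include <cassert>

// Average number of busy cores implied by a `time`-style summary:
// (user + system) CPU seconds divided by wall-clock seconds.
// Hypothetical helper, named for illustration only.
double busy_cores(double user_s, double system_s, double wall_s) {
  assert(wall_s > 0.0);  // A zero or negative wall time is not meaningful.
  return (user_s + system_s) / wall_s;
}
```

With the numbers above, busy_cores(3.36, 0.73, 4.113) and busy_cores(60.94, 1.51, 62.54) (1:02.54 is 62.54 s) both come out just under 1.0, which matches the reported 99% cpu: on average about one core was busy in both runs.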

Different choices of --hpx:threads actually seemed to improve the
performance when using hpx::async, but did not have a significant impact on
the execution speed of hpx::parallel::for_loop_n
(actual output edited for brevity)

> ./random_test --hpx:threads 2
Case 1: 3758929 μs
Case 2: 2736323 μs
Case 3: 3256966 μs
./random_test --hpx:threads 2  19.97s user 1.19s system 196% cpu 10.761 total

> ./random_test --hpx:threads 4
Case 1: 1897274 μs
Case 2: 2737472 μs
Case 3: 3265762 μs
./random_test --hpx:threads 4  32.28s user 1.27s system 377% cpu 8.893 total

> ./random_test --hpx:threads 8
Case 1: 1108285 μs
Case 2: 2729034 μs
Case 3: 3314790 μs
./random_test --hpx:threads 8  60.65s user 1.31s system 764% cpu 8.102 total
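
To put the three runs above side by side, the speedup of each case can be computed relative to the 2-thread run. The helper name speedup is an assumption for this sketch; the inputs are the microsecond figures printed above.

```cpp
#include <cassert>

// Speedup of a run relative to a baseline run, both given as elapsed
// times in microseconds. Values above 1.0 mean the run was faster.
// Hypothetical helper, named for illustration only.
double speedup(double baseline_us, double elapsed_us) {
  assert(elapsed_us > 0.0);  // Guard against division by zero.
  return baseline_us / elapsed_us;
}
```

For Case 1, speedup(3758929, 1897274) is about 1.98 going from 2 to 4 threads and speedup(3758929, 1108285) is about 3.39 going from 2 to 8 threads. For Case 2, speedup(2736323, 2729034) is about 1.00, and for Case 3, speedup(3256966, 3314790) is about 0.98, i.e. those two cases are essentially unaffected by the thread count.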

Thanks for your help,
Shmuel




> Thanks!
> Regards Hartmut
> ---------------
> http://boost-spirit.com
> http://stellar.cct.lsu.edu
>
>
> > On Mon, May 1, 2017 at 12:04 PM -0400, "Hartmut Kaiser"
> > <[email protected]> wrote:
> > Shmuel,
> >
> > > Thanks for the quick reply. It appears that I was not completely clear
> > > in my original question. Specifically, I seem to have the same problems
> > > regardless of whether or not I'm using MKL. The separate matrix
> > > multiplication test code that I wrote was for the purposes of
> > > determining whether or not MKL was the cause of these issues.
> > > Based on cpu usage and on timing of each of the three cases, I'm still
> > > finding that:
> > > 1) cpu usage is not more than 100%
> > > 2) the sequential version of the multiplication function runs faster
> > > than the parallel and vectorized versions.
> > > As mentioned, changing the hpx:threads argument only adds overhead and
> > > makes the code run much slower.
> >
> > Could you give me a small test code which reproduces the problem, please?
> >
> > Regards Hartmut
> > ---------------
> > http://boost-spirit.com
> > http://stellar.cct.lsu.edu
> >
> >
> > > Thanks
> > > From: Hartmut Kaiser
> > > Sent: Monday, May 1, 7:40 AM
> > > Subject: Re: [hpx-users] Troubleshooting (lack of) parallel execution
> > > To: [email protected]
> > > Shmuel,
> > >
> > > > I'm looking for some help in understanding why my code does not
> > > > appear to be executing in parallel with the HPX system.
> > >
> > > The only reason I could think of for the strange behavior you're
> > > seeing would be that you're using the parallel version of MKL. MKL is
> > > parallelized using OpenMP and there is no way (AFAIK) to tell it to
> > > just use part of the machine. So it will try to use all of the cores
> > > of the node you're running on. That in turn interferes with HPX's way
> > > of binding its worker-threads to the cores itself. We have had good
> > > results when using MKL with HPX, but only if you link with the
> > > sequential (non-parallel) version of MKL and leave all the
> > > parallelization to HPX (by scheduling more than one MKL task at the
> > > same time, if necessary). I have no experience with VML, but I'd
> > > assume it's the same issue.
> > >
> > > HTH
> > > Regards Hartmut
> > > ---------------
> > > http://boost-spirit.com
> > > http://stellar.cct.lsu.edu
> > >
> > > > I've first noticed the issue while working on my main codebase, in
> > > > which I've been trying to implement a genetic-algorithm-based
> > > > optimizer for non-linear systems. Since that code (at the present
> > > > time) uses Intel MKL (BLAS level 3 library functions) and VML
> > > > (vector math library), in conjunction with HPX futures, dataflow,
> > > > etc., I wasn't sure if there was some problem caused by OpenMP or
> > > > something similar, which might have prevented the code from running
> > > > in parallel.
> > > >
> > > > I then wrote a simpler test program using only HPX parallel
> > > > algorithms to implement basic matrix-matrix multiplication. I found
> > > > the exact same result in both cases - my program does not appear to
> > > > be running any of the concurrent code -- neither in the case of my
> > > > original program using futures, continuations, and dataflow lcos,
> > > > nor in the simplified matrix code.
> > > >
> > > > I've tried using different options for --hpx:threads, but when this
> > > > number is greater than 1, I've found that the overhead of thread
> > > > creation and scheduling is exceedingly high and slows down the
> > > > entire program execution. I'm not sure if that is typical behaviour
> > > > -- I have tried to ensure that the amount of computation within a
> > > > given asynchronous function call is fairly substantial so that the
> > > > real work is far in excess of any overhead (although I may have
> > > > under-estimated). Typically, in the case of my code, the concurrency
> > > > is at the genetic-algorithm 'population' level - for example, the
> > > > following code snippet is where I generate random numbers for the
> > > > crossover step of differential evolution. fitter_state_ is a
> > > > boost::shared_ptr. (The random number generator engines are set up
> > > > elsewhere in the code and there is one for each trial vector, to
> > > > ensure that the code is thread-safe). I realize that the code below
> > > > does not need to use dataflow, although I'm skeptical that this
> > > > would be the cause for the code not running in parallel.
> > > >
> > > >   size_t trial_idx = 0;
> > > >   CR_population_type &CR_vector_current =
> > > >       fitter_state_->crossover_vector_set_[fitter_state_->Current_Index()];
> > > >
> > > >   for (future_type &crossover_vector : CR_vector_current)
> > > >   {
> > > >     crossover_vector = hpx::dataflow(hpx::launch::async, [=]() {
> > > >       auto &rng = fitter_state_->cr_RNGs[trial_idx];
> > > >       modeling::model_fitter_aliases::CR_vector_type cr_vector_; //
> > > >       cr_vector is of type std::vector
> > > >       cr_vector_.reserve(total_number_of_parameters_);
> > > >
> > > >       std::uniform_int_distribution CR_dist(
> > > >           0, fitter_state_->crossover_range);
> > > >
> > > >       for (int param_idx = 0; param_idx < total_number_of_parameters_;
> > > >            ++param_idx) {
> > > >         cr_vector_.push_back(CR_dist(rng));
> > > >       }
> > > >       return cr_vector_;
> > > >     });
> > > >
> > > >     trial_idx++;
> > > >   }
> > > >
> > > > From what I can tell, the above code never runs in parallel (among
> > > > other things, the CPU usage drops from 500% while running MKL
> > > > functions down to 100%). Likewise, the simplistic matrix
> > > > multiplication code using parallel algorithms also only uses 100%
> > > > CPU.
> > > >
> > > >   core::Matrix times_parunseq(core::Matrix &lhs, core::Matrix &rhs) {
> > > >
> > > >     if (lhs.Cols() != rhs.Rows())
> > > >       throw std::runtime_error("Incompatible Matrix dimensions");
> > > >
> > > >     core::Matrix m{lhs.Rows(), rhs.Cols()};
> > > >     Col_Iterator out_iter(&m);
> > > >
> > > >     // Outermost-loop -- columns of lhs and output
> > > >     hpx::parallel::for_loop_n_strided(
> > > >         hpx::parallel::seq, 0, rhs.Cols(), rhs.Rows(),
> > > >         [&](auto out_col_idx) {
> > > >
> > > >           hpx::parallel::for_loop_n(
> > > >               hpx::parallel::seq, 0, lhs.Rows(),
> > > >               [&](auto out_row_idx) {
> > > >
> > > >                 m(out_row_idx, out_col_idx) =
> > > >                     hpx::parallel::transform_reduce(
> > > >                         hpx::parallel::par_vec,
> > > >                         Row_Iterator(&lhs, {out_row_idx, 0}),
> > > >                         Row_Iterator(&lhs, {out_row_idx, lhs.Cols()}),
> > > >                         Col_Iterator(&rhs, {0, out_col_idx}), 0.0f,
> > > >                         std::plus(),
> > > >                         [&](const float &a, const float &b) {
> > > >                           return a * b;
> > > >                         });
> > > >               });
> > > >
> > > >         });
> > > >
> > > >     return m;
> > > >   }
> > > >
> > > > I've tried using seq, par, par_unseq for the 2 outer loops, but that
> > > > did not make any difference in the performance. I understand that
> > > > using parallel::execution::par and parallel::execution::par_unseq
> > > > just means that the code *can* be run in parallel and/or vectorized.
> > > > However, I cannot understand why the code does not actually run in
> > > > parallel or using vectorization.
> > > >
> > > > The complete code I've been using is at the link below:
> > > > https://github.com/ShmuelLevine/hpx_matrix/blob/master/matrix/matrix.cc
> > > >
> > > > Some insights would be greatly appreciated... this is a matter of
> > > > considerable frustration to me...
> > > >
> > > > Thanks and best regards,
> > > > Shmuel
> > > _______________________________________________
> > > hpx-users mailing list
> > > [email protected]
> > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
>
>
>

_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
