Hi Hartmut, 

Thanks for the quick reply. It appears I wasn't completely clear in my 
original question: I see the same problems regardless of whether or not I'm 
using MKL. The separate matrix multiplication test code was written precisely 
to determine whether MKL was the cause of these issues. 
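
Just to make sure I've understood your suggestion about MKL correctly: the 
idea is to link against the sequential MKL and let HPX supply all of the 
concurrency by running several MKL calls at the same time -- roughly along the 
lines of the sketch below? (This is only a sketch, not my actual code: the 
cblas_dgemm call is the standard CBLAS interface, and the matrix sizes, task 
count, and HPX header names are placeholders based on my reading of the HPX 
1.x headers.)

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>   // hpx::future, hpx::wait_all

#include <mkl.h>                  // sequential MKL, CBLAS interface

#include <vector>

// One task = one sequential MKL matrix multiplication; HPX provides the
// parallelism by running several of these tasks concurrently.
void gemm_task(int n, double const* a, double const* b, double* c)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
}

int main()
{
    int const n = 512;          // placeholder problem size
    int const num_tasks = 8;    // placeholder task count

    std::vector<std::vector<double>> a(num_tasks, std::vector<double>(n * n, 1.0));
    std::vector<std::vector<double>> b(num_tasks, std::vector<double>(n * n, 2.0));
    std::vector<std::vector<double>> c(num_tasks, std::vector<double>(n * n, 0.0));

    std::vector<hpx::future<void>> tasks;
    for (int i = 0; i != num_tasks; ++i)
        tasks.push_back(hpx::async(
            &gemm_task, n, a[i].data(), b[i].data(), c[i].data()));

    hpx::wait_all(tasks);
    return 0;
}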


Based on CPU usage and on the timing of each of the three cases, I'm still 
finding that:

1) CPU usage never exceeds 100%

2) the sequential version of the multiplication function runs faster than the 
parallel and vectorized versions. 


As mentioned, setting --hpx:threads to anything greater than 1 only adds 
overhead and makes the code run much slower. 
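
For reference, even a stripped-down check along the following lines (again 
just a sketch, not my benchmark; the header and execution-policy names are as 
I understand the HPX 1.x API) should make it obvious whether the par policy 
ever spreads work across more than one worker thread:

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_for_loop.hpp>

#include <atomic>
#include <cstdint>
#include <iostream>

int main()
{
    std::atomic<std::uint64_t> total{0};

    // 64 independent chunks of CPU-bound busy work; with --hpx:threads=N (N > 1)
    // overall CPU usage should rise well above 100% if the policy is effective.
    hpx::parallel::for_loop(hpx::parallel::execution::par, 0, 64, [&](int) {
        std::uint64_t local = 0;
        for (std::uint64_t i = 0; i != 50000000ULL; ++i)
            local += i % 7;
        total += local;
    });

    std::cout << "checksum: " << total.load() << "\n";
    return 0;
}

I'd expect that running this with --hpx:threads=4 while watching top (or 
adding --hpx:print-bind to see where HPX pins its worker threads) would show 
whether anything executes concurrently at all.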


Thanks 



From: Hartmut Kaiser

Sent: Monday, May 1, 7:40 AM

Subject: Re: [hpx-users] Troubleshooting (lack of) parallel execution

To: [email protected]



Shmuel,

> I'm looking for some help in understanding why my code does not appear to
> be executing in parallel with the HPX system.

The only reason I could think of for the strange behavior you're seeing would 
be that you're using the parallel version of MKL. MKL is parallelized using 
OpenMP and there is no way (AFAIK) to tell it to just use part of the machine. 
So it will try to use all of the cores of the node you're running on. That in 
turn interferes with HPX's way of binding its worker threads to the cores 
itself.

We have had good results when using MKL with HPX, but only if you link with 
the sequential (non-parallel) version of MKL and leave all the parallelization 
to HPX (by scheduling more than one MKL task at the same time, if necessary). 
I have no experience with VML, but I'd assume it's the same issue.

HTH
Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu

> I've first noticed the issue while working on my main codebase, in which
> I've been trying to implement a genetic-algorithm-based optimizer for
> non-linear systems. Since that code (at the present time) uses Intel MKL
> (BLAS level 3 library functions) and VML (vector math library), in
> conjunction with HPX futures, dataflow, etc., I wasn't sure if there was
> some problem caused by OpenMP or something similar, which might have
> prevented the code from running in parallel.
>
> I then wrote a simpler test program using only HPX parallel algorithms to
> implement basic matrix-matrix multiplication. I found the exact same
> result in both cases - my program does not appear to be running any of
> the concurrent code -- neither in the case of my original program using
> futures, continuations, and dataflow lcos, nor in the simplified matrix
> code.
>
> I've tried using different options for --hpx:threads, but when this
> number is greater than 1, I've found that the overhead of thread creation
> and scheduling is exceedingly high and slows down the entire program
> execution. I'm not sure if that is typical behaviour -- I have tried to
> ensure that the amount of computation within a given asynchronous
> function call is fairly substantial so that the real work is far in
> excess of any overhead (although I may have under-estimated). Typically,
> in the case of my code, the concurrency is at the genetic-algorithm
> 'population' level - for example, the following code snippet is where I
> generate random numbers for the crossover step of differential evolution.
> fitter_state_ is a boost::shared_ptr. (The random number generator
> engines are set up elsewhere in the code and there is one for each trial
> vector, to ensure that the code is thread-safe.) I realize that the code
> below does not need to use dataflow, although I'm skeptical that this
> would be the cause for the code not running in parallel.
>
> size_t trial_idx = 0;
> CR_population_type &CR_vector_current =
>     fitter_state_->crossover_vector_set_[fitter_state_->Current_Index()];
>
> for (future_type &crossover_vector : CR_vector_current)
> {
>     crossover_vector = hpx::dataflow(hpx::launch::async, [=]() {
>         auto &rng = fitter_state_->cr_RNGs[trial_idx];
>         modeling::model_fitter_aliases::CR_vector_type cr_vector_;  // cr_vector is of type std::vector
>         cr_vector_.reserve(total_number_of_parameters_);
>
>         std::uniform_int_distribution CR_dist(
>             0, fitter_state_->crossover_range);
>
>         for (int param_idx = 0; param_idx < total_number_of_parameters_;
>              ++param_idx) {
>             cr_vector_.push_back(CR_dist(rng));
>         }
>         return cr_vector_;
>     });
>
>     trial_idx++;
> }
>
> From what I can tell, the above code never runs in parallel (among other
> things, the CPU usage drops from 500% while running MKL functions down to
> 100%). Likewise, the simplistic matrix multiplication code using parallel
> algorithms also only uses 100% CPU.
>
> core::Matrix times_parunseq(core::Matrix &lhs, core::Matrix &rhs) {
>
>     if (lhs.Cols() != rhs.Rows())
>         throw std::runtime_error("Imcompatible Matrix dimensions");
>
>     core::Matrix m{lhs.Rows(), rhs.Cols()};
>     Col_Iterator out_iter(&m);
>
>     // Outermost-loop -- columns of lhs and output
>     hpx::parallel::for_loop_n_strided(
>         hpx::parallel::seq, 0, rhs.Cols(), rhs.Rows(), [&](auto out_col_idx) {
>
>             hpx::parallel::for_loop_n(
>                 hpx::parallel::seq, 0, lhs.Rows(), [&](auto out_row_idx) {
>
>                     m(out_row_idx, out_col_idx) =
>                         hpx::parallel::transform_reduce(
>                             hpx::parallel::par_vec,
>                             Row_Iterator(&lhs, {out_row_idx, 0}),
>                             Row_Iterator(&lhs, {out_row_idx, lhs.Cols()}),
>                             Col_Iterator(&rhs, {0, out_col_idx}), 0.0f,
>                             std::plus(),
>                             [&](const float &a, const float &b) { return a * b; });
>                 });
>         });
>     return m;
> }
>
> I've tried using seq, par, par_unseq for the 2 outer loops, but that did
> not make any difference in the performance. I understand that using
> parallel::execution::par and parallel::execution::par_unseq just means
> that the code *can* be run in parallel and/or vectorized. However, I
> cannot understand why the code does not actually run in parallel or using
> vectorization.
>
> The complete code I've been using is at the link below:
> https://github.com/ShmuelLevine/hpx_matrix/blob/master/matrix/matrix.cc
>
> Some insights would be greatly appreciated... this is a matter of
> considerable frustration to me...
>
> Thanks and best regards,
> Shmuel
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
