Re: [hpx-users] Troubleshooting (lack of) parallel execution

Hartmut Kaiser Mon, 01 May 2017 09:05:46 -0700

Shmuel,

> Thanks for the quick reply. It appears that I was not completely clear in
> my original question. Specifically, I seem to have the same problems
> regardless of whether or not I'm using MKL. The separate matrix
> multiplication test code that I wrote was for the purposes of determining
> whether or not MKL was the cause of these issues.
> Based on cpu usage and on timing of each of the three cases, I'm still
> finding that:
> 1) cpu usage is not more than 100%
> 2) the sequential version of the multiplication function runs faster than
> the parallel and vectorized versions.
> As mentioned, changing the hpx:threads argument only adds overhead and
> makes the code run much slower.


Could you give me a small test code which reproduces the problem, please?
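
For illustration only, here is a rough sketch of the kind of small, self-contained test that can help narrow this down. It is an assumption-laden stand-in, not code from the thread: it deliberately uses plain standard C++ (`std::thread`) rather than HPX or MKL, and the names `Matrix`, `multiply_rows`, `multiply`, `Timing` and `time_multiply` are hypothetical. It multiplies two square matrices with a chosen number of threads and reports both wall-clock time and process CPU time, so the ratio of the two can be compared against the "not more than 100%" CPU usage described above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <ctime>
#include <thread>
#include <vector>

// Hypothetical helper names; plain standard C++, no HPX or MKL involved.
using Matrix = std::vector<float>;  // row-major, n x n

// Compute rows [row_begin, row_end) of c = a * b.
void multiply_rows(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n,
                   std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Split the output rows into contiguous chunks, one std::thread per chunk.
Matrix multiply(const Matrix& a, const Matrix& b, std::size_t n,
                unsigned num_threads) {
    assert(num_threads >= 1);
    Matrix c(n * n, 0.0f);
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min<std::size_t>(n, t * chunk);
        const std::size_t end = std::min<std::size_t>(n, begin + chunk);
        if (begin == end) continue;  // more threads than rows
        workers.emplace_back(
            [&a, &b, &c, n, begin, end] { multiply_rows(a, b, c, n, begin, end); });
    }
    for (auto& w : workers) w.join();
    return c;
}

struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // std::clock() based; on POSIX systems this is
                          // process CPU time summed over all threads
    float checksum;       // keeps the result observable
};

Timing time_multiply(const Matrix& a, const Matrix& b, std::size_t n,
                     unsigned num_threads) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    const Matrix c = multiply(a, b, n, num_threads);
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(wall_end - wall_start).count(),
            static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC, c[0]};
}
```

Calling `time_multiply` with 1 thread and then with several, and comparing `cpu_seconds / wall_seconds`, gives a rough picture. If that ratio stays near 1 even with several threads on a large matrix, the process may be confined to a single core by something outside the application code (an affinity mask, for example). That is only one possible explanation, and the sketch says nothing about how HPX itself schedules work.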

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu


> Thanks
> From: Hartmut Kaiser
> Sent: Monday, May 1, 7:40 AM
> Subject: Re: [hpx-users] Troubleshooting (lack of) parallel execution
> To: [email protected]
>
> Shmuel,
>
> > I'm looking for some help in understanding why my code does not
> > appear to be executing in parallel with the HPX system.
>
> The only reason I could think of for the strange behavior you're seeing
> would be that you're using the parallel version of MKL. MKL is
> parallelized using openmp and there is no way (AFAIK) to tell it to just
> use part of the machine. So it will try to use all of the cores of the
> node you're running on. That in turn interferes with HPX's way of binding
> its worker-threads to the cores itself. We have had good results when
> using MKL with HPX, but only if you link with the sequential
> (non-parallel) version of MKL and leave all the parallelization to HPX
> (by scheduling more than one MKL task at the same time, if necessary).
>
> I have no experience with VML, but I'd assume it's the same issue.
>
> HTH
> Regards Hartmut
> ---------------
> http://boost-spirit.com
> http://stellar.cct.lsu.edu
>
> > I've first noticed the issue while working on my main codebase, in
> > which I've been trying to implement a genetic-algorithm-based optimizer
> > for non-linear systems. Since that code (at the present time) uses
> > Intel MKL (BLAS level 3 library functions) and VML (vector math
> > library), in conjunction with HPX futures, dataflow, etc., I wasn't
> > sure if there was some problem caused by OpenMP or something similar,
> > which might have prevented the code from running in parallel.
> >
> > I then wrote a simpler test program using only HPX parallel algorithms
> > to implement basic matrix-matrix multiplication. I found the exact same
> > result in both cases - my program does not appear to be running any of
> > the concurrent code -- neither in the case of my original program using
> > futures, continuations, and dataflow lcos, nor in the simplified matrix
> > code.
> >
> > I've tried using different options for --hpx:threads, but when this
> > number is greater than 1, I've found that the overhead of thread
> > creation and scheduling is exceedingly high and slows down the entire
> > program execution. I'm not sure if that is typical behaviour -- I have
> > tried to ensure that the amount of computation within a given
> > asynchronous function call is fairly substantial so that the real work
> > is far in excess of any overhead (although I may have under-estimated).
> > Typically, in the case of my code, the concurrency is at the
> > genetic-algorithm 'population' level - for example, the following code
> > snippet is where I generate random numbers for the crossover step of
> > differential evolution. fitter_state_ is a boost::shared_ptr. (The
> > random number generator engines are set up elsewhere in the code and
> > there is one for each trial vector, to ensure that the code is
> > thread-safe). I realize that the code below does not need to use
> > dataflow, although I'm skeptical that this would be the cause for the
> > code not running in parallel.
> >
> > size_t trial_idx = 0;
> > CR_population_type &CR_vector_current =
> >     fitter_state_->crossover_vector_set_[fitter_state_->Current_Index()];
> >
> > for (future_type &crossover_vector : CR_vector_current)
> > {
> >   crossover_vector = hpx::dataflow(hpx::launch::async, [=]() {
> >     auto &rng = fitter_state_->cr_RNGs[trial_idx];
> >     // cr_vector is of type std::vector
> >     modeling::model_fitter_aliases::CR_vector_type cr_vector_;
> >     cr_vector_.reserve(total_number_of_parameters_);
> >
> >     std::uniform_int_distribution CR_dist(
> >         0, fitter_state_->crossover_range);
> >
> >     for (int param_idx = 0; param_idx < total_number_of_parameters_;
> >          ++param_idx) {
> >       cr_vector_.push_back(CR_dist(rng));
> >     }
> >     return cr_vector_;
> >   });
> >
> >   trial_idx++;
> > }
> >
> > From what I can tell, the above code never runs in parallel (among
> > other things, the CPU usage drops from 500% while running MKL functions
> > down to 100%). Likewise, the simplistic matrix multiplication code
> > using parallel algorithms also only uses 100% CPU.
> >
> > core::Matrix times_parunseq(core::Matrix &lhs, core::Matrix &rhs) {
> >
> >   if (lhs.Cols() != rhs.Rows())
> >     throw std::runtime_error("Incompatible Matrix dimensions");
> >
> >   core::Matrix m{lhs.Rows(), rhs.Cols()};
> >   Col_Iterator out_iter(&m);
> >
> >   // Outermost-loop -- columns of lhs and output
> >   hpx::parallel::for_loop_n_strided(
> >       hpx::parallel::seq, 0, rhs.Cols(), rhs.Rows(),
> >       [&](auto out_col_idx) {
> >
> >         hpx::parallel::for_loop_n(
> >             hpx::parallel::seq, 0, lhs.Rows(), [&](auto out_row_idx) {
> >
> >               m(out_row_idx, out_col_idx) =
> >                   hpx::parallel::transform_reduce(
> >                       hpx::parallel::par_vec,
> >                       Row_Iterator(&lhs, {out_row_idx, 0}),
> >                       Row_Iterator(&lhs, {out_row_idx, lhs.Cols()}),
> >                       Col_Iterator(&rhs, {0, out_col_idx}), 0.0f,
> >                       std::plus(),
> >                       [&](const float &a, const float &b) { return a * b; });
> >             });
> >
> >       });
> >   return m;
> > }
> >
> > I've tried using seq, par, par_unseq for the 2 outer loops, but that
> > did not make any difference in the performance. I understand that using
> > parallel::execution::par and parallel::execution::par_unseq just means
> > that the code *can* be run in parallel and/or vectorized. However, I
> > cannot understand why the code does not actually run in parallel or
> > using vectorization.
> >
> > The complete code I've been using is at the link below:
> > https://github.com/ShmuelLevine/hpx_matrix/blob/master/matrix/matrix.cc
> >
> > Some insights would be greatly appreciated... this is a matter of
> > considerable frustration to me...
> >
> > Thanks and best regards,
> > Shmuel
>
> _______________________________________________
> hpx-users mailing list
> [email protected]
> https://mail.cct.lsu.edu/mailman/listinfo/hpx-users

_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
