Hi,

I'm looking for some help in understanding why my code does not appear
to execute in parallel under HPX.

I first noticed the issue while working on my main codebase, in which
I've been trying to implement a genetic-algorithm-based optimizer for
non-linear systems.  Since that code currently uses Intel MKL (level-3
BLAS functions) and VML (the vector math library) in conjunction with
HPX futures, dataflow, etc., I wasn't sure whether OpenMP or something
similar was interfering and preventing the code from running in
parallel.
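(In case it matters: one way I know of to take MKL's own OpenMP
threading out of the equation while testing is to force it sequential.
A minimal sketch, assuming the standard MKL service functions; setting
MKL_NUM_THREADS=1 in the environment should be equivalent:)

#include <mkl.h>

// Keep MKL sequential during these tests so its OpenMP pool cannot
// oversubscribe the cores the HPX worker threads are running on.
mkl_set_dynamic(0);      // don't let MKL adjust thread counts on its own
mkl_set_num_threads(1);  // one thread per BLAS/VML call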

I then wrote a simpler test program that uses only HPX parallel
algorithms to implement basic matrix-matrix multiplication.  I see
exactly the same result in both cases: the program does not appear to
run any of the concurrent code, neither in my original program using
futures, continuations, and dataflow LCOs, nor in the simplified
matrix code.

I've tried different values for --hpx:threads, but whenever this
number is greater than 1, I've found that the overhead of thread
creation and scheduling is so high that it slows down the entire
program execution.  I'm not sure if that is typical behaviour -- I
have tried to ensure that the amount of computation within a given
asynchronous function call is substantial enough that the real work
far exceeds any overhead (although I may have underestimated).

Typically, in my code, the concurrency is at the genetic-algorithm
'population' level.  For example, the following code snippet is where
I generate random numbers for the crossover step of differential
evolution.  fitter_state_ is a boost::shared_ptr, and the random
number generator engines are set up elsewhere in the code, one per
trial vector, so that the code is thread-safe.  I realize that the
code below does not need to use dataflow, although I'm skeptical that
this would be the reason the code isn't running in parallel.

  size_t trial_idx = 0;
  CR_population_type &CR_vector_current =
      fitter_state_->crossover_vector_set_[fitter_state_->Current_Index()];

  // Launch one asynchronous task per trial vector; each task fills a
  // crossover vector with random draws from its own RNG engine.
  for (future_type<CR_vector_type> &crossover_vector : CR_vector_current) {
    crossover_vector = hpx::dataflow(hpx::launch::async, [=]() {
      auto &rng = fitter_state_->cr_RNGs[trial_idx];

      // cr_vector_ is of type std::vector<int>
      modeling::model_fitter_aliases::CR_vector_type cr_vector_;
      cr_vector_.reserve(total_number_of_parameters_);

      std::uniform_int_distribution<int> CR_dist(
          0, fitter_state_->crossover_range);

      for (int param_idx = 0; param_idx < total_number_of_parameters_;
           ++param_idx) {
        cr_vector_.push_back(CR_dist(rng));
      }
      return cr_vector_;
    });

    trial_idx++;
  }
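
One check I've been meaning to add here: wait on the whole population
and log which worker each task actually ran on.  A sketch, assuming
CR_population_type is a std::vector of futures and that
hpx::get_worker_thread_num() is the right query:

  // Block until every crossover vector has been generated.
  hpx::wait_all(CR_vector_current);

and, inside the dataflow lambda above:

      // If this always reports worker 0, the tasks are all being
      // serialized onto a single worker thread.
      std::cout << "task " << trial_idx << " ran on worker "
                << hpx::get_worker_thread_num() << std::endl;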


From what I can tell, the crossover code above never runs in parallel
(among other things, CPU usage drops from 500% while the MKL functions
are running down to 100%).  Likewise, the simplistic matrix
multiplication code below, using parallel algorithms, also only uses
100% CPU.

core::Matrix times_parunseq(core::Matrix &lhs, core::Matrix &rhs) {

  if (lhs.Cols() != rhs.Rows())
    throw std::runtime_error("Incompatible Matrix dimensions");

  core::Matrix m{lhs.Rows(), rhs.Cols()};
  Col_Iterator out_iter(&m);

  // Outermost loop -- columns of the output
  hpx::parallel::for_loop_n_strided(
      hpx::parallel::seq, 0, rhs.Cols(), rhs.Rows(), [&](auto out_col_idx) {
        // Middle loop -- rows of the output
        hpx::parallel::for_loop_n(
            hpx::parallel::seq, 0, lhs.Rows(), [&](auto out_row_idx) {
              // Innermost: dot product of one lhs row with one rhs column
              m(out_row_idx, out_col_idx) = hpx::parallel::transform_reduce(
                  hpx::parallel::par_vec, Row_Iterator(&lhs, {out_row_idx, 0}),
                  Row_Iterator(&lhs, {out_row_idx, lhs.Cols()}),
                  Col_Iterator(&rhs, {0, out_col_idx}), 0.0f,
                  std::plus<float>(),
                  [&](const float &a, const float &b) { return a * b; });
            });
      });
  return m;
}


I've tried seq, par, and par_unseq for the two outer loops, but that
did not make any difference in performance.  I understand that using
parallel::execution::par and parallel::execution::par_unseq only means
that the code *can* be run in parallel and/or vectorized.  However, I
cannot understand why the code never actually runs in parallel or gets
vectorized.
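
For reference, the par variant I tried looked essentially like this --
only the two outer loop policies changed from seq to par, with the
inner transform_reduce left as is:

core::Matrix times_par(core::Matrix &lhs, core::Matrix &rhs) {

  if (lhs.Cols() != rhs.Rows())
    throw std::runtime_error("Incompatible Matrix dimensions");

  core::Matrix m{lhs.Rows(), rhs.Cols()};

  hpx::parallel::for_loop_n_strided(
      hpx::parallel::par, 0, rhs.Cols(), rhs.Rows(), [&](auto out_col_idx) {
        hpx::parallel::for_loop_n(
            hpx::parallel::par, 0, lhs.Rows(), [&](auto out_row_idx) {
              m(out_row_idx, out_col_idx) = hpx::parallel::transform_reduce(
                  hpx::parallel::par_vec, Row_Iterator(&lhs, {out_row_idx, 0}),
                  Row_Iterator(&lhs, {out_row_idx, lhs.Cols()}),
                  Col_Iterator(&rhs, {0, out_col_idx}), 0.0f,
                  std::plus<float>(),
                  [&](const float &a, const float &b) { return a * b; });
            });
      });
  return m;
}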

The complete code I've been using is at the link below:

https://github.com/ShmuelLevine/hpx_matrix/blob/master/matrix/matrix.cc
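
In case it's useful, this is the sort of minimal check I've been
running to confirm how many worker threads the runtime actually starts
(a sketch -- I'm assuming hpx::get_os_thread_count() and
hpx::get_worker_thread_num() are the right queries):

#include <hpx/hpx_main.hpp>
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/include/runtime.hpp>

#include <cstddef>
#include <iostream>
#include <mutex>
#include <set>
#include <vector>

int main() {
  std::mutex mtx;
  std::set<std::size_t> workers_seen;

  // Spawn many small tasks and record which worker ran each one.
  std::vector<hpx::future<void>> tasks;
  for (int i = 0; i != 100; ++i) {
    tasks.push_back(hpx::async([&] {
      std::lock_guard<std::mutex> lock(mtx);
      workers_seen.insert(hpx::get_worker_thread_num());
    }));
  }
  hpx::wait_all(tasks);

  // With --hpx:threads=N (N > 1) I'd expect more than one distinct worker.
  std::cout << "OS threads: " << hpx::get_os_thread_count() << "\n"
            << "distinct workers used: " << workers_seen.size() << "\n";
  return 0;
}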

Some insights would be greatly appreciated... this is a matter of
considerable frustration to me...

Thanks and best regards,
Shmuel