Michael,

Thanks for your detailed email.

The most important lesson from our experience is that when writing
applications using HPX (or any other task-based parallel library, for that
matter) it is of utmost importance to pay attention to the grain size of
the work.

We have shown that you can implement any type of parallelism (be it
fork-join, dataflow, task-based, general asynchronous, or continuation-based
parallelism) using a system built on top of a very fine-grained (task-based)
threading model. This, however, implies that, however small the overheads
imposed by those threads are, they will have a tremendous impact on the
overall parallel efficiency (and naturally on the achieved performance).

The overheads imposed by creating and running one HPX-thread are in the
range of 800 ns; for a full future<T> f = async(...), the overheads are in
the range of 2000-2500 ns. While those are fairly low (in fact we believe
they are almost as low as possible on today's mainstream architectures),
you still want to make sure that the useful work performed by one thread is
at least 100 times larger (in terms of execution time) than that. Only then
can you hope to effectively amortize the intrinsic overheads of thread
management (creation, scheduling, deletion, etc.).
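
To make this concrete, here is a minimal sketch of what that amortization
looks like in code (the numbers and names are made up for illustration, and
header names may differ slightly between HPX versions): instead of spawning
one HPX-thread per element, each task processes a chunk that is large
enough for its useful work to dwarf the task-management overhead.

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/async.hpp>
    #include <hpx/include/lcos.hpp>

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // stand-in for the real per-element work (assume a few nanoseconds each)
    double kernel(double x) { return x * x + 1.0; }

    // One HPX task per chunk of 'chunk_size' elements. If chunk_size is
    // chosen such that chunk_size * t(kernel) is ~100x the task overhead,
    // the thread-management costs become negligible.
    std::vector<double> apply_chunked(
        std::vector<double> const& data, std::size_t chunk_size)
    {
        std::vector<double> result(data.size());
        std::vector<hpx::future<void>> tasks;
        tasks.reserve(data.size() / chunk_size + 1);

        for (std::size_t begin = 0; begin < data.size(); begin += chunk_size)
        {
            std::size_t end = (std::min)(begin + chunk_size, data.size());

            // each task performs chunk_size invocations of the kernel
            tasks.push_back(hpx::async([&, begin, end]() {
                for (std::size_t i = begin; i != end; ++i)
                    result[i] = kernel(data[i]);
            }));
        }

        hpx::wait_all(tasks);    // wait for all chunks to finish
        return result;
    }

    int main()
    {
        std::vector<double> data(1000000, 2.0);
        // grain size: tune so the work per task amortizes the task overhead
        std::vector<double> result = apply_chunked(data, 10000);
        return 0;
    }

Spawning one hpx::async per element instead would pay the ~2000 ns overhead
a million times for work worth a few nanoseconds each - exactly the
situation the factor of 100 is meant to avoid.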

This number (a factor of 100) additionally depends very much on your
concrete application, the architecture used, etc. It also depends on factors
outside of your control, such as other applications running on the same
machine at the same time, the thermal regime of the machine, etc. So what
you want to be able to do in your application is to control how much work is
performed on a single thread - the grain size of the work. This in turn
means that you want to write your applications in a way that allows you to
control this grain size, either statically or through runtime-adaptive
techniques.
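
Both variants can be expressed directly with HPX's parallel algorithms
through executor parameters. The sketch below uses static_chunk_size and
auto_chunk_size roughly as they were spelled at the time of writing (recent
HPX releases have moved these names into other namespaces), so treat the
exact headers and namespaces as approximate:

    #include <hpx/hpx_main.hpp>
    #include <hpx/include/parallel_for_each.hpp>

    #include <vector>

    // stand-in for real per-element work
    double expensive_kernel(double x) { return x * x + 0.5 * x; }

    int main()
    {
        std::vector<double> data(1000000, 1.0);

        using namespace hpx::parallel;

        // static grain-size control: each HPX-thread processes 4096 elements
        for_each(par.with(static_chunk_size(4096)),
            data.begin(), data.end(),
            [](double& x) { x = expensive_kernel(x); });

        // runtime-adaptive grain-size control: auto_chunk_size measures the
        // first few iterations and derives a chunk size from their runtime
        for_each(par.with(auto_chunk_size()),
            data.begin(), data.end(),
            [](double& x) { x = expensive_kernel(x); });

        return 0;
    }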

Generally, the smaller the grain size of the work, the better the system
utilization, but also the higher the relative overheads. Conversely, if the
grain size becomes too large, your system utilization goes down again. So
you want to be able to find the sweet spot that makes your application run
as fast as possible.

The examples we provide (1d_stencil) demonstrate how this can be done.

Another general aspect is that you want to utilize your compute resources in
the best possible way by removing barriers from your execution flow as much
as possible. The dataflow techniques in those examples are used for exactly
that - letting the execution in a particular part of your code go ahead as
far as your data dependencies allow, without imposing barriers that force it
to wait for things which are irrelevant for the next step of your algorithm
(like the barriers enforced by conventional programming models such as
OpenMP or MPI). We often use the general technique of over-decomposition in
order to keep the system busy even if parts of the algorithm have to wait
for some data dependency to be fulfilled. Over-decomposition means having
more data items to work on than you have compute resources available. This
way, even if some computation can't go forward at a certain point in time,
some other data item is definitely ready to have some work performed on
it...
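
To illustrate the combination of both ideas, here is a stripped-down sketch
(not taken from the 1d_stencil code itself, all names are made up): the
problem is over-decomposed into many more partitions than cores, and each
generation of partitions is chained to the previous one with hpx::dataflow,
so there is no global barrier between time steps. Header names and
hpx::util::unwrapped (spelled hpx::unwrapping in recent HPX releases) may
need adjusting for your HPX version:

    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>

    #include <cstddef>
    #include <utility>
    #include <vector>

    // hypothetical partition of the problem
    struct partition { std::vector<double> data; };

    // hypothetical kernel: next state of one partition from itself and its
    // two neighbours (the real work would go here)
    partition update(partition const& left, partition const& middle,
        partition const& right)
    {
        partition result = middle;
        result.data.front() += left.data.back();
        result.data.back() += right.data.front();
        return result;
    }

    int main()
    {
        // over-decomposition: many more partitions than cores, so some
        // partition is almost always ready to be worked on
        std::size_t const num_partitions = 64 * hpx::get_os_thread_count();

        std::vector<hpx::shared_future<partition>> current(num_partitions);
        for (auto& f : current)
            f = hpx::make_ready_future(
                partition{std::vector<double>(1024, 1.0)});

        for (int step = 0; step != 100; ++step)
        {
            std::vector<hpx::shared_future<partition>> next(num_partitions);
            for (std::size_t i = 0; i != num_partitions; ++i)
            {
                std::size_t left = (i + num_partitions - 1) % num_partitions;
                std::size_t right = (i + 1) % num_partitions;

                // no barrier between time steps: partition i of the next
                // generation starts as soon as its three inputs are ready
                next[i] = hpx::dataflow(hpx::launch::async,
                    hpx::util::unwrapped(&update),
                    current[left], current[i], current[right]);
            }
            current = std::move(next);
        }

        hpx::wait_all(current);    // only wait at the very end
        return 0;
    }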

This is what HPX can do for you - keep the ball rolling somewhere, even if
the current task can't proceed.

I'm not sure if this is the answer you were looking for, but here you go...

HTH
Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu


> -----Original Message-----
> From: [email protected] [mailto:hpx-users-
> [email protected]] On Behalf Of Michael Levine
> Sent: Friday, June 10, 2016 11:07 AM
> To: [email protected]
> Subject: [hpx-users] Looking for guidance / feedback on using HPX for
> parametric optimization with a genetic-type algorithm
> 
> Hi,
> 
> I’ve been playing around a bit with HPX and am impressed.  That said, I’m
> having a bit of trouble wrapping my head around this model of thinking and
> am hoping someone can help shed some light on it for me.  Firstly, I’d
> appreciate feedback on the general architecture / approach that I am
> looking to use – there are going to be, no doubt, a number of ways to
> implement what I have in mind, and I’d appreciate some guidance from the
> HPX community as to the preferred approach.  I also have some specific
> implementation questions which I will post to the mailing list separately
> to allow them to be properly indexed.
> 
> As a pet project for experimenting with HPX, I’ve been looking to work on
> genetic-algorithm-based optimization.  In particular, I’ve been using
> Differential Evolution as the specific genetic algorithm, although I
> would, ultimately, like to ensure that whatever code I end up writing is
> as flexible as possible.
> 
> Having read (and repeatedly re-read) the futurization example in the
> documentation, I think I have a rough idea of the structure of such a
> program, but, as I said before, I’m not sure whether that idea will work
> or if there is a better way to accomplish this.
> Here’s what I’ve got so far:
> 
> - Model is an abstract base class providing an API for a model to
> parameterize
> - Model_Driver is an HPX component with a 1:1 mapping with localities
> (i.e. each locality has a driver).  The driver is constructed with the
> Model as a parameter.  The main idea behind having this Model_Driver
> object is to allow static data to be locally cached on a locality so that
> it can be used for subsequent iterations without having to recalculate
> it.  The example I had in mind here was a transformation of time series
> data into a matrix of inputs (such that each column represents a time
> step), although there are likely many other ways to achieve the same
> thing.  The Model_Driver constructor will take the following parameters: a
> concrete Model type, a cost function object, and a vector of input data.
> - Model_Fitter controls the optimization process.   It will contain a
> vector of parameters –  each element of the vector represents one possible
> solution from the population of solutions.
> 
> In very broad strokes, the idea is like this:
> 
> class Model_Fitter {
> 
>     using parameter_type = std::vector<double>;
>     using population_type = std::vector<parameter_type>;
>     using population_set_type =
>         std::vector<hpx::shared_future<population_type>>;
> 
>     std::vector<Model_Driver> drivers;
> 
>     static parameter_type Fit_Model(Model const&, Cost_Function const&,
>         std::vector<double> const& input_data, int population_size,
>         /* some struct for optimizer options */)
>     {
>         population_set_type population_set_;
>         population_set_.reserve(2);
>         std::vector<hpx::shared_future<std::vector<double>>>
>             parameter_costs_;
>         parameter_costs_.reserve(2);
> 
>         for (auto& population : population_set_) {
>             population.reserve(population_size);
>             random_initialize(population);
>         }
> 
>         Initialize_drivers();
> 
>         // also initialize the parameter costs to max double
> 
>         while (!converged()) {
>             Prepare_Next_Generation(++iteration_idx);
>             auto& current_parameters = population_set_[iteration_idx % 2];
>             auto& parent_parameters = population_set_[(iteration_idx + 1) % 2];
>             auto& current_costs = parameter_costs_[iteration_idx % 2];
>             auto& parent_costs = parameter_costs_[(iteration_idx + 1) % 2];
> 
>             int locality_idx {0};    // for initial load-balancing
> 
>             for (int pop_idx = 0; pop_idx < population_size; ++pop_idx) {
>                 // call static function on a particular driver object;
>                 // assign calculations to different drivers to balance load
>                 auto Evaluate_Model_Op = hpx::util::bind(
>                     &Model_Driver::Evaluate_Model_With_Parameters,
>                     drivers[locality_idx], hpx::util::placeholders::_1);
> 
>                 current_parameters[pop_idx] = hpx::dataflow(
>                     hpx::launch::async, [&]() -> parameter_type {
>                         // get the fitness / cost of a given set of parameters
>                         hpx::shared_future<double> fitness =
>                             Evaluate_Model_Op(current_parameters[pop_idx]);
> 
>                         // if this set of parameters is more fit than the
>                         // parent, keep the child; otherwise keep the parent
>                         if (fitness < parent_costs) {
>                             current_costs = fitness;
>                             return current_parameters;
>                         }
>                         else {
>                             current_costs = parent_costs;
>                             return parent_parameters;
>                         }
>                     });
> 
>                 ++locality_idx;
>                 locality_idx %= num_localities;
>             }
>         }
>     }
> };
> 
> I haven’t tried compiling the above code, so there are undoubtedly various
> errors – some small and some large.  But I hope that it generally conveys
> my thought process for handling this.  Essentially:
> - Have 2 copies of the population of parameters.  This would be analogous
> to the std::vector<space> in stepper in the 1d_stencil example code, and,
> as I understand it, is necessary for the dataflow to work.
> - At this time, the parameters are all managed on the main locality.  A
> new set of possible solutions is generated on the main locality and passed
> to the driver object on each individual locality, where that set of
> parameters is used to evaluate the fitness of the model and return some
> fitness value.  The ‘fitter’ of the parent and child survives to the next
> generation.  Obviously, a different type of genetic algorithm would have a
> different strategy.
> - The function Evaluate_Model_Op is, potentially, a fairly expensive
> operation.  Although the parameters are distributed uniformly at the
> beginning of each iteration, I believe that hpx uses work stealing to
> allow a faster locality to take on a larger percentage of the evaluations.
> - As I see it, parameter_type does not necessarily need to be an HPX
> Component (the code above is, obviously, not a component), as the
> distribution of work is accomplished with the driver objects.  That being
> said, however, I can see uses for this code where there may be a large
> number of parameters – might it be possible to hide network i/o latency by
> having the parameters as a component with a Get_Data() action returning a
> future to a set of parameters?
> 
> Hopefully my intention with the above code snippets and commentary is
> fairly clear.  As mentioned before, I would truly appreciate any help /
> input that anyone can offer.  I’d especially appreciate feedback as to
> whether I am approaching the problem in a convoluted or completely wrong
> manner.
> 
> 
> Thanks and best regards,
> Shmuel


_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
