Hi,
I'm hoping that someone can give me some help with an aspect of my design,
or at least provide some feedback on my current solution.
To start off, a bit of context on the problem (because the subject line for
this message isn't very clear).
* My program uses a set of HPX components instantiated on the nodes of
a cluster to perform hyperparameter optimization. As of now, the program
does not use a fixed number of iterations as its convergence criterion, so
the number of iterations is dynamic.
* Part of the component's optimizer loop compiles some information
about each iteration and passes it through a channel to the controlling node
- this is meant to provide some UI feedback to the user showing the
current state of the optimization process.
* I believe that the way that I am handling this should work, but
something about my approach just doesn't feel right.
Now, for the (simplified) details:
The distributed component class is called Distributed_Fitter. The public
interface of Distributed_Fitter mostly consists of a single function (action)
called Fit_Model, which returns an hpx::util::optional<Return_Type>.
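For concreteness, here is a very rough sketch of the component's shape. The
placeholder types, the stub optimizer internals, the registration names, and
the channel wiring are all simplified stand-ins rather than my actual code:

#include <hpx/include/components.hpp>
#include <hpx/include/actions.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/util/optional.hpp>

// Placeholder stand-ins for my real types (both need to be serializable):
struct Stats_Type
{
    double objective = 0.0;
    template <typename Archive>
    void serialize(Archive& ar, unsigned) { ar & objective; }
};

struct Return_Type
{
    double best_objective = 0.0;
    template <typename Archive>
    void serialize(Archive& ar, unsigned) { ar & best_objective; }
};

struct Distributed_Fitter
  : hpx::components::component_base<Distributed_Fitter>
{
    // Runs the local optimizer loop; every iteration pushes a Stats_Type
    // into stats_channel_, using the iteration index as the channel
    // generation, so that the controlling node can display progress.
    hpx::util::optional<Return_Type> Fit_Model()
    {
        bool converged = false;
        for (std::size_t iteration = 1; !converged; ++iteration)
        {
            converged = do_one_iteration();
            stats_channel_.set(current_stats(), iteration);
        }
        // In the real code only the node that actually converged returns a
        // populated optional; the others return an empty one.
        return hpx::util::optional<Return_Type>(Return_Type{});
    }

    HPX_DEFINE_COMPONENT_ACTION(Distributed_Fitter, Fit_Model, Fit_Model_Action);

private:
    // Stubs standing in for the real optimizer internals:
    bool do_one_iteration() { return true; }
    Stats_Type current_stats() const { return Stats_Type{}; }

    // Registered so that the controlling node can connect to it; the wiring
    // is omitted here.
    hpx::lcos::channel<Stats_Type> stats_channel_;
};

typedef hpx::components::component<Distributed_Fitter> Fitter_Component;
HPX_REGISTER_COMPONENT(Fitter_Component, distributed_fitter_component);
HPX_REGISTER_ACTION(Distributed_Fitter::Fit_Model_Action,
    Distributed_Fitter_Fit_Model_Action);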
The primary node instantiates a class called Model_Fitter. The purpose of
this class is to (a) invoke the Fit_Model action on each of the remote
components, and (b) read the UI data from an lcos::channel between each of
the Distributed_Fitter components and this class, and display that data.
This is where I run into what I view as the complication.
Distributed_Fitter::Fit_Model() runs without any interaction with
Model_Fitter. As a result, Model_Fitter::Fit_Model() basically looks like
this:
std::vector<hpx::future<hpx::util::optional<Return_Type>>> fitter_results;
for (auto& fitter_instance : distributed_fitter_nodes)
    fitter_results.push_back(fitter_instance->Fit_Model());  // asynchronous invocation

std::vector<std::vector<hpx::future<Stats_Type>>> iteration_stats_futures;

while (!any_nodes_converged)
{
    std::vector<hpx::future<Stats_Type>> this_iteration_stats_futures;
    for (auto& channel : node_channels)
        this_iteration_stats_futures.push_back(
            channel.get(hpx::launch::async, current_iteration_idx));
    iteration_stats_futures.push_back(std::move(this_iteration_stats_futures));

    // Each node performs its iterations independently of the other nodes -
    // the nodes do not synchronize at each iteration.
    // Since we know that the optimization has not converged as of
    // current_iteration_idx (from the loop condition), it is safe to wait on
    // all futures for *this* iteration.
    auto iteration_stats =
        hpx::when_all(iteration_stats_futures[current_iteration_idx]);

    /* ... a series of continuations to handle the new stats - basically store
       them on disk and display them to the console ... */
    /* ... check all nodes for convergence via an action; update the
       any_nodes_converged variable ... */

    ++current_iteration_idx;
} // while (!any_nodes_converged)

// Synchronize the Distributed_Fitter components, i.e. ensure that all nodes
// complete their final iteration. After that iteration they check for global
// convergence before starting the next one, and we know at this point that
// one of the nodes has converged since we exited the while loop.
hpx::wait_all(fitter_results);

/* ... deal with the fitter results; only the converged node returns a valid
   result, the others return a default-constructed hpx::util::optional ... */
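For completeness, the distributed_fitter_nodes and node_channels used above
get set up along these lines (again heavily simplified - the channel naming
scheme and the way I hold the component references are placeholders, not my
real code):

// (assumes <hpx/include/components.hpp> and <hpx/include/lcos.hpp>)
// One Distributed_Fitter per locality; each component registers its stats
// channel under a name like "/fitter_stats/<i>" (placeholder naming scheme)
// and the controlling node connects to it.
std::vector<hpx::id_type> localities = hpx::find_all_localities();

std::vector<hpx::id_type> distributed_fitter_nodes;
std::vector<hpx::lcos::channel<Stats_Type>> node_channels;

for (std::size_t i = 0; i != localities.size(); ++i)
{
    distributed_fitter_nodes.push_back(
        hpx::new_<Distributed_Fitter>(localities[i]).get());

    hpx::lcos::channel<Stats_Type> stats;
    stats.connect_to("/fitter_stats/" + std::to_string(i));
    node_channels.push_back(stats);
}

// "fitter_instance->Fit_Model()" in the loop above then amounts to an
// asynchronous action invocation, roughly:
//     hpx::async<Distributed_Fitter::Fit_Model_Action>(distributed_fitter_nodes[i]);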
I do not like the way that I am handling the UI part of
Model_Fitter::Fit_Model(), but I am not sure of the alternatives. As I
understand it, my current design is necessitated by the fact that the nodes
are not expected to be homogeneous, so the rate at which each node completes
an iteration may differ, and synchronizing the iterations would [probably]
cause a major performance problem (unless, I think, I add some complex
load-balancing). Therefore, I simply display the stats for each iteration
once all nodes have completed that particular iteration.
I also think that I need to be careful about reading from a channel at a
particular iteration index, since it is quite possible that the future will
never become ready for one or more of the nodes. As an exaggerated example,
if node 0 converges after completing 250 iterations while node 1 has only
finished 100 iterations, calling channel.get on node 1's channel with an
index greater than 100 produces a future that will never become ready.
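Just to illustrate what I mean, here is a tiny self-contained toy example
(assuming the channel is never closed; the numbers are arbitrary):

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <hpx/include/runtime.hpp>
#include <cstddef>
#include <iostream>

int main()
{
    hpx::lcos::channel<int> c(hpx::find_here());

    // the "node" only ever produces generations 1..100
    for (std::size_t gen = 1; gen <= 100; ++gen)
        c.set(static_cast<int>(gen), gen);

    // asking for generation 150 hands back a future that no set() will ever
    // satisfy - calling .get() on it would block forever
    hpx::future<int> never_ready = c.get(hpx::launch::async, 150);
    std::cout << std::boolalpha << never_ready.is_ready() << std::endl; // false

    return 0;
}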
I think that the Fit_Model code above is robust, based on the fact that all
nodes always finish the iteration in progress even when one of the nodes
converges. Nevertheless, something about it leaves me a bit uncomfortable and
wondering if there might be a better way to handle this.
If it makes any difference, I can provide more details about
Distributed_Fitter - I didn't think they were relevant here.
I would really, really appreciate any input, feedback, comments, or
criticisms of any kind. This is a part-time project of mine, but it has been
consuming an inordinate amount of my attention lately.
Thanks in advance,
Shmuel Levine
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users