Hi All,
I've finally found a bit of time once again to work on my hobby project
with HPX. The long break gave me a fresh perspective on my own code,
and it occurred to me that it has some serious issues with memory
management. I'm hoping someone can help provide me with some better
insight into how best to handle memory management while working in a
distributed app. In particular, I would greatly appreciate some
specific guidance on how to address the issue in my own code, since
I'm at a bit of a loss here.
At a high level (and obviously simplified), my code can be summarized as
follows:
- I'm trying to implement a genetic-algorithm-based optimization system,
intended to optimize parameters for large numerical models. The
matrices will likely contain millions of elements.
- The matrix class itself is implemented using
hpx::serialization::serialize_buffer<T> as the backing storage type.
- Right now, I've written an HPX component which I've called
Model_Driver. My intent is to instantiate Model_Driver once on each
locality, with a [type-erased] Model class passed as an argument for
dependency injection.
- I intend to manage the parameter matrices on the main locality.
Model_Driver has an action called Evaluate_Model_With_Parameters(Matrix
m), which returns a future<float> representing the fitness of a given
set of parameters. The result is returned to the main locality, which
performs selection to determine the 'winning' parameters and 'breeds'
the next generation.
- I've been using dataflow to handle the optimization process, including
selection, generating the next generation, etc. (A rough sketch of
these pieces follows this list.)
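
For concreteness, here is roughly the shape of the pieces described
above. The names Matrix, Model_Driver, and
Evaluate_Model_With_Parameters come from my description; everything
else (the Model interface, constructor shapes, header choices, which
may vary between HPX versions) is hypothetical, and the component
registration macros are omitted:

#include <hpx/include/components.hpp>
#include <hpx/include/serialization.hpp>

#include <cstddef>
#include <memory>

// Matrix backed by serialize_buffer<float>, so the payload can be
// shipped between localities without an extra copy.
class Matrix {
public:
    Matrix() = default;
    Matrix(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), data_(rows * cols) {}

    float* data() { return data_.data(); }

    template <typename Archive>
    void serialize(Archive& ar, unsigned) { ar & rows_ & cols_ & data_; }

private:
    std::size_t rows_ = 0, cols_ = 0;
    hpx::serialization::serialize_buffer<float> data_;
};

// Hypothetical type-erased model interface.
struct Model {
    virtual ~Model() = default;
    virtual float evaluate(Matrix const& m) = 0;
};

// One instance per locality; the action evaluates one parameter set.
struct Model_Driver : hpx::components::component_base<Model_Driver> {
    explicit Model_Driver(std::shared_ptr<Model> model = nullptr)
      : model_(std::move(model)) {}

    float Evaluate_Model_With_Parameters(Matrix m)
    {
        return model_->evaluate(m);
    }
    HPX_DEFINE_COMPONENT_ACTION(Model_Driver,
        Evaluate_Model_With_Parameters);

    std::shared_ptr<Model> model_;  // type-erased model, injected
};

The future<float> results of this action are what get chained through
dataflow on the main locality for selection and breeding.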
In general, a large number of Matrix objects are created and destroyed
-- enough that a custom allocator to manage the allocation/deallocation
of memory is essentially a necessity in this program.
My first, naive attempt (and currently all that I've done) is a
Matrix_Allocator class which manages a memory pool. [1] The free_list
is a static object in the allocator class, and the allocate and
deallocate functions are static functions. Similarly, the mutex is also
a static member of the allocator class.
The obvious problem with this is that, although it should work fine on
a single locality, it is clearly going to cause segmentation faults in
a distributed app. From my understanding of the serialization code in
HPX, transferring a Matrix from the main locality to a remote locality
to calculate the model fitness does not go through the Matrix allocator
-- that allocation is handled by the serialization code -- but all
other constructions/destructions of Matrix objects will be a problem.
The most obvious workaround that comes to mind would be changing the
free_list (and mutex) into a std::map<std::uint32_t, free_list_type>
(and std::map<std::uint32_t, mutex_type>), so that each locality has
its own free_list and mutex. Something about this seems wrong to me,
though -- it tightly couples the allocator to the HPX runtime, since
the allocator must call hpx::get_locality_id() to index the appropriate
free_list.
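
Concretely, that workaround would look something like the following
sketch (the allocation logic itself is elided, and note that the maps
would also need guarding on first access, since operator[] mutates
them; the umbrella header is a stand-in, as header locations vary
between HPX versions):

#include <hpx/hpx.hpp>

#include <cstdint>
#include <map>
#include <mutex>
#include <stack>

class Matrix_Allocator {
public:
    using mutex_type = hpx::lcos::local::spinlock;
    using free_list_type = std::map<int64_t, std::stack<float*>>;

    static float* allocate(int64_t n)
    {
        // Index per-locality state by the id of the current locality.
        std::uint32_t const loc = hpx::get_locality_id();
        std::lock_guard<mutex_type> lock(mutexes_[loc]);
        free_list_type& fl = free_lists_[loc];
        // ... pop a block from fl[n] if available, else allocate ...
        (void) fl;
        return nullptr;  // placeholder: real allocation elided
    }

private:
    static std::map<std::uint32_t, mutex_type> mutexes_;
    static std::map<std::uint32_t, free_list_type> free_lists_;
};

std::map<std::uint32_t, Matrix_Allocator::mutex_type>
    Matrix_Allocator::mutexes_;
std::map<std::uint32_t, Matrix_Allocator::free_list_type>
    Matrix_Allocator::free_lists_;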
Similarly, the Model class (injected into the Model_Driver component)
-- which is where a large proportion of the Matrix allocations occur --
is presently not coupled to the HPX runtime at all. Conceivably,
Model_Driver could provide a locality_id to the Model class (to then
pass along to a Matrix?). My first inclination is that a Matrix class
should not have knowledge of the [distributed] architecture on which it
runs, but perhaps, when dealing with a distributed program
architecture, it is necessary to create explicitly distributed types --
i.e. something like class Distributed_Matrix : public Matrix {...};
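
To make that concrete, the sort of thing I mean (purely speculative,
building on the Matrix sketch above):

#include <cstdint>

// A distribution-aware subclass, so that plain Matrix stays
// runtime-agnostic while the derived type carries locality info.
class Distributed_Matrix : public Matrix {
public:
    explicit Distributed_Matrix(std::uint32_t locality_id)
      : locality_id_(locality_id) {}

private:
    std::uint32_t locality_id_;  // e.g. from hpx::get_locality_id()
};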
Having said that, those are merely some speculations which came to mind
while trying to organize my thoughts and present this question. It all
still remains unclear to me, however. Something tells me that there
must be a better way to deal with this. Hopefully, people with more
brains and experience can provide me with some insight and guidance.
I would greatly appreciate any suggestions you can offer. If you
require further details of my code, please let me know and I'd be more
than happy to elaborate. I think, however, that the problem itself is
fairly generic and relevant to most code written for a distributed
environment -- especially where the parallelism isn't handled
explicitly in the code (as opposed to an MPI program, for example,
where this is far more straightforward).
Thanks and best regards,
Shmuel Levine
[1] The actual code is slightly more complicated than the above
description, although I don't think that changes the nature of the
question or the appropriate solution significantly. In particular, each
set of parameters is typically a std::vector<Matrix>, where each Matrix
is a different size. In other words, the code uses multiple matrix
sizes, although the number of distinct sizes is bounded by the
dimension of the parameter vector above. The actual allocator
definition is as follows:
// Header locations vary between HPX versions.
#include <hpx/lcos/local/spinlock.hpp>

#include <cstdint>
#include <map>
#include <stack>

class Matrix_Allocator {
public:
    using T = float;
    using data_type = T;
    static const int64_t alignment = 64;

private:
    using mutex_type = hpx::lcos::local::spinlock;
    // Per size, a stack of blocks available for reuse.
    using free_list_type = std::map<int64_t, std::stack<T*>>;
    // Per live pointer, the size it was allocated with.
    using allocation_list_type = std::map<T*, int64_t>;

public:
    Matrix_Allocator() {}
    ~Matrix_Allocator();

    Matrix_Allocator(Matrix_Allocator const&) = delete;
    Matrix_Allocator(Matrix_Allocator&&) = delete;

    static T* allocate(int64_t n);
    static void deallocate(T* p);

private:
    static mutex_type mtx_;
    static free_list_type free_list_;
    static allocation_list_type allocation_list_;
}; // class Matrix_Allocator
The allocation_list_ is used to track the allocated size of a given
pointer, to determine which free_list the pointer should be returned to
when a matrix is destroyed.
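
To make the intended bookkeeping concrete, the definitions look roughly
like this (a simplified sketch: C++17 aligned operator new stands in
for whatever aligned allocation the real code uses, error handling is
omitted, and the static member definitions are not shown):

#include <mutex>
#include <new>

float* Matrix_Allocator::allocate(int64_t n)
{
    std::lock_guard<mutex_type> lock(mtx_);
    std::stack<float*>& pool = free_list_[n];
    float* p = nullptr;
    if (!pool.empty()) {
        p = pool.top();  // reuse a previously freed block of this size
        pool.pop();
    } else {
        p = static_cast<float*>(::operator new(
            n * sizeof(float), std::align_val_t(alignment)));
    }
    allocation_list_[p] = n;  // remember the size for deallocate()
    return p;
}

void Matrix_Allocator::deallocate(float* p)
{
    std::lock_guard<mutex_type> lock(mtx_);
    // Look up the size recorded at allocate(); the entry stays in
    // place, since the pooled block keeps the same size.
    int64_t const n = allocation_list_.at(p);
    free_list_[n].push(p);  // return the block to the matching pool
}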