Hi Thomas,

> -----Original Message-----
> From: [email protected] [mailto:hpx-users-
> [email protected]] On Behalf Of Thomas Heller
> Sent: August-29-16 6:11 AM
> To: [email protected]
> Subject: Re: [hpx-users] Memory management in a distributed app
> 
> Hi,
> 
> On 08/28/2016 06:06 PM, Shmuel Levine wrote:
> > Hi All,
> >
> > I've finally found a bit of time once again to work on my hobby
> > project with HPX...  The long break actually gave me a fresh
> > perspective on my own code, and it occurred to me that my code has
> > some serious issues with memory management, and I'm hoping that
> > someone can help to provide me with some better insight into how to
> > best handle memory management while working in a distributed app.  In
> > particular, I would greatly appreciate some specific guidance on how
> > to address the issue in my own code, since I'm at a bit of a loss here
> 
> Let me try to answer your question. I am not sure I understood everything
> correctly though...

Thanks for your thorough message.  I have a few questions about your
reply as well.

However, let me preface this with two important points.  Firstly, I have
neither an academic nor a professional background in CS -- I'm basically
self-taught -- so my understanding of some areas may be somewhat naïve or
unsophisticated.  I apologize if this leads to any confusion.  Secondly, I
think it might be useful to start by clarifying the unstated assumption(s)
in my last message about the potential problems I had expected to run into.

\begin{assumptions}
Firstly - here's how I had visualized the concept of memory management in
this distributed system:

To start with, I considered the case of a single locality.  There are a few
variables stored on the stack, a free_list and mutex in static variable
storage space, and a number of pointers to memory chunks on the heap.  The
pointers themselves are provided by the local machine's allocator and refer
specifically to the process's address space.  Given the list of pointers to
allocated memory P = {p1, p2, ..., pn}, every pointer in this case is valid
and can be accessed freely.  Incidentally, there shouldn't even be a
segmentation fault associated with (mis-)using memory that has already
been allocated, since from the OS's point of view, the process is using
an allocated region of memory within its own address space.
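
For concreteness, here is roughly how I had pictured allocate() and
deallocate() operating on that static state.  This is just a sketch: the
member names follow the Matrix_Allocator class quoted near the end of
this message, and alignment and error handling are omitted.

// Sketch only: the single-locality picture described above.
Matrix_Allocator::T *Matrix_Allocator::allocate(int64_t n)
{
    std::lock_guard<mutex_type> lock(mtx_);
    auto &stack = free_list_[n];
    if (!stack.empty())
    {
        T *p = stack.top();   // reuse a pointer from P'
        stack.pop();
        return p;
    }
    T *p = new T[n];          // grow P with a fresh allocation
    allocation_list_[p] = n;
    return p;
}

void Matrix_Allocator::deallocate(T *p)
{
    std::lock_guard<mutex_type> lock(mtx_);
    free_list_[allocation_list_[p]].push(p);   // p rejoins P'
}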

Next, I considered what would happen with separate processes, let's call
them L0 and L1.  If I understand correctly, it wouldn't matter whether these
are on separate machines or on a single machine.  
Let's say on process 0, I instantiate a bunch of Matrix objects.  These use
allocated memory segments at P = {p1, p2, ..., pn}, as before.  For this
hypothetical example, I've also finished with some of those Matrix objects
so that my free list on process 0 -- call it F_0 -- contains P', which is a
subset of P.  Next, I pass a Matrix to an action on a component residing on
process 1.  Again, the main assumption here is that the static free list
would also be copied to locality 1, so that F_1 == F_0 and both contain
the same list P'.

Now, the code running on L1 calls a Matrix constructor with the same static
allocator containing list F_1.  As mentioned in my above assumptions, F_1
contains pointers P' -- all of which point to memory allocated within
L0's address space.  Consider the first pointer p' on the free_list: on
L1, the address pointed to by p' was not allocated within L1's address
space, so I assume that any access through it would cause a segmentation
fault.

As a final underlying assumption -- following my earlier understanding
that the runtime system handles the allocation of the new matrix when it
is deserialized on L1 (call the Matrix on L1 m1): m1's data is
deserialized into p'', a pointer allocated within L1's address space.
When m1 goes out of scope, p'' can be added to F_1 without a problem, and
another matrix on L1 -- say, m1' -- can safely grab p''.

\end{assumptions}
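
To put the crux of that assumption into code: for F_1 == F_0 to hold, a
Matrix's serialize member would have had to ship the static free list
along with the instance data -- something like the purely hypothetical
version below (rows_, cols_ and data_ are made-up names for the instance
members):

// Hypothetical only -- what my assumption would require, not what my
// code does.  Static members never enter the archive, and the second
// line would ship raw pointers that are meaningless in L1's address
// space anyway.
template <typename Archive>
void Matrix::serialize(Archive &ar, unsigned version)
{
    ar & rows_ & cols_ & data_;          // instance data only
    ar & Matrix_Allocator::free_list_;   // <-- this never happens
}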

I never meant to suggest in my previous message that my design was
_correct_.  On the contrary -- it causes segmentation faults -- and I was
hoping for some clarification on how to handle this problem properly.



[snip] 



> > In general, there are a large number of Matrix objects created and
> > destructed - there is, essentially, a necessity to use a custom
> > allocator to manage the allocation/deallocation of memory in the
> > program.
> 
> Alright.
> 
> >
> > The first and naive attempt that I made (currently, it's all that I've
> > done) is a Matrix_Data_Allocator class, which manages a memory pool.
> > [1]  The free_list is a static object in the allocator class, and the
> > allocate and deallocate functions are static functions. Similarly, the
> > mutex is also a static member of the allocator class.
> 
> Ok. A possible optimization would be to either use thread local free
> lists or lockfree/waitfree ones.

If I understand correctly: thread-local free lists would take more memory
but would completely eliminate contention and, therefore, the need for
any mutex at all.  Lockfree/waitfree lists would not necessarily use more
memory, but would avoid blocking and so improve performance under high
contention.  Sound about right?
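
For my own notes, the thread-local variant I'm picturing would be
something like the sketch below.  I'm assuming here that allocate() and
deallocate() never suspend the calling HPX thread mid-call, so binding
the lists to the underlying OS thread via thread_local is safe:

// Thread-local sketch: replace the static members with thread_local
// ones and drop the mutex entirely.
class Matrix_Allocator
{
    // ... as before, except:
    static thread_local free_list_type free_list_;
    static thread_local allocation_list_type allocation_list_;
    // no mtx_ needed
};

Matrix_Allocator::T *Matrix_Allocator::allocate(int64_t n)
{
    auto &stack = free_list_[n];   // this thread's list only
    if (!stack.empty())
    {
        T *p = stack.top();
        stack.pop();
        return p;
    }
    T *p = new T[n];
    allocation_list_[p] = n;
    return p;
}

One caveat I can already see: a buffer freed on a different thread than
the one that allocated it would not be found in the freeing thread's
allocation_list_, so cross-thread frees would need separate handling.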

> 
> >
> > The obvious problem with this is that although it should work fine
> > with a single locality, it is clearly going to cause segmentation
> > faults in a distributed app.  Although, from my understanding of the
> > serialization code in HPX, the transfer of a Matrix from the main
> > locality to a remote locality to calculate the model fitness does not
> > use the Matrix allocator -- allocation is handled by the serialization
> > code -- all other constructors/destructors will be a problem.
> 
> Well, what happens during serialization is that the data is copied
> over the network and in the case of a container with dynamic size, you
> allocate your memory and then copy the received data (inside of the
> archive) into the newly created objects.

Sorry, I'm just a little stuck on your wording "you allocate your memory and
then copy...."

As I understand from the code, the allocation of memory for the underlying
serialize_buffer member is already defined in the serialize_buffer class,
and will use the Allocator type passed as a template parameter to
serialize_buffer<T, Allocator>.  

Consequently, in my own code, I've followed the 1d_stencil_8.cpp example: I
do not use my custom allocator as a template parameter for the
serialize_buffer -- rather, the allocation is done in my Matrix_Data class
and the pointer is then passed to the serialize_buffer with init_mode =
buffer_type::take, along with a pointer to the custom deallocator function.
The Matrix_Data constructor definition is:

core::detail::Matrix_Data::Matrix_Data(int64_t elements)
  : data_buffer_elements_(elements)
  , data_{alloc_.allocate(data_buffer_elements_ * sizeof(data_type)),
          static_cast<size_t>(data_buffer_elements_),
          buffer_type::take,
          &Matrix_Data::deallocate}
{
}

n.b. the class detail::Matrix_Data is a member of my Matrix class, and it
handles the memory management for the matrix.
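
For completeness, the deallocator hook whose address I pass above is just
a thin wrapper over the pool -- roughly:

// The hook handed to the serialize_buffer: when the buffer's reference
// count drops to zero, the pointer returns to the pool rather than to
// operator delete.
void core::detail::Matrix_Data::deallocate(data_type *p)
{
    Matrix_Allocator::deallocate(p);
}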


> 
> I don't think that creates any problems for you. The allocator you
> described above only carries global state (through the static
> variables). So the serialization of the allocator would essentially do
> nothing (look at it as a tag on which allocator to use). So when
> receiving a new serialize_buffer and deserializing it, you just
> allocate memory from the locality local free list (the same should
> happen when deallocating the memory).
> 

I'm confused here by your wording:  "you just allocate memory from the
locality local free list".  Where did I get a locality local free list?  The
only thing I can think of is that I would just not include the free list in
the archive object used for serialization.  But I'm not sure if this is your
intent...  

If I'm completely mistaken here, I hope you can clarify what you meant.
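
In case it helps pin down where I'm lost, here is how I would translate
your suggestion into code -- a hypothetical load() member using the names
from my Matrix_Data above, assuming the save()/load() split via
HPX_SERIALIZATION_SPLIT_MEMBER().  Please correct me if this isn't what
you meant:

// My reading of your suggestion: on the receiving locality, load()
// ignores any allocator state and simply asks the local static pool
// for a fresh buffer, so every pointer it touches is valid on L1.
template <typename Archive>
void Matrix_Data::load(Archive &ar, unsigned version)
{
    ar >> data_buffer_elements_;
    data_type *p = Matrix_Allocator::allocate(data_buffer_elements_);
    ar >> hpx::serialization::make_array(p, data_buffer_elements_);
    data_ = buffer_type(p, static_cast<size_t>(data_buffer_elements_),
                        buffer_type::take, &Matrix_Data::deallocate);
}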

> >
> > The most obvious way to work around the problem that comes to my mind
> > would be changing the free_list (and mutex) into a
> > std::map<std::uint32_t, free_list_type> (and
> > std::map<std::uint32_t,mutex>) so that each locality has a separate
> > mutex, but something about this seems to me to be wrong -- it requires
> > the allocator to be tightly-coupled with the HPX runtime, so that the
> > allocator can call hpx::get_locality_id() to index the appropriate
> > free_list.
> 
> I don't think that is needed at all. static variables are not part of
> AGAS, they are local to your process.

I realize that the static variables are not part of AGAS -- in fact,
that's exactly what's causing my confusion here (at least in my mind...).
To be slightly more specific, the issue isn't the static variable itself,
but what is contained within it -- i.e. pointers which are valid only
within a particular locality's address space.
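
For the record, the workaround I had floated would have looked roughly
like the sketch below (free_lists_, allocation_lists_ and map_mtx_ are
hypothetical names) -- and writing it out rather makes your point for me:
within any one process, hpx::get_locality_id() only ever returns a single
value, so each process's map would only ever hold one entry, which is
exactly what the plain per-process static already gives me.

// The (now moot) locality-keyed workaround; 'here' never changes for
// the lifetime of the process, so the outer maps are pointless.
Matrix_Allocator::T *Matrix_Allocator::allocate(int64_t n)
{
    std::uint32_t const here = hpx::get_locality_id();
    std::lock_guard<mutex_type> lock(map_mtx_);
    auto &stack = free_lists_[here][n];
    if (!stack.empty())
    {
        T *p = stack.top();
        stack.pop();
        return p;
    }
    T *p = new T[n];
    allocation_lists_[here][p] = n;
    return p;
}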

> 
> >
> > Similarly, the Model class (injected into the Model_Driver component)
> > -- which is where a large proportion of the Matrix allocations
> > occur -- is also presently not coupled at all to the HPX runtime.
> > Conceivably, Model_Driver could provide a locality_id to the
> > Model class (to then pass along to a Matrix?).  Although my first
> > inclination is that a Matrix class should not have knowledge of the
> > [distributed] architecture on which it runs, perhaps when dealing
> > with a distributed program architecture it is necessary to create
> > distributed-type classes explicitly -- i.e. something like
> > class Distributed_Matrix : public Matrix {..};
> > Having said that, those are merely some speculations which came to
> > mind while trying to organize my thoughts and present this question.
> > It still remains unclear in my mind, however.  Something tells me
> > that there must be a better way to deal with this.  Hopefully, people
> > with more brains and experience can provide me with some insight and
> > guidance.
> 
> I hope the description above sheds some light on it, the matrix class
> doesn't need any locality information, unless you want to create a
> truly distributed data structure (as opposed to just a regular
> container that is sent over the wire).

I don't think that is what I want to do...  It seems to me that a Matrix
class should be as agnostic as possible of the environment in which it is
used.

> 
> >
> > I would greatly appreciate any suggestions that you can offer.  If you
> > require further details of my code, please let me know and I'd be more
> > than happy to elaborate further. However, I think that the problem
> > itself is fairly generic and is relevant to most code which is written
> > for a distributed environment - especially where the parallelism isn't
> > handled explicitly in the code (as opposed to an MPI program, for
> > example, where this is far more straightforward).
> >
> > Thanks and best regards,
> > Shmuel Levine
> >
> >
> > [1] The actual code is slightly more complicated than the above
> > description, although I don't think that it changes the nature of the
> > question or the appropriate solution significantly.  In particular,
> > each set of parameters is typically a std::vector<Matrix>, where each
> > Matrix is a different size. In other words, the code uses multiple
> > matrix sizes, although the number of different sizes is constrained to
> > the dimension of the parameter vector above.  The actual allocator
> > definition is as follows:
> >
> > class Matrix_Allocator {
> > public:
> >    using T = float;
> >    using data_type = T;
> >    static const int64_t alignment = 64;
> >
> > private:
> >    using mutex_type = hpx::lcos::local::spinlock;
> >    using free_list_type = std::map<int64_t, std::stack<T *>>;
> >    using allocation_list_type = std::map<T *, int64_t>;
> >
> > public:
> >    Matrix_Allocator() {}
> >    ~Matrix_Allocator();
> >    Matrix_Allocator(Matrix_Allocator const &) = delete;
> >    Matrix_Allocator(Matrix_Allocator &&) = delete;
> >
> >    static T *allocate(int64_t n);
> >    static void deallocate(T *p);
> >
> > private:
> >    static mutex_type mtx_;
> >    static free_list_type free_list_;
> >    static allocation_list_type allocation_list_;
> >
> > }; // class Matrix_Allocator
> >
> > The allocation_list_ is used to track the allocated size of a given
> > pointer, to determine to which free_list should the pointer be added
> > upon destruction of a matrix.
> >
> 
> 
> --
> Thomas Heller
> Friedrich-Alexander-Universität Erlangen-Nürnberg
> Department Informatik - Lehrstuhl Rechnerarchitektur
> Martensstr. 3
> 91058 Erlangen
> Tel.: 09131/85-27018
> Fax:  09131/85-27912
> Email: [email protected]
