So, if I've finally understood what you're telling me, then given:
class A {
    static int i_;
};

class B {
    static A a_;
    /* various state */
};
Locality 0 keeps one instance of B::a_ in its process's static storage.
Locality 1 keeps a completely separate instance of B::a_ in its own process's
static storage.
Any object of type B constructed on locality 0 - whether in a locally-called
function or through a message deserialized on locality 0 - will always refer
to locality 0's single static object B::a_.
Any object of type B constructed on locality 1 - whether in a locally-called
function or through a message deserialized on locality 1 - will always refer
to locality 1's single static object B::a_.
In other words, by default, any global object that is not explicitly
synchronized somehow will be locality-local.
Does that sound about right?
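
If it does, then I'd expect a toy program along these lines to print "1" for
every locality, since each one increments its own private copy of the static
counter. (This is just an untested sketch to check my understanding; the names
are mine, not anything from HPX apart from the obvious API calls.)

#include <hpx/hpx_main.hpp>
#include <hpx/hpx.hpp>

#include <cstdint>
#include <iostream>

// One of these statics exists per process, i.e. per locality.
struct counter
{
    static std::int64_t value_;
};
std::int64_t counter::value_ = 0;

// Bump and report this locality's private copy of counter::value_.
std::int64_t bump()
{
    return ++counter::value_;
}
HPX_PLAIN_ACTION(bump, bump_action);

int main()
{
    for (hpx::id_type const& loc : hpx::find_all_localities())
    {
        // Each locality starts from its own zero, so every call returns 1.
        std::int64_t n = bump_action()(loc);
        std::cout << "locality "
                  << hpx::naming::get_locality_id_from_id(loc)
                  << " -> " << n << std::endl;
    }
    return 0;
}

That is, nothing aggregates across localities unless I synchronize it
explicitly (via a component, a broadcast, etc.).
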
> -----Original Message-----
> From: [email protected] [mailto:hpx-users-
> [email protected]] On Behalf Of Thomas Heller
> Sent: August-29-16 3:32 PM
> To: [email protected]
> Subject: Re: [hpx-users] Memory management in a distributed app
>
> On Montag, 29. August 2016 14:29:32 CEST Michael Levine wrote:
> > Hi Thomas,
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:hpx-users-
> > > [email protected]] On Behalf Of Thomas Heller
> > > Sent: August-29-16 6:11 AM
> > > To: [email protected]
> > > Subject: Re: [hpx-users] Memory management in a distributed app
> > >
> > > Hi,
> > >
> > > On 08/28/2016 06:06 PM, Shmuel Levine wrote:
> > > > Hi All,
> > > >
> > > > I've finally found a bit of time once again to work on my hobby
> > > > project with HPX... The long break actually gave me a fresh
> > > > perspective on my own code, and it occurred to me that my code has
> > > > some serious issues with memory management, and I'm hoping that
> > > > someone can help to provide me with some better insight into how
> > > > to best handle memory management while working in a distributed
> > > > app. In particular, I would greatly appreciate some specific
> > > > guidance on how to address the issue in my own code, since I'm at
> > > > a bit of a loss here.
> > >
> > > Let me try to answer your question. I am not sure I understood
> > > everything correctly though...
> >
> > Thanks for your thorough message. I, as well, have a few questions on
> > your message.
> >
> > However, let me preface this with two important points. Firstly, I don't
> > have either an academic or professional background in CS - I'm basically
> > self-taught, so I might be somewhat naïve or unsophisticated in my
> > understanding of some areas.
> > I apologize if this leads to any confusion. Secondly, I think it
> > might be useful for me to start by clarifying the unstated
> > assumption(s) in my last message about the potential problems that I
> > had thought would be an issue.
> >
> > \begin {assumptions}
> > Firstly - here's how I had visualized the concept of memory management
> > in this distributed system:
> >
> > To start with, I considered the case of a single locality. There are
> > a few variables stored on the stack, a free_list and mutex in static
> > variable storage space, and a number of pointers to memory chunks on
> > the heap. The pointers themselves are provided by the local machine's
> > allocator and refer specifically to the process's address space.
> > Given the list of pointers to allocated memory P = {p1, p2, ..., pn},
> > every pointer in this case is valid and can be accessed freely.
> > Incidentally, there shouldn't even be a segmentation fault associated
> > with (mis-)using memory that has already been allocated, since from
> > the OS's point-of-view, the process is using an allocated region of
> > memory within the process's address space.
> >
> > Next, I considered what would happen with separate processes, let's
> > call them L0 and L1. If I understand correctly, it wouldn't matter
> > whether these are on separate machines or on a single machine.
> > Let's say on process 0, I instantiate a bunch of Matrix objects.
> > These use allocated memory segments at P = {p1, p2, ..., pn}, as
> > before. For this hypothetical example, I've also finished with some
> > of those Matrix objects so that my free list on process 0 -- call it
> > F_0 -- contains P', which is a subset of P. Next, I pass a Matrix to
> > an action on a component residing on process 1. Again, the main
> > assumption here is:
> > - I had assumed that the static free list would also be copied to
> > locality 1, so that F_1 == F_0, and both contain the same list P'.
> >
> > Now, the code running on L1 calls a Matrix constructor with the same
> > static allocator containing list F_1. As mentioned in my above
> > assumptions, F_1 contains pointers P' -- all of which are pointers to
> > memory allocated within
> > L0 process memory space. Considering the first pointer p' on the
> > free_list, on L1, the address pointed-to by p' was not allocated
> > within L1's address space. As such, I assume that any access of this
> > space would cause a segmentation fault.
> >
> > As a final underlying assumption -- following my earlier understanding
> > that the runtime system handles the allocation of the new matrix when
> > the matrix is de-serialized on L1 (let's call the Matrix on L1 m1):
> > m1's data is deserialized into p'', which is a pointer allocated
> > within L1's address space. When m1 goes out of scope, p'' can be
> > added to F_1 without a problem. Another matrix on L1 -- say m1' -- can
> > safely grab p''.
> >
> > \end {assumptions}
>
> please take a look at this pseudo code:
> template <typename T>
> struct allocator
> {
>     T *allocate(std::size_t count)
>     {
>         return free_list_.get(count);
>     }
>
>     void deallocate(T* p, std::size_t count)
>     {
>         free_list_.push(p, count);
>     }
>
>     static free_list<T> free_list_;
>
>     template <typename Archive>
>     void serialize(Archive& ar, unsigned)
>     {
>         // This is empty, we don't really maintain state.
>     }
> };
>
> void f(serialize_buffer<float, allocator<float> > buf) { }
>
> HPX_PLAIN_ACTION(f)
>
> void g()
> {
>     // Create a serialization buffer object.
>     serialize_buffer<float, allocator<float> > buf(100);
>     // We now have 100 floats allocated using our allocator.
>     //
>     // The source locality (L0) has its own static free_list<T> F_0,
>     // which might contain various entries.
>     //
>     // The memory allocated for buf is now pointed to by the valid
>     // pointer P_0.
>     //
>     // We want to call f on another locality...
>     id_type there = ...;
>     // Buf will now get sent to 'there'. What will happen is that we
>     // now copy the content of buf over the network to 'there'.
>     // Once 'there' has received this message (we call that message a
>     // parcel), it needs to deserialize it. In order to do that, it
>     // needs to allocate memory for 100 floats, using our allocator
>     // with its own process-private free list.
>     // The two localities do not need to share the pointers in their
>     // free lists. Compare it to having a thread-local free list.
>     f_action()(there, buf);
> }
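
(Just to make sure I'm picturing the same thing: I read the free_list<T> in
your sketch as a process-private pool along these lines - a mutex-protected
map of stacks keyed by allocation size. The exact shape below is my own guess,
not anything from HPX, so please correct me if I've misread it.)

#include <cstddef>
#include <map>
#include <mutex>
#include <stack>

// Locality-local free list: the allocator above holds one static instance
// of this per process, so each locality only ever recycles pointers that
// were allocated within its own address space.
template <typename T>
struct free_list
{
    T* get(std::size_t count)
    {
        std::lock_guard<std::mutex> lock(mtx_);
        auto& bucket = buckets_[count];
        if (!bucket.empty())
        {
            T* p = bucket.top();
            bucket.pop();
            return p;
        }
        return static_cast<T*>(::operator new(count * sizeof(T)));
    }

    void push(T* p, std::size_t count)
    {
        std::lock_guard<std::mutex> lock(mtx_);
        buckets_[count].push(p);
    }

private:
    std::mutex mtx_;
    std::map<std::size_t, std::stack<T*>> buckets_;
};
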
>
> >
> > I did not mean to ever suggest in my previous message that my design
> > was _correct_. On the contrary - it causes segmentation faults - and
> > I was hoping for some clarification as to how to properly handle this
> > problem.
>
> Do you have an actual implementation that segfaults? Does the above
> clarify what I meant?
>
> >
> >
> >
> > [snip]
> >
> > > > In general, there are a large number of Matrix objects created and
> > > > destructed - there is, essentially, a necessity to use a custom
> > > > allocator to manage the allocation/deallocation of memory in the
> > > > program.
> > >
> > > Alright.
> > >
> > > > The first and naive attempt that I made (currently, it's all that
> > > > I've
> > > > done) is a Matrix_Data_Allocator class, which manages a memory pool.
> > > > [1] The free_list is a static object in the allocator class, and
> > > > the allocate and deallocate functions are static functions.
> > > > Similarly, the mutex is also a static member of the allocator class.
> > >
> > > Ok. A possible optimization would be to either use thread-local free
> > > lists or lockfree/waitfree ones.
> >
> > If I understand correctly - thread-local would take more memory but
> > would completely eliminate contention and, therefore, the need for any
> > mutex at all. Lockfree/waitfree would not necessarily use more
> > memory, but would prevent locking and improve performance during times
> > of high contention.
> > Sound about right?
>
> Right! And, as a matter of fact, thread-local and locality-local aren't
> that far apart ;)
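
(And if I follow the thread-local suggestion, I assume it's essentially the
same free list as in my sketch above, but held per OS worker thread via
thread_local, so the mutex disappears entirely. Again, this is just my own
guess at the shape, not HPX code:)

#include <cstddef>
#include <map>
#include <stack>

// Per-thread free list: every OS worker thread gets its own buckets,
// so there is no sharing between threads and therefore no locking at all.
template <typename T>
struct thread_local_free_list
{
    T* get(std::size_t count)
    {
        auto& bucket = buckets()[count];
        if (bucket.empty())
            return static_cast<T*>(::operator new(count * sizeof(T)));
        T* p = bucket.top();
        bucket.pop();
        return p;
    }

    void push(T* p, std::size_t count)
    {
        // A pointer freed on a different thread than the one that
        // allocated it simply lands in that thread's buckets - still
        // correct, just a different reuse pattern.
        buckets()[count].push(p);
    }

private:
    static std::map<std::size_t, std::stack<T*>>& buckets()
    {
        thread_local std::map<std::size_t, std::stack<T*>> b;
        return b;
    }
};
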
>
> >
> > > > The obvious problem with this is that although it should work fine
> > > > with a single locality, it is clearly going to cause segmentation
> > > > faults in a distributed app. Although, from my understanding of
> > > > the serialization code in HPX, the transfer of a Matrix from the
> > > > main locality to a remote locality to calculate the model fitness
> > > > does not use the Matrix allocator -- allocation is handled by the
> > > > serialization code -- all other constructors/destructors will be a
> > > > problem.
> > >
> > > Well, what happens during serialization is that the data is copied
> > > over the network and, in the case of a container with dynamic size,
> > > you allocate your memory and then copy the received data (inside of
> > > the archive) into the newly created objects.
> >
> > Sorry, I'm just a little stuck on your wording "you allocate your
> > memory and then copy...."
>
> My fault ... I tend to personalize the code that gets executed ... so yes,
> serialize_buffer handles memory management for you, even in the case
> when you send it over the wire. It should always contain a correct buffer.
>
> >
> > As I understand from the code, the allocation of memory for the
> > underlying serialize_buffer member is already defined in the
> > serialize_buffer class, and will use the Allocator type passed as a
> > template parameter to serialize_buffer<T, Allocator>.
> >
> > Consequently, in my own code, I've followed the 1d_stencil_8.cpp
> > example: I do not use my custom allocator as a template parameter for
> > the serialize_buffer -- rather, the allocation is done in my
> > Matrix_Data class and the pointer then passed to the serialize_buffer
> > with init_mode = buffer_type::take, along with a pointer to the custom
> > deallocator function.
> > The Matrix_Data constructor definition is:
> >
> > core::detail::Matrix_Data::Matrix_Data(int64_t elements)
> >   : data_buffer_elements_(elements)
> >   , data_{alloc_.allocate(data_buffer_elements_ * sizeof(data_type)),
> >           static_cast<size_t>(data_buffer_elements_),
> >           buffer_type::take,
> >           &Matrix_Data::deallocate}
> > {}
> >
> > n.b. the class detail::Matrix_Data is a member of my Matrix class, and
> > it handles the memory management for the matrix.
>
> That shouldn't matter at all. If the data is serialized to another
> locality, serialize_buffer will eventually allocate new memory on the
> other locality using its own, internal allocator. The pointer to the
> data is then, of course, not obtained from your free list. That is,
> unless you instantiate serialize_buffer with your custom allocator.
>
> >
> > > I don't think that creates any problems for you. The allocator you
> > > described above only carries global state (through the static
> > > variables). So the serialization of the allocator would essentially
> > > do nothing (look at it as a tag on which allocator to use). So when
> > > receiving a new serialize_buffer and deserializing it, you just
> > > allocate memory from the locality-local free list (the same should
> > > happen when deallocating the memory).
> >
> > I'm confused here by your wording: "you just allocate memory from the
> > locality local free list". Where did I get a locality local free
> > list? The only thing I can think of is that I would just not include
> > the free list in the archive object used for serialization. But I'm
> > not sure if this is your intent...
>
> The static member of your allocator is a "locality local" free list.
>
> >
> > If I'm completely mistaken here, then I'm hoping you might be able to
> > better clarify for me your intent.
> >
> > > > The most obvious way to work around the problem that comes to my
> > > > mind would be changing the free_list (and mutex) into a
> > > > std::map<std::uint32_t, free_list_type> (and
> > > > std::map<std::uint32_t,mutex>) so that each locality has a
> > > > separate mutex, but something about this seems to me to be wrong
> > > > -- it requires the allocator to be tightly-coupled with the HPX
> > > > runtime, so that the allocator can call hpx::get_locality_id() to
> > > > index the appropriate free_list.
> > >
> > > I don't think that is needed at all. static variables are not part
> > > of AGAS, they are local to your process.
> >
> > I realize that the static variables are not part of the AGAS -- in
> > fact, that's exactly what's causing me the confusion here (at least in
> > my mind...). To be slightly more specific, the issue in my mind isn't
> > the static variable itself, but what is contained within the static
> > variable -- i.e. pointers which are valid only within a particular
> > locality's address space.
>
> Right, which you shouldn't send over the wire and try to use in a
> different address space to dereference the memory pointed to ;)
>
> >
> > > > Similarly, the Model class (injected into the Model_Driver
> > > > component) -- which is where a large proportion of the Matrix
> > > > allocations occurs -- is also presently not coupled at all to the
> > > > HPX runtime. Although, conceivably, Model_Driver could provide a
> > > > locality_id to the Model class (to then pass along to a Matrix?).
> > > > Although my first inclination is that a Matrix class should not
> > > > have knowledge of the [distributed] architecture on which it runs,
> > > > perhaps when dealing with a distributed program architecture it is
> > > > necessary to explicitly create distributed-type classes -- i.e.
> > > > something like class Distributed_Matrix : public Matrix {..};
> > > > Having said that, those are merely some speculations which came to
> > > > mind while trying to organize my thoughts and present this
> > > > question. It still remains unclear in my mind, however. Something
> > > > tells me that there must be a better way to deal with this.
> > > > Hopefully, people with more brains and experience can provide me
> > > > with some insight and guidance.
> > >
> > > I hope the description above sheds some light on it, the matrix
> > > class doesn't need any locality information, unless you want to
> > > create a truly distributed data structure (as opposed to just a
> > > regular container that is sent over the wire).
> >
> > I don't think that is what I want to do... It would seem to me that a
> > Matrix class should be completely agnostic (or at least as completely
> > as
> > possible) of the environment in which it is used.
> >
> > > > I would greatly appreciate any suggestions that you can offer. If
> > > > you require further details of my code, please let me know and I'd
> > > > be more than happy to elaborate further. However, I think that the
> > > > problem itself is fairly generic and is relevant to most code
> > > > which is written for a distributed environment - especially where
> > > > the parallelism isn't handled explicitly in the code (as opposed
> > > > to an MPI program, for example, where this is far more
> > > > straightforward).
> > > >
> > > > Thanks and best regards,
> > > > Shmuel Levine
> > > >
> > > >
> > > > [1] The actual code is slightly more complicated than the above
> > > > description, although I don't think that it changes the nature of
> > > > the question or the appropriate solution significantly. In
> > > > particular, each set of parameters is typically a
> > > > std::vector<Matrix>, where each Matrix is a different size. In
> > > > other words, the code uses multiple matrix sizes, although the
> > > > number of different sizes is constrained to the dimension of the
> > > > parameter vector above. The actual allocator definition is as
> > > > follows:
> > > >
> > > > class Matrix_Allocator {
> > > >
> > > > public:
> > > >     using T = float;
> > > >     using data_type = T;
> > > >     static const int64_t alignment = 64;
> > > >
> > > > private:
> > > >     using mutex_type = hpx::lcos::local::spinlock;
> > > >     using free_list_type = std::map<int64_t, std::stack<T *>>;
> > > >     using allocation_list_type = std::map<T *, int64_t>;
> > > >
> > > > public:
> > > >     Matrix_Allocator() {}
> > > >     ~Matrix_Allocator();
> > > >     Matrix_Allocator(Matrix_Allocator const &) = delete;
> > > >     Matrix_Allocator(Matrix_Allocator &&) = delete;
> > > >
> > > >     static T *allocate(int64_t n);
> > > >     static void deallocate(T *p);
> > > >
> > > > private:
> > > >     static mutex_type mtx_;
> > > >     static free_list_type free_list_;
> > > >     static allocation_list_type allocation_list_;
> > > >
> > > > }; // class Matrix_Allocator
> > > >
> > > > The allocation_list_ is used to track the allocated size of a
> > > > given pointer, to determine to which free list the pointer should
> > > > be added upon destruction of a matrix.
> > > >
> > >
> > > --
> > > Thomas Heller
> > > Friedrich-Alexander-Universität Erlangen-Nürnberg Department
> > > Informatik - Lehrstuhl Rechnerarchitektur Martensstr. 3
> > > 91058 Erlangen
> > > Tel.: 09131/85-27018
> > > Fax: 09131/85-27912
> > > Email: [email protected]
> >
>
>
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users