On 29.08.2016 at 10:04 PM, "Michael Levine" <[email protected]> wrote: > > So, if I've finally understood what you're telling me- > > class A { > static int i_; > }; > > class B { > static A a_; > /* various state */ > }; > > Locality 0 keeps a version of B::a_ in the process's static variable memory > space. > Locality 1 keeps a completely separate version of B::a_ in its own process > static variable memory space. > > Any object of type B constructed on locality 0 - either in a locally-called > function or through a message deserialized on locality 0 - will always refer > to the same static object B::a_. > Any object of type B constructed on locality 1 - either in a locally-called > function or through a message deserialized on locality 1 - will always refer > to the same static object B::a_. > > In other words, by default, any global object that is not explicitly > synchronized somehow will be locality-local. > > Does that sound about right?
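For illustration, here is a minimal sketch of that picture, assuming HPX's plain-action API (HPX_PLAIN_ACTION, hpx::find_all_localities); the function name bump_and_report and the exact headers are illustrative, not taken from this thread. Each locality the action runs on increments its own copy of B::a_.i_, so the counters on locality 0 and locality 1 evolve independently:

#include <hpx/hpx_main.hpp>
#include <hpx/include/actions.hpp>
#include <hpx/include/runtime.hpp>

#include <cstddef>
#include <iostream>
#include <vector>

struct A { static int i_; };
int A::i_ = 0;

struct B { static A a_; /* various state */ };
A B::a_;

// Runs on whichever locality the action is sent to; it only ever touches
// that process's own copy of B::a_ (and hence its own A::i_).
int bump_and_report()
{
    return ++B::a_.i_;
}
HPX_PLAIN_ACTION(bump_and_report, bump_and_report_action);

int main()
{
    std::vector<hpx::id_type> localities = hpx::find_all_localities();
    for (std::size_t i = 0; i != localities.size(); ++i)
    {
        // Each locality reports its own, independent counter value.
        std::cout << "locality " << i << ": "
                  << bump_and_report_action()(localities[i]) << "\n";
    }
    return 0;
}

Started on two localities, each process counts only the calls routed to it; nothing about B::a_ is shared or synchronized between them.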
Yes, exactly! Conceptually, the only difference between a regular global and one in a class is the scope and potential access control (private, public etc). So it doesn't really belong to a single object's state. > > > > -----Original Message----- > > From: [email protected] [mailto:hpx-users- > > [email protected]] On Behalf Of Thomas Heller > > Sent: August-29-16 3:32 PM > > To: [email protected] > > Subject: Re: [hpx-users] Memory management in a distributed app > > > > On Montag, 29. August 2016 14:29:32 CEST Michael Levine wrote: > > > Hi Thomas, > > > > > > > -----Original Message----- > > > > From: [email protected] [mailto:hpx-users- > > > > [email protected]] On Behalf Of Thomas Heller > > > > Sent: August-29-16 6:11 AM > > > > To: [email protected] > > > > Subject: Re: [hpx-users] Memory management in a distributed app > > > > > > > > Hi, > > > > > > > > On 08/28/2016 06:06 PM, Shmuel Levine wrote: > > > > > Hi All, > > > > > > > > > > I've finally found a bit of time once again to work on my hobby > > > > > project with HPX... The long break actually gave me a fresh > > > > > perspective on my own code, and it occurred to me that my code has > > > > > some serious issues with memory management, and I'm hoping that > > > > > someone can help to provide me with some better insight into how > > > > > to best handle memory management while working in a distributed > > > > > app. In particular, I would greatly appreciate some specific > > > > > guidance on how to address the issue in my own code, since I'm at > > > > > a bit of a loss here > > > > > > > > Let me try to answer your question. I am not sure I understood > > > > everything correctly though... > > > > > > Thanks for your thorough message. I, as well, have a few questions on > > > your message. > > > > > > However, let me preface two important points. Firstly, I don't have > > > either academic or professional background in CS - I'm basically > > > self-taught. So I might be somewhat naïve or unsophisticated in > > understanding of some areas. > > > I apologize if this leads to any confusion. Secondly, I think it > > > might be useful for me to start by clarifying the unstated > > > assumption(s) in my last message about the potential problems that I had > > thought would be an issue. > > > > > > \begin {assumptions} > > > Firstly - here's how I had visualized the concept of memory management > > > in this distributed system: > > > > > > To start with, I considered the case of a single locality. There are > > > a few variables stored on the stack, a free_list and mutex in static > > > variable storage space, and a number of pointers to memory chunks on > > > the heap. The pointers themselves are provided by the local machine's > > > allocator and refer specifically to the process's address space. > > > Given the list of pointers to allocated memory P = {p1, p2, ..., pn}, > > > every pointer in this case is valid and can be accessed freely. > > > Incidentally, there shouldn't even be a segmentation fault associated > > > with (mis-)using memory that has already been allocated, since from > > > the OS's point-of-view, the process is using an allocated region of > memory > > within the process's address space. > > > > > > Next, I considered what would happen with separate processes, let's > > > call them L0 and L1. If I understand correctly, it wouldn't matter > > > whether these are on separate machines or on a single machine. > > > Let's say on process 0, I instantiate a bunch of Matrix objects. 
> > > These use allocated memory segments at P = {p1, p2, ..., pn}, as > > > before. For this hypothetical example, I've also finished with some > > > of those Matrix objects so that my free list on process 0 -- call it > > > F_0 -- contains P', which is a subset of P. Next, I pass a Matrix to > > > an action on a component residing on process 1. Again, the main > > assumption here is: > > > - I had assumed that the static free list would also be copied to > > > locality 1, so that F_1 == F_0, and both contain the same list P'. > > > > > > Now, the code running on L1 calls a Matrix constructor with the same > > > static allocator containing list F_1. As mentioned in my above > > > assumptions, F_1 contains pointers P' -- all of which are pointers to > > > memory allocated within > > > L0's process memory space. Considering the first pointer p' on the > > > free_list, on L1, the address pointed to by p' was not allocated > > > within L1's address space. As such, I assume that any access of this > > > space would cause a segmentation fault. > > > > > > As a final underlying assumption -- following my earlier understanding > > > that the runtime system handles the allocation of the new matrix when > > > the matrix is de-serialized on L1 (let's call the Matrix on L1 > > > m1): m1's data is deserialized into p'', which is a pointer allocated > > > within L1's address space. When m1 goes out of scope, p'' can be > > > added to F_1 without a problem. Another matrix on L1 -- say m1' -- can > > safely grab p''. > > > > > > \end {assumptions} > >
> > Please take a look at this pseudo code:
> >
> > template <typename T>
> > struct allocator
> > {
> >     T *allocate(std::size_t count)
> >     {
> >         return free_list_.get(count);
> >     }
> >
> >     void deallocate(T* p, std::size_t count)
> >     {
> >         free_list_.push(p, count);
> >     }
> >
> >     static free_list<T> free_list_;
> >
> >     template <typename Archive>
> >     void serialize(Archive& ar, unsigned)
> >     {
> >         // This is empty, we don't really maintain state.
> >     }
> > };
> >
> > void f(serialize_buffer<float, allocator<float> > buf) { }
> >
> > HPX_PLAIN_ACTION(f)
> >
> > void g()
> > {
> >     // Create a serialization buffer object.
> >     serialize_buffer<float, allocator<float> > buf(100);
> >     // We now have 100 floats allocated using our allocator.
> >     //
> >     // The source locality (L0) has its own static free_list<T> F_0, which
> >     // might contain various entries.
> >     //
> >     // The memory allocated for buf is now pointed to by the valid pointer P_0.
> >     //
> >     // We want to call f on another locality...
> >     id_type there = ...;
> >     // Buf will now get sent to 'there'. What will happen is that we copy
> >     // the content of buf over the network to 'there'.
> >     // Once 'there' has received this message (we call such a message a parcel),
> >     // it needs to deserialize it. In order to do that, it needs to allocate
> >     // memory for 100 floats, using our allocator with its own process-private
> >     // free list. The two localities do not need to share the pointers in
> >     // their free lists. Compare it to having a thread-local free list.
> >     f_action()(there, buf);
> > }
> >
> > > I did not mean to ever suggest in my previous message that my design > > > was _correct_. On the contrary - it causes segmentation faults - and > > > I was hoping for some clarification as to how to properly handle this > > > problem. > >
> > Do you have an actual implementation that segfaults? Does the above clarify what I meant?
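As a concrete, hedged companion to the pseudo code above, the sketch below fleshes out the same idea into a single self-contained translation unit. The free-list container, the name freelist_allocator, and the exact include paths are illustrative choices, not HPX facilities or code from this thread; only serialize_buffer, HPX_PLAIN_ACTION, and hpx::lcos::local::spinlock are actual HPX names. The key property is that all allocator state is static, i.e. locality-local, so deserialization on the receiving locality draws memory from the receiver's own pool and never touches pointers from the sender's free list:

#include <hpx/include/actions.hpp>
#include <hpx/lcos/local/spinlock.hpp>
#include <hpx/runtime/serialization/serialize_buffer.hpp>

#include <cstddef>
#include <map>
#include <mutex>
#include <stack>

// A "stateless" allocator: every bit of state is static and therefore private
// to the process (locality) it lives in. Serializing an instance transfers
// nothing; the receiver simply uses its *own* static free list.
template <typename T>
struct freelist_allocator
{
    using value_type = T;

    T* allocate(std::size_t count)
    {
        std::lock_guard<hpx::lcos::local::spinlock> lock(mtx_);
        std::stack<T*>& entries = free_list_[count];
        if (!entries.empty())
        {
            T* p = entries.top();
            entries.pop();
            return p;
        }
        // Nothing cached for this size yet: fall back to the system heap.
        return static_cast<T*>(::operator new(count * sizeof(T)));
    }

    void deallocate(T* p, std::size_t count)
    {
        // Recycle the block instead of returning it to the system.
        std::lock_guard<hpx::lcos::local::spinlock> lock(mtx_);
        free_list_[count].push(p);
    }

    // Nothing to send over the wire: the free list is locality-local by design.
    template <typename Archive>
    void serialize(Archive&, unsigned) {}

private:
    static hpx::lcos::local::spinlock mtx_;
    static std::map<std::size_t, std::stack<T*>> free_list_;
};

template <typename T>
hpx::lcos::local::spinlock freelist_allocator<T>::mtx_;
template <typename T>
std::map<std::size_t, std::stack<T*>> freelist_allocator<T>::free_list_;

using buffer_type =
    hpx::serialization::serialize_buffer<float, freelist_allocator<float>>;

void f(buffer_type buf)
{
    // buf.data() points into *this* locality's address space; it was filled
    // during deserialization from this locality's own free list.
}
HPX_PLAIN_ACTION(f, f_action);

void g(hpx::id_type const& there)
{
    // 100 floats, allocated here from the local free list (F_0 in the text).
    buffer_type buf(100);
    // On 'there', deserialization allocates from that locality's list (F_1).
    f_action()(there, buf);
}

Whether the free list should additionally be thread-local or lock-free (as suggested further down in the thread) is an orthogonal tuning decision; the spinlock here is simply the smallest thing that is safe to use from HPX threads.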
> > > > > > > > > > > > > > [snip] > > > > > > > > In general, there are a large number of Matrix objects created and > > > > > destructed - there is, essentially, a necessity to use a custom > > > > > allocator to manage the allocation/deallocation of memory in the > > > > > > program. > > > > > > > Alright. > > > > > > > > > The first and naive attempt that I made (currently, it's all that > > > > > I've > > > > > done) is a Matrix_Data_Allocator class, which manages a memory pool. > > > > > [1] The free_list is a static object in the allocator class, and > > > > > the allocate and deallocate functions are static functions. > > > > > Similarly, the mutex is also a static member of the allocator class. > > > > > > > > Ok. A possible optimization would be to either use thread local free > > > > lists > > > > > > or > > > > > > > lockfree/waitfree ones. > > > > > > If I understand correctly - thread-local would take more memory but > > > would completely eliminate contention and, therefore, the need for any > > > mutex at all. Lockfree/waitfree would not necessarily use more > > > memory, but would prevent locking and improve performance during times > > of high contention. > > > Sound about right? > > > > Right! And, for a matter of fact, thread-local and locality-local isn't > that far > > apart ;) > > > > > > > > > > The obvious problem with this is that although it should work fine > > > > > with a single locality, it is clearly going to cause segmentation > > > > > faults in a distributed app. Although, from my understanding of > > > > > the serialization code in HPX, the transfer of a Matrix from the > > > > > main locality to a remote locality to calculate the model fitness > > > > > does not use the Matrix allocator -- allocation is handled by the > > > > > serialization code, all other constructors/destructors will be a > problem. > > > > > > > > Well, what happens during serialization is that the data is copied > > > > over > > > > > > the > > > > > > > network and in the case of a container with dynamic size, you > > > > allocate > > > > > > your > > > > > > > memory and then copy the received data (inside of the archive) into > > > > the newly created objects. > > > > > > Sorry, I'm just a little stuck on your wording "you allocate your > > > memory and then copy...." > > > > My fault ... I tend to personalize the code that gets executed ... so yes, > > serialize_buffer handles memory management for you, even in the case > > when you send it over the wire. It should always contain a correct buffer. > > > > > > > > As I understand from the code, the allocation of memory for the > > > underlying serialize_buffer member is already defined in the > > > serialize_buffer class, and will use the Allocator type passed as a > > > template parameter to serialize_buffer<T, Allocator>. > > > > > > Consequently, in my own code, I've followed the 1d_stencil_8.cpp > > > example: I do not use my custom allocator as a template parameter for > > > the serialize_buffer -- rather, the allocation is done in my > > > Matrix_Data class and the pointer then passed to the serialize_buffer > > > with init_mode = bufferA_type::take, along with a pointer to the custom > > deallocator function. 
> > > The Matrix_Data constructor definition is:
> > >
> > > core::detail::Matrix_Data::Matrix_Data(int64_t elements)
> > >   : data_buffer_elements_(elements),
> > >     data_{alloc_.allocate(data_buffer_elements_ * sizeof(data_type)),
> > >           static_cast<size_t>(data_buffer_elements_), buffer_type::take,
> > >           &Matrix_Data::deallocate}
> > > {}
> > >
> > > n.b. the class detail::Matrix_Data is a member of my Matrix class, and > > > it handles the memory management for the matrix. > > > > That shouldn't matter at all. If the data is serialized to another locality, > > serialize_buffer will eventually allocate new memory on the other locality > > using its own, internal allocator. The pointer to the data is then, of course, > > not obtained from your free list. That is, unless you instantiate > > serialize_buffer with your custom allocator. > > > > > > > > > I don't think that creates any problems for you. The allocator you > > > > described above only carries global state (through the static variables). So > > > > the serialization of the allocator would essentially do nothing > > > > (Look at it as a tag on which allocator to use). So when receiving a new serialize_buffer > > > > and deserializing it, you just allocate memory from the locality local free > > > > list (the same should happen when deallocating the memory). > > > > > > I'm confused here by your wording: "you just allocate memory from the > > > locality local free list". Where did I get a locality local free > > > list? The only thing I can think of is that I would just not include > > > the free list in the archive object used for serialization. But I'm > > > not sure if this is your intent... > > > > The static member of your allocator is a "locality local" free list. > > > > > > > > If I'm completely mistaken here, then I'm hoping you might be able to > > > better clarify for me your intent. > > > > > > > > The most obvious way to work around the problem that comes to my > > > > > mind would be changing the free_list (and mutex) into a > > > > > std::map<std::uint32_t, free_list_type> (and > > > > > std::map<std::uint32_t, mutex>) so that each locality has a > > > > > separate mutex, but something about this seems to me to be wrong > > > > > -- it requires the allocator to be tightly coupled with the HPX > > > > > runtime, so that the allocator can call hpx::get_locality_id() to > > > > > index the appropriate free_list. > > > > > > > > I don't think that is needed at all. static variables are not part > > > > of AGAS, they > > > > are local to your process. > > > > > > I realize that the static variables are not part of the AGAS -- in > > > fact, that's exactly what's causing me the confusion here (at least in > > > my mind...). To be slightly more specific, the issue in my mind isn't > > > the static variable itself, but what is contained within the static > > > variable -- i.e. pointers which are valid only within a particular > > > locality's address space. > > > > Right, which you shouldn't send over the wire and try to use in a different > > address space to dereference the memory pointed to ;) > > > > > > > > > > Similarly, the Model class (injected into the Model_Driver > > > > > component) > > > > > -- which is where a large proportion of the Matrix allocations > > > > > occurs -- also presently is not coupled at all to the HPX runtime.
> > > > > Although, conceivably, Model_Driver could provide a locality_id to > > > > > the Model class (to then pass along to a Matrix?). Although my > > > > > first inclination is that a Matrix class should not have knowledge > > > > > of the [distributed] architecture on which it runs, perhaps when > > > > > dealing with a distributed program architecture it is necessary > > > > > to create distributed-type classes explicitly > > > > > -- i.e. something like class Distributed_Matrix : public Matrix > > > > > {..};. Having said that, those are merely some > > > > > speculations which came to mind while trying to organize my > > > > > thoughts and present this question. It still remains unclear in my > > > > > mind, however. Something tells me that there must be a better > > > > > way to deal with this. Hopefully, people with more brains and > > > > > experience can provide me with some insight and guidance. > > > > > > > > I hope the description above sheds some light on it; the matrix > > > > class doesn't > > > > need any locality information, unless you want to create a truly > > > > distributed > > > > data structure (as opposed to just a regular container that is sent > > > > over the > > > > wire). > > > > > > I don't think that is what I want to do... It would seem to me that a > > > Matrix class should be completely agnostic (or at least as completely > > > as possible) of the environment in which it is used. > > > > > > > > I would greatly appreciate any suggestions that you can offer. If > > > > > you require further details of my code, please let me know and I'd > > > > > be more than happy to elaborate further. However, I think that the > > > > > problem itself is fairly generic and is relevant to most code > > > > > which is written for a distributed environment - especially where > > > > > the parallelism isn't handled explicitly in the code (as opposed > > > > > to an MPI program, for example, where this is far more > > straightforward). > > > > > > > > > > Thanks and best regards, > > > > > Shmuel Levine > > > > > > > > > > > > > > > [1] The actual code is slightly more complicated than the above > > > > > description, although I don't think that it changes the nature of > > > > > the question or the appropriate solution significantly. In > > > > > particular, each set of parameters is typically a > > > > > std::vector<Matrix>, where each Matrix is a different size. In > > > > > other words, the code uses multiple matrix sizes, although the > > > > > number of different sizes is constrained to the dimension of the > > > > > parameter vector above.
The actual allocator definition is as > follows: > > > > > > > > > > class Matrix_Allocator { > > > > > > > > > > public: > > > > > using T = float; > > > > > using data_type = T; > > > > > static const int64_t alignment = 64; > > > > > > > > > > private: > > > > > using mutex_type = hpx::lcos::local::spinlock; > > > > > using free_list_type = std::map<int64_t, std::stack<T *>>; > > > > > using allocation_list_type = std::map<T *, int64_t>; > > > > > > > > > > public: > > > > > Matrix_Allocator() {} > > > > > ~Matrix_Allocator(); > > > > > Matrix_Allocator(Matrix_Allocator const &) = delete; > > > > > Matrix_Allocator(Matrix_Allocator &&) = delete; > > > > > > > > > > static T *allocate(int64_t n); > > > > > static void deallocate(T *p); > > > > > > > > > > private: > > > > > static mutex_type mtx_; > > > > > static free_list_type free_list_; > > > > > static allocation_list_type allocation_list_; > > > > > > > > > > }; // class Matrix_Allocator > > > > > > > > > > The allocation_list_ is used to track the allocated size of a > > > > > given pointer, to determine to which free_list should the pointer > > > > > be added upon destruction of a matrix. > > > > > > > > > > _______________________________________________ > > > > > hpx-users mailing list > > > > > [email protected] > > > > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users > > > > > > > > -- > > > > Thomas Heller > > > > Friedrich-Alexander-Universität Erlangen-Nürnberg Department > > > > Informatik - Lehrstuhl Rechnerarchitektur Martensstr. 3 > > > > 91058 Erlangen > > > > Tel.: 09131/85-27018 > > > > Fax: 09131/85-27912 > > > > Email: [email protected] > > > > _______________________________________________ > > > > hpx-users mailing list > > > > [email protected] > > > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users > > > > > > _______________________________________________ > > > hpx-users mailing list > > > [email protected] > > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users > > > > > > _______________________________________________ > > hpx-users mailing list > > [email protected] > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users > > _______________________________________________ > hpx-users mailing list > [email protected] > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
