On Monday, 29 August 2016 14:29:32 CEST Michael Levine wrote:
> Hi Thomas,
>
> > -----Original Message-----
> > From: [email protected] [mailto:hpx-users-
> > [email protected]] On Behalf Of Thomas Heller
> > Sent: August-29-16 6:11 AM
> > To: [email protected]
> > Subject: Re: [hpx-users] Memory management in a distributed app
> >
> > Hi,
> >
> > On 08/28/2016 06:06 PM, Shmuel Levine wrote:
> > > Hi All,
> > >
> > > I've finally found a bit of time once again to work on my hobby
> > > project with HPX... The long break actually gave me a fresh
> > > perspective on my own code, and it occurred to me that my code has
> > > some serious issues with memory management, and I'm hoping that
> > > someone can help to provide me with some better insight into how to
> > > best handle memory management while working in a distributed app. In
> > > particular, I would greatly appreciate some specific guidance on how
> > > to address the issue in my own code, since I'm at a bit of a loss here
> >
> > Let me try to answer your question. I am not sure I understood everything
> > correctly though...
>
> Thanks for your thorough message. I also have a few questions about your
> message.
>
> However, let me preface this with two important points. Firstly, I don't
> have any academic or professional background in CS - I'm basically
> self-taught, so I might be somewhat naïve or unsophisticated in my
> understanding of some areas. I apologize if this leads to any confusion.
> Secondly, I think it might be useful for me to start by clarifying the
> unstated assumption(s) in my last message about the potential problems that
> I had thought would be an issue.
>
> \begin {assumptions}
> Firstly - here's how I had visualized the concept of memory management in
> this distributed system:
>
> To start with, I considered the case of a single locality. There are a few
> variables stored on the stack, a free_list and mutex in static variable
> storage space, and a number of pointers to memory chunks on the heap. The
> pointers themselves are provided by the local machine's allocator and refer
> specifically to the process's address space. Given the list of pointers to
> allocated memory P = {p1, p2, ..., pn}, every pointer in this case is valid
> and can be accessed freely. Incidentally, there shouldn't even be a
> segmentation fault associated with (mis-)using memory that has already been
> allocated, since from the OS's point-of-view, the process is using an
> allocated region of memory within the process's address space.
>
> Next, I considered what would happen with separate processes, let's call
> them L0 and L1. If I understand correctly, it wouldn't matter whether these
> are on separate machines or on a single machine.
> Let's say on process 0, I instantiate a bunch of Matrix objects. These use
> allocated memory segments at P = {p1, p2, ..., pn}, as before. For this
> hypothetical example, I've also finished with some of those Matrix objects
> so that my free list on process 0 -- call it F_0 -- contains P', which is a
> subset of P. Next, I pass a Matrix to an action on a component residing on
> process 1. Again, the main assumption here is:
> - I had assumed that the static free list would also be copied to locality
> 1, so that F_1 == F_0, and both contain the same list P'.
>
> Now, the code running on L1 calls a Matrix constructor with the same static
> allocator containing list F_1. As mentioned in my above assumptions, F_1
> contains pointers P' -- all of which are pointers to memory allocated within
> L0 process memory space. Considering the first pointer p' on the
> free_list, on L1, the address pointed-to by p' was not allocated within
> L1's address space. As such, I assume that any access of this space would
> cause a segmentation fault.
>
> As a final underlying assumption -- following my earlier understanding that
> the runtime system handles the allocation of the new matrix when the matrix
> is deserialized on L1 (let's call the Matrix on L1 "m1"): m1's data is
> deserialized into p'', which is a pointer allocated within L1's address
> space. When m1 goes out of scope, p'' can be added to F_1 without a problem.
> Another matrix on L1 -- say m1' -- can safely grab p''.
>
> \end {assumptions}
Please take a look at this pseudo code:

template <typename T>
struct allocator
{
    T *allocate(std::size_t count)
    {
        return free_list_.get(count);
    }

    void deallocate(T* p, std::size_t count)
    {
        free_list_.push(p, count);
    }

    static free_list<T> free_list_;

    template <typename Archive>
    void serialize(Archive& ar, unsigned)
    {
        // This is empty, we don't really maintain any state.
    }
};

void f(serialize_buffer<float, allocator<float> > buf)
{
}
HPX_PLAIN_ACTION(f);

void g()
{
    // Create a serialization buffer object.
    serialize_buffer<float, allocator<float> > buf(100);
    // We now have 100 floats allocated using our allocator.
    //
    // The source locality (L0) has its own static free_list<T> F_0, which
    // might contain various entries.
    //
    // The memory allocated for buf is now pointed to by the valid pointer P_0.
    //
    // We want to call f on another locality...
    id_type there = ...;
    // buf will now get sent to 'there'. What will happen is that we copy
    // the content of buf over the network to 'there'.
    // Once 'there' has received this message (we call that message a parcel),
    // it needs to deserialize it. In order to do that, it needs to allocate
    // memory for 100 floats, using our allocator with its own process-private
    // free list. The two localities never need to share the pointers in the
    // free list. Compare it to having a thread-local free list.
    f_action()(there, buf);
}
>
> I did not mean to ever suggest in my previous message that my design was
> _correct_. On the contrary - it causes segmentation faults - and I was
> hoping for some clarification as to how to properly handle this problem.
Do you have an actual implementation that segfaults? Does the above clarify
what I meant?
>
>
>
> [snip]
>
> > > In general, there are a large number of Matrix objects created and
> > > destructed - there is, essentially, a necessity to use a custom
> > > allocator to manage the allocation/deallocation of memory in the
> > > program.
>
> > Alright.
> >
> > > The first and naive attempt that I made (currently, it's all that I've
> > > done) is a Matrix_Data_Allocator class, which manages a memory pool.
> > > [1] The free_list is a static object in the allocator class, and the
> > > allocate and deallocate functions are static functions. Similarly, the
> > > mutex is also a static member of the allocator class.
> >
> > Ok. A possible optimization would be to either use thread local free lists
> > or lockfree/waitfree ones.
>
> If I understand correctly - thread-local would take more memory but would
> completely eliminate contention and, therefore, the need for any mutex at
> all. Lockfree/waitfree would not necessarily use more memory, but would
> prevent locking and improve performance during times of high contention.
> Sound about right?
Right! And, as a matter of fact, thread-local and locality-local aren't that
far apart ;)
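
Just to illustrate what I mean, here is a rough sketch of the thread-local
variant (pseudo code again, reusing the hypothetical free_list<T> type from
above):

template <typename T>
struct thread_local_allocator
{
    T *allocate(std::size_t count)
    {
        // Every worker thread owns its own free list, so no mutex is needed.
        return get_free_list().get(count);
    }

    void deallocate(T *p, std::size_t count)
    {
        get_free_list().push(p, count);
    }

    template <typename Archive>
    void serialize(Archive& ar, unsigned)
    {
        // Still stateless as far as serialization is concerned.
    }

private:
    static free_list<T>& get_free_list()
    {
        // One instance per thread (and therefore per locality/process).
        static thread_local free_list<T> list;
        return list;
    }
};

Since the worker threads live as long as the runtime, each of them simply
keeps its own pool around.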
>
> > > The obvious problem with this is that although it should work fine
> > > with a single locality, it is clearly going to cause segmentation
> > > faults in a distributed app. Although, from my understanding of the
> > > serialization code in HPX, the transfer of a Matrix from the main
> > > locality to a remote locality to calculate the model fitness does not
> > > use the Matrix allocator -- allocation is handled by the serialization
> > > code, all other constructors/destructors will be a problem.
> >
> > Well, what happens during serialization is that the data is copied over
> > the network and, in the case of a container with dynamic size, you allocate
> > your memory and then copy the received data (inside of the archive) into
> > the newly created objects.
>
> Sorry, I'm just a little stuck on your wording "you allocate your memory and
> then copy...."
My fault ... I tend to personalize the code that gets executed ... so yes,
serialize_buffer handles memory management for you, even when you send it over
the wire. It should always contain a valid buffer.
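
To make this concrete, here is a minimal sketch (hypothetical function name,
instantiated without your custom allocator, i.e. with the default
std::allocator):

void consume(serialize_buffer<float> buf)
{
    // The memory backing buf was allocated on *this* locality while the
    // parcel was deserialized; serialize_buffer releases it again once the
    // last copy of buf goes out of scope.
}
HPX_PLAIN_ACTION(consume);

The sending side keeps its own buffer alive independently; neither locality
ever sees the other's pointers.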
>
> As I understand from the code, the allocation of memory for the underlying
> serialize_buffer member is already defined in the serialize_buffer class,
> and will use the Allocator type passed as a template parameter to
> serialize_buffer<T, Allocator>.
>
> Consequently, in my own code, I've followed the 1d_stencil_8.cpp example: I
> do not use my custom allocator as a template parameter for the
> serialize_buffer -- rather, the allocation is done in my Matrix_Data class
> and the pointer then passed to the serialize_buffer with init_mode =
> buffer_type::take, along with a pointer to the custom deallocator function.
> The Matrix_Data constructor definition is:
>
> core::detail::Matrix_Data::Matrix_Data(int64_t elements)
>   : data_buffer_elements_(elements)
>   , data_{alloc_.allocate(data_buffer_elements_ * sizeof(data_type)),
>           static_cast<size_t>(data_buffer_elements_), buffer_type::take,
>           &Matrix_Data::deallocate}
> {}
>
> n.b. the class detail::Matrix_Data is a member of my Matrix class, and it
> handles the memory management for the matrix.
That shouldn't matter at all. If the data is serialized to another locality,
serialize_buffer will eventually allocate new memory on the other locality
using its own internal allocator. The pointer to the data is then, of
course, not obtained from your free list. That is, unless you instantiate
serialize_buffer with your custom allocator.
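
Sketched with the pseudo-code allocator from above, the difference is just the
second template parameter:

// Default allocator: on the receiving locality the buffer is allocated
// through std::allocator<float>; your free list is never consulted.
using plain_buffer  = serialize_buffer<float>;

// Custom allocator: both localities allocate and deallocate through
// allocator<float>, i.e. through their own locality-local free lists.
using pooled_buffer = serialize_buffer<float, allocator<float> >;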
>
> > I don't think that creates any problems for you. The allocator you
> > described above only carries global state (through the static variables).
> > So the serialization of the allocator would essentially do nothing (look
> > at it as a tag on which allocator to use). So when receiving a new
> > serialize_buffer and deserializing it, you just allocate memory from the
> > locality-local free list (the same should happen when deallocating the
> > memory).
>
> I'm confused here by your wording: "you just allocate memory from the
> locality local free list". Where did I get a locality local free list? The
> only thing I can think of is that I would just not include the free list in
> the archive object used for serialization. But I'm not sure if this is
> your intent...
The static member of your allocator is a "locality local" free list.
>
> If I'm completely mistaken here, then I'm hoping you might be able to better
> clarify for me your intent.
>
> > > The most obvious way to work around the problem that comes to my mind
> > > would be changing the free_list (and mutex) into a
> > > std::map<std::uint32_t, free_list_type> (and
> > > std::map<std::uint32_t,mutex>) so that each locality has a separate
> > > mutex, but something about this seems to me to be wrong -- it requires
> > > the allocator to be tightly-coupled with the HPX runtime, so that the
> > > allocator can call hpx::get_locality_id() to index the appropriate
> > > free_list.
> >
> > I don't think that is needed at all. static variables are not part of
> > AGAS, they are local to your process.
>
> I realize that the static variables are not part of the AGAS -- in fact,
> that's exactly what's causing me the confusion here (at least in my
> mind...). To be slightly more specific, the issue in my mind isn't the
> static variable itself, but what is contained within the static variable --
> i.e. pointers which are valid only within a particular locality's address
> space.
Right, and that is exactly why you shouldn't send those pointers over the wire
and try to dereference the memory they point to in a different address space ;)
>
> > > Similarly, the Model class (injected into the Model_Driver component)
> > > -- which is where a large proportion of the Matrix allocations
> > > occurs -- is also presently not coupled at all to the HPX runtime.
> > > Although, conceivably, Model_Driver could provide a locality_id to the
> > > Model class (to then pass along to a Matrix?). Although my first
> > > inclination is that a Matrix class should not have knowledge of the
> > > [distributed] architecture on which it runs, perhaps when dealing
> > > with a distributed program architecture, it is necessary to create
> > > distributed-type classes
> > > -- i.e. something like class Distributed_Matrix : public Matrix {..};
> > > explicitly. Having said that, those are merely some speculations which
> > > came to mind while trying to organize my thoughts and present this
> > > question. It still remains unclear in my mind, however. Something
> > > tells me that there must be a better way to deal with this. Hopefully,
> > > people with more brains and experience can provide me with some
> > > insight and guidance.
> >
> > I hope the description above sheds some light on it: the matrix class
> > doesn't need any locality information, unless you want to create a truly
> > distributed data structure (as opposed to just a regular container that is
> > sent over the wire).
>
> I don't think that is what I want to do... It would seem to me that a
> Matrix class should be completely agnostic (or at least as completely as
> possible) of the environment in which it is used.
>
> > > I would greatly appreciate any suggestions that you can offer. If you
> > > require further details of my code, please let me know and I'd be more
> > > than happy to elaborate further. However, I think that the problem
> > > itself is fairly generic and is relevant to most code which is written
> > > for a distributed environment - especially where the parallelism isn't
> > > handled explicitly in the code (as opposed to an MPI program, for
> > > example, where this is far more straightforward).
> > >
> > > Thanks and best regards,
> > > Shmuel Levine
> > >
> > >
> > > [1] The actual code is slightly more complicated than the above
> > > description, although I don't think that it changes the nature of the
> > > question or the appropriate solution significantly. In particular,
> > > each set of parameters is typically a std::vector<Matrix>, where each
> > > Matrix is a different size. In other words, the code uses multiple
> > > matrix sizes, although the number of different sizes is constrained to
> > > the dimension of the parameter vector above. The actual allocator
> > > definition is as follows:
> > >
> > > class Matrix_Allocator {
> > > public:
> > >   using T = float;
> > >   using data_type = T;
> > >   static const int64_t alignment = 64;
> > >
> > > private:
> > >   using mutex_type = hpx::lcos::local::spinlock;
> > >   using free_list_type = std::map<int64_t, std::stack<T *>>;
> > >   using allocation_list_type = std::map<T *, int64_t>;
> > >
> > > public:
> > >   Matrix_Allocator() {}
> > >   ~Matrix_Allocator();
> > >   Matrix_Allocator(Matrix_Allocator const &) = delete;
> > >   Matrix_Allocator(Matrix_Allocator &&) = delete;
> > >
> > >   static T *allocate(int64_t n);
> > >   static void deallocate(T *p);
> > >
> > > private:
> > >   static mutex_type mtx_;
> > >   static free_list_type free_list_;
> > >   static allocation_list_type allocation_list_;
> > > }; // class Matrix_Allocator
> > >
> > > The allocation_list_ is used to track the allocated size of a given
> > > pointer, to determine to which free_list the pointer should be added
> > > upon destruction of a matrix.
> > >
> >
> > --
> > Thomas Heller
> > Friedrich-Alexander-Universität Erlangen-Nürnberg Department Informatik -
> > Lehrstuhl Rechnerarchitektur Martensstr. 3
> > 91058 Erlangen
> > Tel.: 09131/85-27018
> > Fax: 09131/85-27912
> > Email: [email protected]
>
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users