On 29.08.2016 at 10:04 PM, "Michael Levine" <[email protected]> wrote: > > So, if I've finally understood what you're telling me- > > class A { > static int i_; > }; > > class B { > static A a_; > /* various state */ > }; > > Locality 0 keeps a version of B::a_ in the process's static variable memory > space. > Locality 1 keeps a completely separate version of B::a_ in its own process > static variable memory space. > > Any object of type B constructed on locality 0 - either in a locally-called > function or through a message deserialized on locality 0 - will always refer > to the same static object B::a_. > Any object of type B constructed on locality 1 - either in a locally-called > function or through a message deserialized on locality 1 - will always refer > to the same static object B::a_. > > In other words, by default, any global object that is not explicitly > synchronized somehow will be locality-local. > > Does that sound about right?
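For illustration, here is a minimal sketch of that picture, assuming HPX's plain-action API (HPX_PLAIN_ACTION, hpx::find_all_localities); the function name bump_and_report and the exact headers are illustrative, not taken from this thread. Each locality the action runs on increments its own copy of B::a_.i_, so the counters on locality 0 and locality 1 evolve independently:

#include <hpx/hpx_main.hpp>
#include <hpx/include/actions.hpp>
#include <hpx/include/runtime.hpp>

#include <cstddef>
#include <iostream>
#include <vector>

struct A { static int i_; };
int A::i_ = 0;

struct B { static A a_; /* various state */ };
A B::a_;

// Runs on whichever locality the action is sent to; it only ever touches
// that process's own copy of B::a_ (and hence its own A::i_).
int bump_and_report()
{
    return ++B::a_.i_;
}
HPX_PLAIN_ACTION(bump_and_report, bump_and_report_action);

int main()
{
    std::vector<hpx::id_type> localities = hpx::find_all_localities();
    for (std::size_t i = 0; i != localities.size(); ++i)
    {
        // Each locality reports its own, independent counter value.
        std::cout << "locality " << i << ": "
                  << bump_and_report_action()(localities[i]) << "\n";
    }
    return 0;
}

Started on two localities, each process counts only the calls routed to it; nothing about B::a_ is shared or synchronized between them.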
Yes, exactly! Conceptually, the only difference between a regular global and one in a class is the scope and potential access control (private, public etc). So it doesn't really belong to a single object's state. > > > > -----Original Message----- > > From: [email protected] [mailto:hpx-users- > > [email protected]] On Behalf Of Thomas Heller > > Sent: August-29-16 3:32 PM > > To: [email protected] > > Subject: Re: [hpx-users] Memory management in a distributed app > > > > On Montag, 29. August 2016 14:29:32 CEST Michael Levine wrote: > > > Hi Thomas, > > > > > > > -----Original Message----- > > > > From: [email protected] [mailto:hpx-users- > > > > [email protected]] On Behalf Of Thomas Heller > > > > Sent: August-29-16 6:11 AM > > > > To: [email protected] > > > > Subject: Re: [hpx-users] Memory management in a distributed app > > > > > > > > Hi, > > > > > > > > On 08/28/2016 06:06 PM, Shmuel Levine wrote: > > > > > Hi All, > > > > > > > > > > I've finally found a bit of time once again to work on my hobby > > > > > project with HPX... The long break actually gave me a fresh > > > > > perspective on my own code, and it occurred to me that my code has > > > > > some serious issues with memory management, and I'm hoping that > > > > > someone can help to provide me with some better insight into how > > > > > to best handle memory management while working in a distributed > > > > > app. In particular, I would greatly appreciate some specific > > > > > guidance on how to address the issue in my own code, since I'm at > > > > > a bit of a loss here > > > > > > > > Let me try to answer your question. I am not sure I understood > > > > everything correctly though... > > > > > > Thanks for your thorough message. I, as well, have a few questions on > > > your message. > > > > > > However, let me preface two important points. Firstly, I don't have > > > either academic or professional background in CS - I'm basically > > > self-taught. So I might be somewhat naïve or unsophisticated in > > understanding of some areas. > > > I apologize if this leads to any confusion. Secondly, I think it > > > might be useful for me to start by clarifying the unstated > > > assumption(s) in my last message about the potential problems that I had > > thought would be an issue. > > > > > > \begin {assumptions} > > > Firstly - here's how I had visualized the concept of memory management > > > in this distributed system: > > > > > > To start with, I considered the case of a single locality. There are > > > a few variables stored on the stack, a free_list and mutex in static > > > variable storage space, and a number of pointers to memory chunks on > > > the heap. The pointers themselves are provided by the local machine's > > > allocator and refer specifically to the process's address space. > > > Given the list of pointers to allocated memory P = {p1, p2, ..., pn}, > > > every pointer in this case is valid and can be accessed freely. > > > Incidentally, there shouldn't even be a segmentation fault associated > > > with (mis-)using memory that has already been allocated, since from > > > the OS's point-of-view, the process is using an allocated region of > memory > > within the process's address space. > > > > > > Next, I considered what would happen with separate processes, let's > > > call them L0 and L1. If I understand correctly, it wouldn't matter > > > whether these are on separate machines or on a single machine. > > > Let's say on process 0, I instantiate a bunch of Matrix objects. 
> > > These use allocated memory segments at P = {p1, p2, ..., pn}, as > > > before. For this hypothetical example, I've also finished with some > > > of those Matrix objects so that my free list on process 0 -- call it > > > F_0 -- contains P', which is a subset of P. Next, I pass a Matrix to > > > an action on a component residing on process 1. Again, the main > > assumption here is: > > > - I had assumed that the static free list would also be copied to > > > locality 1, so that F_1 == F_0, and both contain the same list P'. > > > > > > Now, the code running on L1 calls a Matrix constructor with the same > > > static allocator containing list F_1. As mentioned in my above > > > assumptions, F_1 contains pointers P' -- all of which are pointers to > > > memory allocated within > > > L0's process memory space. Considering the first pointer p' on the > > > free_list, on L1, the address pointed to by p' was not allocated > > > within L1's address space. As such, I assume that any access of this > > > space would cause a segmentation fault. > > > > > > As a final underlying assumption -- following my earlier understanding > > > that the runtime system handles the allocation of the new matrix when > > > the matrix is de-serialized on L1 (let's call the Matrix on L1 > > > m1): m1's data is deserialized into p'', which is a pointer allocated > > > within L1's address space. When m1 goes out of scope, p'' can be > > > added to F_1 without a problem. Another matrix on L1 -- say m1' -- can > > safely grab p''. > > > > > > \end {assumptions} > >
> > Please take a look at this pseudo code:
> >
> > template <typename T>
> > struct allocator
> > {
> >     T *allocate(std::size_t count)
> >     {
> >         return free_list_.get(count);
> >     }
> >
> >     void deallocate(T* p, std::size_t count)
> >     {
> >         free_list_.push(p, count);
> >     }
> >
> >     static free_list<T> free_list_;
> >
> >     template <typename Archive>
> >     void serialize(Archive& ar, unsigned)
> >     {
> >         // This is empty, we don't really maintain state.
> >     }
> > };
> >
> > void f(serialize_buffer<float, allocator<float> > buf) { }
> >
> > HPX_PLAIN_ACTION(f)
> >
> > void g()
> > {
> >     // Create a serialization buffer object.
> >     serialize_buffer<float, allocator<float> > buf(100);
> >     // We now have 100 floats allocated using our allocator.
> >     //
> >     // The source locality (L0) has its own static free_list<T> F_0, which
> >     // might contain various entries.
> >     //
> >     // The memory allocated for buf is now pointed to by the valid pointer P_0.
> >     //
> >     // We want to call f on another locality...
> >     id_type there = ...;
> >     // Buf will now get sent to 'there'. What will happen is that we copy
> >     // the content of buf over the network to 'there'.
> >     // Once 'there' has received this message (we call such a message a parcel),
> >     // it needs to deserialize it. In order to do that, it needs to allocate
> >     // memory for 100 floats, using our allocator with its own process-private
> >     // free list. The two localities do not need to share the pointers in
> >     // their free lists. Compare it to having a thread-local free list.
> >     f_action()(there, buf);
> > }
> >
> > > I did not mean to ever suggest in my previous message that my design > > > was _correct_. On the contrary - it causes segmentation faults - and > > > I was hoping for some clarification as to how to properly handle this > > > problem. > >
> > Do you have an actual implementation that segfaults? Does the above clarify what I meant?
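As a concrete, hedged companion to the pseudo code above, the sketch below fleshes out the same idea into a single self-contained translation unit. The free-list container, the name freelist_allocator, and the exact include paths are illustrative choices, not HPX facilities or code from this thread; only serialize_buffer, HPX_PLAIN_ACTION, and hpx::lcos::local::spinlock are actual HPX names. The key property is that all allocator state is static, i.e. locality-local, so deserialization on the receiving locality draws memory from the receiver's own pool and never touches pointers from the sender's free list:

#include <hpx/include/actions.hpp>
#include <hpx/lcos/local/spinlock.hpp>
#include <hpx/runtime/serialization/serialize_buffer.hpp>

#include <cstddef>
#include <map>
#include <mutex>
#include <stack>

// A "stateless" allocator: every bit of state is static and therefore private
// to the process (locality) it lives in. Serializing an instance transfers
// nothing; the receiver simply uses its *own* static free list.
template <typename T>
struct freelist_allocator
{
    using value_type = T;

    T* allocate(std::size_t count)
    {
        std::lock_guard<hpx::lcos::local::spinlock> lock(mtx_);
        std::stack<T*>& entries = free_list_[count];
        if (!entries.empty())
        {
            T* p = entries.top();
            entries.pop();
            return p;
        }
        // Nothing cached for this size yet: fall back to the system heap.
        return static_cast<T*>(::operator new(count * sizeof(T)));
    }

    void deallocate(T* p, std::size_t count)
    {
        // Recycle the block instead of returning it to the system.
        std::lock_guard<hpx::lcos::local::spinlock> lock(mtx_);
        free_list_[count].push(p);
    }

    // Nothing to send over the wire: the free list is locality-local by design.
    template <typename Archive>
    void serialize(Archive&, unsigned) {}

private:
    static hpx::lcos::local::spinlock mtx_;
    static std::map<std::size_t, std::stack<T*>> free_list_;
};

template <typename T>
hpx::lcos::local::spinlock freelist_allocator<T>::mtx_;
template <typename T>
std::map<std::size_t, std::stack<T*>> freelist_allocator<T>::free_list_;

using buffer_type =
    hpx::serialization::serialize_buffer<float, freelist_allocator<float>>;

void f(buffer_type buf)
{
    // buf.data() points into *this* locality's address space; it was filled
    // during deserialization from this locality's own free list.
}
HPX_PLAIN_ACTION(f, f_action);

void g(hpx::id_type const& there)
{
    // 100 floats, allocated here from the local free list (F_0 in the text).
    buffer_type buf(100);
    // On 'there', deserialization allocates from that locality's list (F_1).
    f_action()(there, buf);
}

Whether the free list should additionally be thread-local or lock-free (as suggested further down in the thread) is an orthogonal tuning decision; the spinlock here is simply the smallest thing that is safe to use from HPX threads.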
> > > > > > > > > > > > > > [snip] > > > > > > > > In general, there are a large number of Matrix objects created and > > > > > destructed - there is, essentially, a necessity to use a custom > > > > > allocator to manage the allocation/deallocation of memory in the > > > > > > program. > > > > > > > Alright. > > > > > > > > > The first and naive attempt that I made (currently, it's all that > > > > > I've > > > > > done) is a Matrix_Data_Allocator class, which manages a memory pool. > > > > > [1] The free_list is a static object in the allocator class, and > > > > > the allocate and deallocate functions are static functions. > > > > > Similarly, the mutex is also a static member of the allocator class. > > > > > > > > Ok. A possible optimization would be to either use thread local free > > > > lists > > > > > > or > > > > > > > lockfree/waitfree ones. > > > > > > If I understand correctly - thread-local would take more memory but > > > would completely eliminate contention and, therefore, the need for any > > > mutex at all. Lockfree/waitfree would not necessarily use more > > > memory, but would prevent locking and improve performance during times > > of high contention. > > > Sound about right? > > > > Right! And, for a matter of fact, thread-local and locality-local isn't > that far > > apart ;) > > > > > > > > > > The obvious problem with this is that although it should work fine > > > > > with a single locality, it is clearly going to cause segmentation > > > > > faults in a distributed app. Although, from my understanding of > > > > > the serialization code in HPX, the transfer of a Matrix from the > > > > > main locality to a remote locality to calculate the model fitness > > > > > does not use the Matrix allocator -- allocation is handled by the > > > > > serialization code, all other constructors/destructors will be a > problem. > > > > > > > > Well, what happens during serialization is that the data is copied > > > > over > > > > > > the > > > > > > > network and in the case of a container with dynamic size, you > > > > allocate > > > > > > your > > > > > > > memory and then copy the received data (inside of the archive) into > > > > the newly created objects. > > > > > > Sorry, I'm just a little stuck on your wording "you allocate your > > > memory and then copy...." > > > > My fault ... I tend to personalize the code that gets executed ... so yes, > > serialize_buffer handles memory management for you, even in the case > > when you send it over the wire. It should always contain a correct buffer. > > > > > > > > As I understand from the code, the allocation of memory for the > > > underlying serialize_buffer member is already defined in the > > > serialize_buffer class, and will use the Allocator type passed as a > > > template parameter to serialize_buffer<T, Allocator>. > > > > > > Consequently, in my own code, I've followed the 1d_stencil_8.cpp > > > example: I do not use my custom allocator as a template parameter for > > > the serialize_buffer -- rather, the allocation is done in my > > > Matrix_Data class and the pointer then passed to the serialize_buffer > > > with init_mode = bufferA_type::take, along with a pointer to the custom > > deallocator function. 
> > > The Matrix_Data constructor definition is:
> > >
> > > core::detail::Matrix_Data::Matrix_Data(int64_t elements)
> > >   : data_buffer_elements_(elements),
> > >     data_{alloc_.allocate(data_buffer_elements_ * sizeof(data_type)),
> > >           static_cast<size_t>(data_buffer_elements_), buffer_type::take,
> > >           &Matrix_Data::deallocate}
> > > {}
> > >
> > > n.b. the class detail::Matrix_Data is a member of my Matrix class, and > > > it handles the memory management for the matrix. > > > > That shouldn't matter at all. If the data is serialized to another locality, > > serialize_buffer will eventually allocate new memory on the other locality > > using its own, internal allocator. The pointer to the data is then, of course, > > not obtained from your free list. That is, unless you instantiate > > serialize_buffer with your custom allocator. > > > > > > > > > I don't think that creates any problems for you. The allocator you > > > > described above only carries global state (through the static variables). So > > > > the serialization of the allocator would essentially do nothing > > > > (Look at it as a tag on which allocator to use). So when receiving a new serialize_buffer > > > > and deserializing it, you just allocate memory from the locality local free > > > > list (the same should happen when deallocating the memory). > > > > > > I'm confused here by your wording: "you just allocate memory from the > > > locality local free list". Where did I get a locality local free > > > list? The only thing I can think of is that I would just not include > > > the free list in the archive object used for serialization. But I'm > > > not sure if this is your intent... > > > > The static member of your allocator is a "locality local" free list. > > > > > > > > If I'm completely mistaken here, then I'm hoping you might be able to > > > better clarify for me your intent. > > > > > > > > The most obvious way to work around the problem that comes to my > > > > > mind would be changing the free_list (and mutex) into a > > > > > std::map<std::uint32_t, free_list_type> (and > > > > > std::map<std::uint32_t, mutex>) so that each locality has a > > > > > separate mutex, but something about this seems to me to be wrong > > > > > -- it requires the allocator to be tightly coupled with the HPX > > > > > runtime, so that the allocator can call hpx::get_locality_id() to > > > > > index the appropriate free_list. > > > > > > > > I don't think that is needed at all. static variables are not part > > > > of AGAS, they > > > > are local to your process. > > > > > > I realize that the static variables are not part of the AGAS -- in > > > fact, that's exactly what's causing me the confusion here (at least in > > > my mind...). To be slightly more specific, the issue in my mind isn't > > > the static variable itself, but what is contained within the static > > > variable -- i.e. pointers which are valid only within a particular > > > locality's address space. > > > > Right, which you shouldn't send over the wire and try to use in a different > > address space to dereference the memory pointed to ;) > > > > > > > > > > Similarly, the Model class (injected into the Model_Driver > > > > > component) > > > > > -- which is where a large proportion of the Matrix allocations > > > > > occurs -- also presently is not coupled at all to the HPX runtime.
> > > > > Although, conceivably, Model_Driver could provide a locality_id to > > > > > the Model class (to then pass along to a Matrix?). Although my > > > > > first inclination is that a Matrix class should not have knowledge > > > > > of the [distributed] architecture on which it runs, perhaps when > > > > > dealing with a distributed program architecture it is necessary > > > > > to create distributed-type classes explicitly > > > > > -- i.e. something like class Distributed_Matrix : public Matrix > > > > > {..};. Having said that, those are merely some > > > > > speculations which came to mind while trying to organize my > > > > > thoughts and present this question. It still remains unclear in my > > > > > mind, however. Something tells me that there must be a better > > > > > way to deal with this. Hopefully, people with more brains and > > > > > experience can provide me with some insight and guidance. > > > > > > > > I hope the description above sheds some light on it; the matrix > > > > class doesn't > > > > need any locality information, unless you want to create a truly > > > > distributed > > > > data structure (as opposed to just a regular container that is sent > > > > over the > > > > wire). > > > > > > I don't think that is what I want to do... It would seem to me that a > > > Matrix class should be completely agnostic (or at least as completely > > > as possible) of the environment in which it is used. > > > > > > > > I would greatly appreciate any suggestions that you can offer. If > > > > > you require further details of my code, please let me know and I'd > > > > > be more than happy to elaborate further. However, I think that the > > > > > problem itself is fairly generic and is relevant to most code > > > > > which is written for a distributed environment - especially where > > > > > the parallelism isn't handled explicitly in the code (as opposed > > > > > to an MPI program, for example, where this is far more > > straightforward). > > > > > > > > > > Thanks and best regards, > > > > > Shmuel Levine > > > > > > > > > > > > > > > [1] The actual code is slightly more complicated than the above > > > > > description, although I don't think that it changes the nature of > > > > > the question or the appropriate solution significantly. In > > > > > particular, each set of parameters is typically a > > > > > std::vector<Matrix>, where each Matrix is a different size. In > > > > > other words, the code uses multiple matrix sizes, although the > > > > > number of different sizes is constrained to the dimension of the > > > > > parameter vector above.
The actual allocator definition is as > follows: > > > > > > > > > > class Matrix_Allocator { > > > > > > > > > > public: > > > > > using T = float; > > > > > using data_type = T; > > > > > static const int64_t alignment = 64; > > > > > > > > > > private: > > > > > using mutex_type = hpx::lcos::local::spinlock; > > > > > using free_list_type = std::map<int64_t, std::stack<T *>>; > > > > > using allocation_list_type = std::map<T *, int64_t>; > > > > > > > > > > public: > > > > > Matrix_Allocator() {} > > > > > ~Matrix_Allocator(); > > > > > Matrix_Allocator(Matrix_Allocator const &) = delete; > > > > > Matrix_Allocator(Matrix_Allocator &&) = delete; > > > > > > > > > > static T *allocate(int64_t n); > > > > > static void deallocate(T *p); > > > > > > > > > > private: > > > > > static mutex_type mtx_; > > > > > static free_list_type free_list_; > > > > > static allocation_list_type allocation_list_; > > > > > > > > > > }; // class Matrix_Allocator > > > > > > > > > > The allocation_list_ is used to track the allocated size of a > > > > > given pointer, to determine to which free_list should the pointer > > > > > be added upon destruction of a matrix. > > > > > > > > > > _______________________________________________ > > > > > hpx-users mailing list > > > > > [email protected] > > > > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users > > > > > > > > -- > > > > Thomas Heller > > > > Friedrich-Alexander-Universität Erlangen-Nürnberg Department > > > > Informatik - Lehrstuhl Rechnerarchitektur Martensstr. 3 > > > > 91058 Erlangen > > > > Tel.: 09131/85-27018 > > > > Fax: 09131/85-27912 > > > > Email: [email protected] > > > > _______________________________________________ > > > > hpx-users mailing list > > > > [email protected] > > > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users > > > > > > _______________________________________________ > > > hpx-users mailing list > > > [email protected] > > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users > > > > > > _______________________________________________ > > hpx-users mailing list > > [email protected] > > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users > > _______________________________________________ > hpx-users mailing list > [email protected] > https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
