On 3/8/2018 12:56 AM, Anatoly Burakov wrote:
This enables multiprocess synchronization for memory hotplug
requests at runtime (as opposed to initialization).

The basic workflow is as follows. The primary process always does the
initial mapping and unmapping, and secondary processes always mirror
the primary's page map. Only one allocation request can be active at
any one time.

When the primary allocates memory, it ensures that all other processes
have mapped the same set of hugepages successfully; otherwise, any
allocations already made are rolled back and the corresponding heap
space is freed. The heap is locked for the duration of the operation,
so no race conditions can occur.
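A rough sketch of that all-or-nothing step (illustrative C only;
`sync_with_secondary` and `primary_alloc` are hypothetical stand-ins,
not DPDK APIs):

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative sketch, not DPDK code: each "secondary" is simulated by
 * a flag saying whether its hugepage mapping would succeed. */
#define NUM_SECONDARIES 3

/* Stand-in for the IPC round-trip that asks a secondary to map the
 * same set of hugepages the primary just allocated. */
static bool
sync_with_secondary(int id, bool will_succeed)
{
	(void)id;
	return will_succeed;
}

/* Returns true only if every secondary mapped the pages; otherwise
 * rolls back (here, merely reported) and returns false. */
static bool
primary_alloc(const bool secondary_ok[NUM_SECONDARIES])
{
	for (int i = 0; i < NUM_SECONDARIES; i++) {
		if (!sync_with_secondary(i, secondary_ok[i])) {
			/* roll back: secondaries 0..i-1 unmap, then the
			 * primary frees the pages and the heap space */
			for (int j = 0; j < i; j++)
				printf("rollback: secondary %d unmaps\n", j);
			return false;
		}
	}
	return true; /* pages now safely part of the shared heap */
}
```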

When the primary frees memory, it frees the heap element, deallocates
the affected pages, and notifies other processes of the deallocation.
Since the memory chunk is removed from the heap first, the area
becomes invisible to other processes even if they fail to unmap that
specific set of pages, so it is completely safe to ignore the results
of the sync requests.
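The ordering that makes this safe can be condensed into a sketch
(illustrative names, not DPDK symbols): the chunk leaves the heap
before any sync message is sent, so a failed sync can only leak pages,
never expose them.

```c
#include <stdbool.h>

/* Illustrative model, not DPDK code: a region tracked by whether the
 * heap still sees it and whether a secondary still has it mapped. */
struct region {
	bool in_heap;
	bool mapped_in_secondary;
};

static void
primary_free(struct region *r, bool sync_ok)
{
	r->in_heap = false;	/* 1: remove the chunk from the heap */
	/* 2: deallocate the pages in the primary (elided) */
	if (sync_ok)		/* 3: best-effort sync to secondaries */
		r->mapped_in_secondary = false;
	/* the sync result is deliberately ignored: either way, the
	 * region can no longer be allocated from */
}
```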

When a secondary allocates memory, it does not do so by itself.
Instead, it sends a request to the primary process to try to allocate
pages of the specified size on the specified socket, so that the
pending heap allocation request can complete. The primary process then
sends all secondaries (including the requestor) a separate
notification of the allocated pages, and expects every secondary
process to report success before considering the pages "allocated".

Only after the primary process has ensured that all memory was
successfully mapped in every secondary process does it respond
positively to the initial request and let the secondary proceed with
the allocation. Since the heap now contains memory that can satisfy
the allocation request, and it was locked the entire time (so no other
allocations could take place), the secondary process is guaranteed to
be able to allocate from the heap.
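Condensed into a sketch (hypothetical names, not the actual DPDK IPC
API), the primary-side handling of a secondary's request looks like:

```c
#include <stdbool.h>

#define N_SEC 2	/* number of secondary processes in this toy model */

/* Primary-side handler, illustrative only: map the pages, notify every
 * secondary (including the requestor), and succeed only on unanimous
 * acknowledgement. Each ack is simulated by a flag. */
static bool
handle_secondary_request(const bool acks[N_SEC])
{
	/* 1: primary maps the requested pages (elided) */
	/* 2: broadcast "pages allocated" and wait for replies */
	for (int i = 0; i < N_SEC; i++)
		if (!acks[i])
			return false; /* roll back; requestor's malloc fails */
	/* 3: only now answer the original request positively; the heap
	 * was locked throughout, so the pending allocation must fit */
	return true;
}
```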

When a secondary frees memory, it first hides the pages to be
deallocated from the heap. It then sends a deallocation request to the
primary process, which deallocates the pages itself and sends a
separate sync request to all other processes (including the requestor)
to unmap the same pages. This way, even if the secondary fails to
notify other processes of the deallocation, that memory becomes
invisible to other processes and will not be allocated from again.

So, to summarize: address space only becomes part of the heap if the
primary process can ensure that all other processes have mapped this
memory successfully. If anything goes wrong, the worst that can happen
is that a page "leaks": it becomes available to neither DPDK nor the
system, because some process still holds a mapping to it. It is not an
actual leak, as we can still account for the page; it is just that
none of the processes can use it for anything useful until the primary
allocates from it again.

Due to the underlying DPDK IPC implementation being single-threaded,
some asynchronous magic had to be done: we need to complete several
requests before we can definitively allow a secondary process to use
the allocated memory (namely, it has to be present in all other
secondary processes before it can be used). Additionally, only one
allocation request may be submitted at a time.

Memory allocation requests are only allowed when no secondary
processes are currently initializing. To enforce that, a shared rwlock
is used: it is taken for read on init (so that several secondaries can
initialize concurrently) and for write when making allocation requests
(so that either secondary init has to wait, or the allocation request
has to wait until all processes have initialized).

Signed-off-by: Anatoly Burakov <anatoly.bura...@intel.com>
---

Notes:
     v2: - fixed deadlocking on init problem
         - reverted rte_panic changes (fixed by changes in IPC instead)

     This problem is evidently complex to solve without a multithreaded
     IPC implementation. An alternative approach would be to process
     each individual message in its own thread (or at least spawn a
     thread per incoming request) - that way, we can send requests
     while responding to another request, and this problem becomes
     trivial to solve (and in fact it was solved that way initially,
     before my aversion to certain other programming languages kicked
     in).

     Is the added complexity worth saving a couple of thread spin-ups
     here and there?

  lib/librte_eal/bsdapp/eal/Makefile                |   1 +
  lib/librte_eal/common/eal_common_memory.c         |  16 +-
  lib/librte_eal/common/include/rte_eal_memconfig.h |   3 +
  lib/librte_eal/common/malloc_heap.c               | 255 ++++++--
  lib/librte_eal/common/malloc_mp.c                 | 723 ++++++++++++++++++++++
  lib/librte_eal/common/malloc_mp.h                 |  86 +++
  lib/librte_eal/common/meson.build                 |   1 +
  lib/librte_eal/linuxapp/eal/Makefile              |   1 +
  8 files changed, 1040 insertions(+), 46 deletions(-)
  create mode 100644 lib/librte_eal/common/malloc_mp.c
  create mode 100644 lib/librte_eal/common/malloc_mp.h
...
+/* callback for asynchronous sync requests for primary. this will either do a
+ * sendmsg with results, or trigger rollback request.
+ */
+static int
+handle_sync_response(const struct rte_mp_msg *request,

Rename to handle_async_response()?
