Richard Graham wrote:
> Got it. I'm totally fine with that. Separate cache lines, and put the
> memory close to the process that is using it.

Problematic concept, but ... okay, I'll read on.

> When the code first went in, there was no explicit memory affinity
> implemented, so first-touch was relied on to get the memory in the
> "correct" location.

Okay.

> If I remember correctly, the head and the tail are each written to by a
> different process, and are where the pointers and counters used to
> manage the FIFO are maintained. They need to be close to the writer,
> and on separate cache lines, to avoid false sharing.

Why close to the writer (versus the reader)?

Anyhow, so far as I can tell, the 2-d structure ompi_fifo_t
fifo[receiver][sender] is organized by receiver. That is, the main
ompi_fifo_t FIFO data structures are local to receivers. But then, each
FIFO is initialized (that is, the circular buffers and associated
allocations) by senders. E.g.,

https://svn.open-mpi.org/trac/ompi/browser/branches/v1.3/ompi/mca/btl/sm/btl_sm.c?version=19785#L537

In the call to ompi_fifo_init(), all the circular-buffer (CB) data
structures are allocated by the sender. On different cache lines --
even different pages -- but all by the sender. Specifically, one
accesses the FIFO on the receiver side and then follows pointers to the
sender's side. It doesn't matter whether you're talking about the head,
tail, or queue.

> The queue itself is accessed most often by the reader,

You mean because it's polling often, but the writer writes only once?

> so it should be closer to the reader.

Are there measurements to substantiate this? It seems to me that in a
cache-based system, a reader could poll a remote location all it wanted
and there would be traffic only if the cached copy were invalidated.
Conceivably, a transfer could go cache-to-cache and not hit memory at
all. I tried some measurements and found no difference for any
location -- close to the writer, close to the reader, or far from both.

> I honestly don't remember much about the wrapper -- I would have to go
> back to the code to look at it. If we no longer allow multiple FIFOs
> per pair, the wrapper layer can go away -- it is there to manage
> multiple FIFOs per pair.

There is support for multiple circular buffers per FIFO.

> As far as granularity of allocation -- it needs to be large enough to
> accommodate the smallest shared-memory hierarchy, so I suppose in the
> most general case this may be the tertiary cache?

I don't get this. I understand how certain things should be on separate
cache lines. Beyond that, we just figure out what should be local to a
process and allocate all those things together. That takes us from
3*n*n allocations (and pages) down to just n of them. There is no
reason not to allocate objects that need to be associated with the same
process on the same page, as long as one avoids false sharing.

> Got it. So it seems like each process could have all of its receive
> FIFOs on the same page, and these could also share the page with
> either the heads or the tails of each queue.

I will propose some specifics and run them by y'all. I think I know
enough to get started. Thanks for the comments.
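To make the layout under discussion concrete, here is a minimal,
hypothetical C sketch. It is not the actual ompi_fifo_t code; all type
names, sizes, and the single-process "sender/receiver" demo are made up
for illustration. It shows the two points above: padding the
sender-written head index and the receiver-written tail index onto
separate cache lines to avoid false sharing, and aggregating all of one
process's receive FIFOs into a single per-process allocation, i.e. n
allocations instead of 3*n*n:

    /*
     * Hypothetical sketch only -- not the actual ompi_fifo_t code.
     * Shows (1) the sender-written head and receiver-written tail on
     * separate cache lines to avoid false sharing, and (2) all of one
     * receiver's FIFO state aggregated into a single per-process
     * allocation: n allocations instead of 3*n*n.
     */
    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define CACHELINE 64    /* assumed cache-line size */
    #define NPROCS    4     /* n: number of processes */
    #define QDEPTH    128   /* slots per circular buffer */

    typedef struct {                 /* written by the sender only */
        volatile unsigned long head;
        char pad[CACHELINE - sizeof(unsigned long)];
    } fifo_head_t;

    typedef struct {                 /* written by the receiver only */
        volatile unsigned long tail;
        char pad[CACHELINE - sizeof(unsigned long)];
    } fifo_tail_t;

    typedef struct {                 /* one sender->receiver queue */
        fifo_head_t head;
        fifo_tail_t tail;
        void * volatile queue[QDEPTH];
    } fifo_t;

    typedef struct {                 /* everything local to one receiver */
        fifo_t recv_fifo[NPROCS];    /* indexed by sender */
    } proc_local_t;

    int main(void)
    {
        /* In the real code these would be carved out of the mmap'd
         * shared-memory segment; first-touch (or explicit affinity)
         * would place each block near its receiver. */
        proc_local_t *local[NPROCS];
        for (int i = 0; i < NPROCS; i++) {
            if (posix_memalign((void **)&local[i], CACHELINE,
                               sizeof(proc_local_t)) != 0)
                return 1;
            memset((void *)local[i], 0, sizeof(proc_local_t));
        }

        /* "Sender" 1 posts a fragment to "receiver" 0: write the
         * slot, then advance the sender-owned head index. */
        static int frag = 42;
        fifo_t *f = &local[0]->recv_fifo[1];
        f->queue[f->head.head % QDEPTH] = &frag;
        f->head.head++;

        /* Receiver 0 polls: coherence traffic occurs only when the
         * cached copy of head is invalidated by the sender's write. */
        while (f->tail.tail == f->head.head)
            ;                        /* spin */
        int *got = f->queue[f->tail.tail % QDEPTH];
        f->tail.tail++;
        printf("received %d\n", *got);

        for (int i = 0; i < NPROCS; i++)
            free(local[i]);
        return 0;
    }

Note that in this layout the sender-written head lives in the
receiver's block; per the measurements mentioned above, its exact
placement may not matter much on a cache-coherent system, so long as
head and tail stay on separate cache lines.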