On Aug 29, 2008, at 5:52 PM, Eugene Loh wrote:
I'm looking at the sm BTL.
Excellent! I hope you had a good dash of parmesan with that spaghetti
code in there (the sm btl is among the hairiest sections in
OMPI...). :-)
In mca_btl_sm_add_procs(), there's a loop over peer processes, with
a call to ompi_fifo_init(). That is, one call to ompi_fifo_init()
for each connection
[snip]
on page boundaries.
I *believe* your analysis is correct. It's been a while since I've
looked in detail at that section of code, but what you say sounds
reasonable.
As the number of local processes increases, these per-connection
allocations therefore become very costly. For 8K pages, for
example, and 100 on-node processes, we're talking 3*100*100*8K = 240
Mbytes. For 512 on-node processes (yes, we have nodes this big),
that's 6 Gbyte... most of which is unused. (E.g., allocating more
than an 8K page when we only need 64 or 12 bytes.)
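
The arithmetic above can be checked with a minimal sketch. It assumes,
as the figures in the message imply, three page-sized allocations per
directed connection and n*n connections for n on-node processes. The
function name and parameters are hypothetical and are not part of the
sm BTL. With 8K taken as 8192 bytes the results come out near 246
million bytes (roughly the 240 Mbytes quoted) and exactly 6 GiB.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not an OMPI function: total bytes consumed when
 * each of the nprocs*nprocs directed connections makes allocs_per_conn
 * allocations, each padded out to a full page. */
uint64_t sm_fifo_footprint(uint64_t nprocs, uint64_t allocs_per_conn,
                           uint64_t page_size)
{
    assert(page_size > 0);
    return allocs_per_conn * nprocs * nprocs * page_size;
}
```

Calling sm_fifo_footprint(100, 3, 8192) and
sm_fifo_footprint(512, 3, 8192) reproduces the two cases in the message;
the quadratic term in nprocs is what dominates on large nodes.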
Okay, long intro. Let me start with a short question: do we really
need page alignment for these allocations? Would cacheline
alignment be okay?
I believe the main rationale for page alignment was memory affinity,
since (at least on Linux; I don't know about Solaris) you can only
set affinity on whole pages.
On your big 512 proc machines, I'm assuming that the page memory
affinity will matter...?
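
To put a number on the page-versus-cacheline trade-off, here is a small
sketch of the usual round-up-to-alignment computation. The helper name
is hypothetical, and the 64-byte cacheline used in the note below is an
assumption (the real value is platform-dependent); this is not the code
the sm BTL uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round size up to the next multiple of alignment.
 * alignment must be a nonzero power of two (e.g. 64 or 8192). */
size_t align_up(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}
```

For a 12-byte structure, align_up(12, 8192) reserves a full 8192-byte
page, while align_up(12, 64) reserves 64 bytes. The price of the smaller
footprint is that several connections then share one page, which is
exactly where the affinity question comes in.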
That being said, we're certainly open to making things better. E.g.,
if a few procs share a memory locality (can you detect that in
Solaris?), have them share a page or somesuch...? (totally open to
ideas here)
--
Jeff Squyres
Cisco Systems