On 7/18/13 7:39 PM, "Ralph Castain" <r...@open-mpi.org> wrote:

> We are looking at exascale requirements, and one of the big issues is memory
> footprint. We currently retrieve the endpoint info for every process in the
> job, plus all the procs in any communicator with which we do a connect/accept -
> even though we probably will only communicate with a small number of them. This
> wastes a lot of memory at scale.

> As long as we are re-working the endpoint stuff, would it be worth going
> ahead and changing how we handle the above? I'm looking to switch to a lazy
> definition approach where we compute endpoints for procs on first message
> instead of during MPI_Init, retrieving the endpoint info for that proc only at
> that time. So instead of every proc storing the endpoint info for every other
> proc, each proc would contain only the info it requires for that application.

It depends on what you mean by endpoint information.  If you mean what I call 
endpoint information (the stuff the PML/MTL/BML stores on an ompi_proc_t), then 
I really don't care.  For Portals, the endpoint information is quite small 
(8-16 bytes, depending on addressing mode), so I'd rather pre-populate the 
array than slow down the send path with yet another conditional to check for 
endpoint data.  Of course, given the Portals usage model, I'd really like to 
jam the endpoint data into shared memory at some point (not in this patch).  If 
others want to figure out how to do lazy endpoint setup for their network, I 
think that's reasonable.

> Ideally, I'd like to see that extended to the ompi_proc_t array itself - maybe
> changing it to a sparse array/list of some type, so we only create that storage
> for procs we actually communicate with.

This would actually break a whole lot of things in OMPI and is a huge change.  
However, I still have plans to add a --enable-minimal-memory type option some 
day that would make the ompi_proc_t significantly smaller by assuming 
homogeneous convertors and programmatically retrieving a remote host name when 
needed.  Again, unless we need to get micro-small (and I don't think we do), 
the sparseness requires conditionals in the critical path that worry me.

> If you'd prefer to discuss this as a separate issue, that's fine - just
> something we need to work on at some point in the next year or two.

I agree some work is needed, but I think it's orthogonal to this issue and is 
something we're going to need to study in detail.  There are a number of 
space/time tradeoffs in that path.  Which isn't a problem, but there's a whole 
lot of low-hanging fruit before we get to the hard stuff.  Now if you want the 
OFED interfaces to run at exascale, well, buy lots of memory.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories
