Hi Al,

Al Chu wrote:
On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
Hey Yevgeny,
On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
Hi Al,

Al Chu wrote:
Hey Sasha,

This is a conceptually simple option I've developed for updn routing.

Currently in updn routing, nodes/guids are routed on switches in a
seemingly-random order, which I believe is due to internal data
structure organization (i.e. cl_qmap_apply_func is called on
port_guid_tbl) as well as how the fabric is scanned (it is logically
scanned from a port perspective, but it may not be logical from a node
perspective).  I had a hypothesis that this was leading to increased
contention in the network for MPI.

For example, suppose we have 12 uplinks from a leaf switch to a spine
switch.  If we want to send data from this leaf switch to node[13-24],
the uplinks we will send on are pretty random, because:

A) node[13-24] are individually routed at seemingly-random points,
depending on when each is visited by cl_qmap_apply_func().

B) the ports chosen for routing are the least-used ones so far.

C) least used port usage is based on whatever was routed earlier on.
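To make the effect concrete, here's a minimal sketch (hypothetical
Python, not the actual OpenSM code) of how least-used-port selection
makes the uplink a node gets depend entirely on routing order:

```python
# Hypothetical illustration of least-used-port selection: the uplink a
# GUID is assigned depends entirely on what was routed before it.

def route(guids, num_uplinks=12):
    usage = [0] * num_uplinks           # per-uplink route count
    assignment = {}
    for guid in guids:                  # routing order matters here
        port = usage.index(min(usage))  # least-used uplink wins
        usage[port] += 1
        assignment[guid] = port
    return assignment

# Routed in node order, node1..node12 fill uplinks 0..11, then
# node13..node24 wrap around and again spread across all 12 uplinks.
ordered = route([f"node{i}" for i in range(1, 25)])
```

If the guids arrive in an arbitrary order instead, which uplink a given
node lands on is effectively unpredictable, even though each uplink
still carries the same total number of routes.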

So I developed this patch series, which adds an option called
"guid_routing_order_file" that lets the user supply a file with a
list of port_guids indicating the order in which guids are routed
(naturally, guids not listed are routed last).
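For illustration, such a file is just one port GUID per line,
first-routed first (these GUID values are made up):

```text
0x0002c90300001a11
0x0002c90300001a22
0x0002c90300001a33
```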
Great idea!
Thanks.

I understand that this guid_routing_order_file is synchronized with
an MPI rank file, right? If not, then synchronizing them might give
even better results.
Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
responsible for MPI ranks, so I order the guids in my file according to
how slurm is configured for choosing MPI ranks.  I will admit to being a
novice at MPI configuration (blindly accepting slurm MPI rankings).
Is there an underlying file that MPI libs use for ranking knowledge?

I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
can input a file telling it how ranks should be assigned to nodes.  I
assume that's what you're talking about?

Yes, that is what I was talking about.
There is a host file, where you list all the hosts that MPI should use,
and in some MPIs there is also a way to specify the order of MPI ranks
assigned to processes (I'm not an MPI expert, so I'm not sure about the
terminology that I use).
I know that MVAPICH uses the host order when assigning ranks, so the
order of the cluster nodes listed in the host file is important.
Not sure about OpenMPI.

Another idea: OpenSM could create such a file (a list, really - it
doesn't have to be an actual file) automatically, just by checking
topologically-adjacent leaf switches and their HCAs.
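As a rough sketch of that idea (hypothetical Python - the names `Leaf`
and `guid_order` are made up, and a real implementation would walk
OpenSM's internal topology structures instead):

```python
from collections import namedtuple

# Toy leaf-switch record; the input list is assumed to be pre-sorted
# so that topologically-adjacent leaves are consecutive.
Leaf = namedtuple("Leaf", ["name", "hca_port_guids"])

def guid_order(leaf_switches):
    # Emit HCA port guids leaf by leaf, so hosts on the same leaf
    # (and on adjacent leaves) end up consecutive in the order.
    order = []
    for sw in leaf_switches:
        order.extend(sorted(sw.hca_port_guids))
    return order

leaves = [Leaf("L1", [0x1A02, 0x1A01]), Leaf("L2", [0x1B01])]
# guid_order(leaves) -> [0x1A01, 0x1A02, 0x1B01]
```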
Definitely a good idea.  This patch set was just a "step one" kind of
thing.

I list the port guids of the nodes of the cluster from node0 to nodeN,
one per line in the file.  By listing the nodes in this order, I believe
we could get less contention in the network.  In the example above,
sending to node[13-24] should use all 12 of the uplinks, because the
ports will be equally used: node[1-12] were routed beforehand, in order.

The results from some tests are pretty impressive when I do this. LMC=0
average bandwidth in mpiGraph goes from 391.374 MB/s to 573.678 MB/s
when I use guid_routing_order.
Can you compare this to the fat-tree routing?  Conceptually, fat-tree
is doing the same - it routes LIDs on nodes in a topological order, so
it would be interesting to see the comparison.
Actually I already did :-).  w/ LMC=0.

updn default - 391.374 MB/s
updn w/ guid_routing_order - 573.678 MB/s
ftree - 579.603 MB/s

I later discovered that one of the internal ports of the cluster I'm
testing on was broken (sLB of a 288-port), and I think that is the cause
of some of the slowdown w/ updn w/ guid_routing_order.  So ftree (as
designed) seemed able to work around it properly, while updn (as
currently implemented) couldn't.

When we turn on LMC > 0, MPI libraries that are LMC-aware were able
to do better on some tests than ftree.  One example (I think these
numbers are in microseconds; lower is better):

Alltoall 16K packets
ftree - 415490.6919
updn normal (LMC=0) - 495460.5526
updn w/ ordered routing (LMC=0) - 416562.7417
updn w/ ordered routing (LMC=1) - 453153.7289
 - this ^^^ result is quite odd.  Not sure why.
updn w/ ordered routing (LMC=2) - 3660132.1530

We are regularly debating what will be better overall at the end of the
day.

Also, fat-tree produces the guid order file automatically, but nobody
has used it yet as input to produce an MPI rank file.
I didn't know about this option.  How do you do this (just skimmed the
manpage, didn't see anything)?

Right, it's missing there. I'll add this info.
The file is /var/log/opensm-ftree-ca-order.dump.
Small correction though - the file contains an ordered list of HCA LIDs
and their host names.  It wouldn't be a problem to change it to include
guids as well, but MPI doesn't need guids anyway.
Note that the optimal order might be different depending on the current
topology state and the location of the management node that runs OpenSM.
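If someone does want to feed that dump into an MPI host file, a rough
sketch might look like this (hypothetical - it assumes each line holds
an HCA LID followed by a host name, whitespace-separated; the actual
dump format may differ):

```python
# Hypothetical sketch: extract an ordered host list from the ftree
# ca-order dump.  Assumes each non-empty line holds an HCA LID and a
# host name, whitespace-separated; the real format may differ.
def hosts_from_dump(text):
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            hosts.append(fields[-1])   # take the host-name field
    return hosts

sample = "0x0001 node1\n0x0002 node2\n"
# hosts_from_dump(sample) -> ["node1", "node2"]
```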

I know about the --cn_guid_file.  But since that file doesn't have to
be ordered, I created a different option (rather than have cn_guid_file
serve both ftree and updn).

Right, the cn file doesn't have to be ordered - ftree will order it
by itself. The ordering is by topology-adjacent leaf switches.

-- Yevgeny


Al

-- Yevgeny

A variety of other performance improvements were seen in other tests,
with other MPIs and other LMC values, if anyone is interested.

BTW, I developed this patch series before your preserve-base-lid patch
series.  It will 100% conflict with that series.  I will fix this patch
series once the preserve-base-lid series is committed to git.  I'm just
looking for comments right now.

Al


_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
