Hi Al
Al Chu wrote:
Hey Jeff,
That works. The compute nodes need to talk to other compute nodes for
MPI over one set of links, and they need to talk to the Lustre nodes for
I/O, but over a different (disjoint) set of links. Thanks.
Is there a strong belief that a different/disjoint set of links would be
beneficial? Some time ago, Sasha and I iterated on a patch after I
found that sometimes not all switch ports were being used. In that
particular case, a chunk of leaf switches were sometimes using only 11
out of 12 uplinks. After the fix, mpigraph showed about a 20% improvement
in MPI bandwidth.
Basically, we want to avoid situations where I/O and MPI contend for the
same links, and get in each other's way.
-jeff
It obviously depends on your cluster/environment/apps/user usage
pattern/etc. Livermore Lab's usage patterns will probably be different.
Al
On Thu, 2008-06-12 at 10:11 -0700, Jeff Becker wrote:
Hi Al
Al Chu wrote:
Hey Jeff,
On Wed, 2008-06-11 at 09:43 -0700, Jeff Becker wrote:
Basically, we have an Altix ICE cluster connected by a pair of hypercube
InfiniBand fabrics. External to that, we have some Lustre nodes
connected into the cluster with InfiniBand. Our goal is to keep Lustre
traffic separate from compute (MPI) traffic. Ideally, we'd have 2
subnets and an IB router between the Lustre fabric and the compute
fabric to accomplish this.
I see. In your environment, the Lustre storage servers are on the same
fabric as your compute nodes?
Right.
Barring that, I thought we could use partitions as follows: compute
HCAs and switch ports are on both partitions, with full membership in
the compute partition and limited membership in the I/O partition. The
Lustre nodes and switches would only be in the I/O partition (full
membership). That way, inter compute node (MPI) traffic would be
disallowed from using routes through the I/O fabric (by partition
membership), and I/O traffic could not interfere with compute (via
separate partitions). Is this scheme feasible?
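To make the membership part of this scheme concrete, here is a minimal Python sketch of the P_Key matching rule it relies on: two ports can exchange traffic in a partition only if their P_Keys share the same 15-bit base and at least one of them is a full member. The partition numbers and node tables below are hypothetical, chosen only for illustration; this is not OpenSM code. Note that this models who may talk to whom, not which links a route uses.

```python
# Sketch of the P_Key matching rule behind the proposed partition scheme.
# ASSUMPTION: the partition numbers and node tables are invented examples.

FULL_BIT = 0x8000   # top bit of a 16-bit P_Key marks full membership
BASE_MASK = 0x7FFF  # low 15 bits identify the partition

COMPUTE = 0x0001    # hypothetical compute partition number
IO = 0x0002         # hypothetical I/O partition number


def pkey(base, full):
    """Build a 16-bit P_Key from a partition base and a membership type."""
    return (base & BASE_MASK) | (FULL_BIT if full else 0)


def can_talk(table_a, table_b):
    """True if some pair of P_Keys shares a base and at least one is full."""
    for a in table_a:
        for b in table_b:
            same_partition = (a & BASE_MASK) == (b & BASE_MASK)
            one_is_full = bool((a | b) & FULL_BIT)
            if same_partition and one_is_full:
                return True
    return False


# Compute nodes: full in the compute partition, limited in the I/O partition.
compute_node = [pkey(COMPUTE, True), pkey(IO, False)]
# Lustre nodes: full members of the I/O partition only.
lustre_node = [pkey(IO, True)]
# A port holding only a limited I/O key, to show the limited-limited case.
io_limited_only = [pkey(IO, False)]

compute_to_compute = can_talk(compute_node, compute_node)
compute_to_lustre = can_talk(compute_node, lustre_node)
limited_to_limited = can_talk(io_limited_only, io_limited_only)
```

In this sketch, compute nodes reach each other through the compute partition and reach Lustre nodes through the I/O partition, while two limited members of the I/O partition cannot talk to each other in it.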
If that's not possible, the next idea is to modify OpenSM to assign
large weights to the links between the compute and I/O fabrics, so that
the MinHop algorithm would never consider using these links for
inter-compute node traffic.
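The effect of the link-weight idea can be sketched with a small weighted shortest-path search. This is plain Dijkstra in Python over a made-up five-switch topology, not OpenSM's MinHop implementation; the switch names, the topology, and the weight of 100 are all assumptions for illustration.

```python
# Sketch: heavy weights on compute<->I/O links steer inter-compute routes
# away from the I/O fabric. ASSUMPTION: topology and weights are invented,
# and this is generic Dijkstra, not OpenSM's routing engine.
import heapq

compute_links = [("c1", "c3"), ("c3", "c4"), ("c4", "c2")]
io_links = [("c1", "io1"), ("io1", "c2")]  # links into the I/O fabric


def build(io_weight):
    """Return {link: weight}, with compute links at 1 and I/O links at io_weight."""
    weights = {link: 1 for link in compute_links}
    weights.update({link: io_weight for link in io_links})
    return weights


def shortest_path(links, src, dst):
    """Lowest-cost path from src to dst over undirected weighted links."""
    graph = {}
    for (a, b), w in links.items():
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    best = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            new_cost = cost + w
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


unit_route = shortest_path(build(1), "c1", "c2")        # all links equal
weighted_route = shortest_path(build(100), "c1", "c2")  # I/O links penalized
lustre_route = shortest_path(build(100), "c1", "io1")   # I/O still reachable
```

With unit weights the shorter route between c1 and c2 cuts through io1; with the I/O links weighted at 100 it stays on the compute links, while c1 can still reach io1 directly. In this sketch a large weight only makes those links a last resort: they would still be used if no compute-only path existed.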
So dedicating (for example) X out of Y uplinks for MPI only and the
remaining uplinks for Lustre only?
That works. The compute nodes need to talk to other compute nodes for
MPI over one set of links, and they need to talk to the Lustre nodes for
I/O, but over a different (disjoint) set of links. Thanks.
-jeff
Al
Thoughts? Thanks.
-jeff
Al Chu wrote:
Hey Jeff,
Out of curiosity, are you just trying to change the routing to
improve job performance? i.e. Lustre nodes get special routing vs.
compute nodes?
Al
On Tue, 2008-06-10 at 15:08 -0700, Jeff Becker wrote:
Hi all. I was looking into doing some subnet partitioning to separate
compute nodes from Lustre nodes, and I saw the following in
~sashak/management.git on the OFA server, in opensm/doc/OpenSM_PKey_Mgr.txt
OpenSM Partition Management
---------------------------
Roadmap:
Phase 1 - provide partition management at the EndPort (HCA, Router and Switch
Port 0) level with no routing effects.
Phase 2 - routing engine should take partitions into account.
...
Phase 2 functionality:
The partition policy should be considered during the routing such that
links are associated with particular partition or a set of
partitions. Policy should be enhanced to provide hints for how to do
that (correlating to QoS too). The exact algorithm is TBD.
What is the status of Pkey-aware routing? Thanks.
-jeff
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general