Hey Jeff,

> That works. The compute nodes need to talk to other compute nodes for
> MPI over one set of links, and they need to talk to the Lustre nodes for
> I/O, but over a different (disjoint) set of links. Thanks.

Is there a strong belief that a different/disjoint set of links would be
beneficial? Some time ago, Sasha and I iterated on a patch, during which I
found that sometimes not all switch ports would be used. In this
particular case, a chunk of leaf switches were sometimes using only 11 out
of 12 uplinks. After the fix, mpigraph showed about a 20% improvement in
MPI bandwidth.

It obviously depends on your cluster/environment/apps/user usage
pattern/etc. Livermore Lab's usage patterns will probably be different.

Al

On Thu, 2008-06-12 at 10:11 -0700, Jeff Becker wrote:
> Hi Al
>
> Al Chu wrote:
> > Hey Jeff,
> >
> > On Wed, 2008-06-11 at 09:43 -0700, Jeff Becker wrote:
> >
> >> Basically, we have an Altix ICE cluster connected by a pair of hypercube
> >> Infiniband fabrics. External to that, we have some Lustre nodes
> >> connected into the cluster with Infiniband. Our goal is to keep Lustre
> >> traffic separate from compute (MPI) traffic. Ideally, we'd have 2
> >> subnets and an IB router between the Lustre fabric and the compute
> >> fabric to accomplish this.
> >>
> >
> > I see. In your environment, the lustre storage servers are on the same
> > fabric as your compute nodes?
> >
> Right.
> >
> >> Barring that, I thought we could use partitions as follows: compute
> >> HCA's and switch ports are on both partitions with full membership in
> >> compute partition, and limited membership in I/O partition. The Lustre
> >> nodes and switches would only be in the I/O partition (full
> >> membership). That way, inter compute node (MPI) traffic would be
> >> disallowed from using routes through the I/O fabric (by partition
> >> membership), and I/O traffic could not interfere with compute (via
> >> separate partitions). Is this scheme feasible?
> >>
> >> If that's not possible, the next idea is to modify OpenSM to assign
> >> large weights to the links between the compute and I/O fabrics, so that
> >> the MinHop algorithm would never consider using these links for
> >> inter-compute node traffic.
> >>
> >
> > So dedicating (for example) X out of Y uplinks for MPI only and the
> > remaining uplinks for lustre only?
> >
> That works. The compute nodes need to talk to other compute nodes for
> MPI over one set of links, and they need to talk to the Lustre nodes for
> I/O, but over a different (disjoint) set of links. Thanks.
>
> -jeff
> > Al
> >
> >
> >> Thoughts? Thanks.
> >>
> >> -jeff
> >>
> >> Al Chu wrote:
> >>
> >>> Hey Jeff,
> >>>
> >>> Out of my curiosity, are you just trying to change the routing to
> >>> improve job performance? i.e. lustre nodes get special routing vs.
> >>> compute nodes?
> >>>
> >>> Al
> >>>
> >>> On Tue, 2008-06-10 at 15:08 -0700, Jeff Becker wrote:
> >>>
> >>>
> >>>> Hi all. I was looking into doing some subnet partitioning to separate
> >>>> compute nodes from Lustre nodes, and I saw the following in
> >>>> ~sashak/management.git on the OFA server, in
> >>>> opensm/doc/OpenSM_PKey_Mgr.txt
> >>>>
> >>>> OpenSM Partition Management
> >>>> ---------------------------
> >>>>
> >>>> Roadmap:
> >>>> Phase 1 - provide partition management at the EndPort (HCA, Router and
> >>>> Switch
> >>>> Port 0) level with no routing affects.
> >>>> Phase 2 - routing engine should take partitions into account.
> >>>> ...
> >>>> Phase 2 functionality:
> >>>>
> >>>> The partition policy should be considered during the routing such that
> >>>> links are associated with particular partition or a set of
> >>>> partitions. Policy should be enhanced to provide hints for how to do
> >>>> that (correlating to QoS too). The exact algorithm is TBD.
> >>>>
> >>>>
> >>>> What is the status of Pkey-aware routing? Thanks.
> >>>>
> >>>> -jeff
> >>>>
> >>>> _______________________________________________
> >>>> general mailing list
> >>>> [email protected]
> >>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>>>
> >>>> To unsubscribe, please visit
> >>>> http://openib.org/mailman/listinfo/openib-general
> >>>>
> >>>>
>
-- 
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
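
For readers of the thread, the link-weight idea quoted above (large weights on the links between the compute and I/O fabrics so that inter-compute traffic never picks them) can be sketched as a toy example. This is not OpenSM code: it uses plain Dijkstra as a stand-in for a weight-aware MinHop, and the switch names (c1..c4, io), the topology, and the weight value 100 are all hypothetical, chosen only so that the unweighted shortest path would cut through the I/O switch.

```python
import heapq

def shortest_path(links, src, dst):
    """Dijkstra over an undirected weighted graph; returns (cost, path) or None."""
    graph = {}
    for (a, b), weight in links.items():
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None

def build_links(io_weight):
    """Hypothetical fabric: four compute switches in a chain, plus one I/O
    switch attached to both ends of the chain."""
    compute_links = [("c1", "c2"), ("c2", "c3"), ("c3", "c4")]
    io_links = [("c1", "io"), ("c4", "io")]
    links = {link: 1 for link in compute_links}
    links.update({link: io_weight for link in io_links})
    return links

# All links weighted equally: c1 -> c4 shortcuts through the I/O switch.
print(shortest_path(build_links(1), "c1", "c4"))    # (2, ['c1', 'io', 'c4'])
# Heavy compute<->I/O links: c1 -> c4 stays on the compute fabric.
print(shortest_path(build_links(100), "c1", "c4"))  # (3, ['c1', 'c2', 'c3', 'c4'])
# Traffic destined for the I/O switch still uses the heavy link.
print(shortest_path(build_links(100), "c1", "io"))  # (100, ['c1', 'io'])
```

In this toy model the weight only needs to exceed the longest compute-only path for inter-compute routes to avoid the I/O links. Whether that buys real bandwidth on a given fabric is exactly the open question raised in the reply above.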
