Yes, thank you Marc, that was exactly the piece of information I was
looking for: among all the available routes, LNET is smart enough to
use them equally. So I can define all the routes on all my clients and
all my servers, and LNET will take care of using all the routers
equally.
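For the record, a minimal modprobe.conf sketch of such a configuration,
assuming two routers bridging tcp0 and o2ib0 (the interface names and
address ranges here are hypothetical, and the exact syntax should be
checked against your Lustre version):

   # On the tcp0 servers: reach o2ib0 through either router
   options lnet networks="tcp0(eth0)" routes="o2ib0 172.16.1.[100-101]@tcp0"

   # On the o2ib0 clients: reach tcp0 through either router
   options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.120.[100-101]@o2ib0"

   # On the routers: one interface on each network, forwarding enabled
   options lnet networks="tcp0(eth0),o2ib0(ib0)" forwarding="enabled"

With both gateways listed for the same remote network, LNET should
spread traffic across them, which is the equal use described below.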
Sebastien.

D. Marc Stearman wrote:
> The routing configuration is in /etc/modprobe.conf, and routes can
> also be dynamically added with lctl add_route. All of our Lustre
> servers are tcp servers, and we have client clusters that are tcp
> only, and we also have IB and Elan clusters. Let's say you want to
> add a router to one of the IB clusters:
>
> We'll assume that there is either a free port on the IB fabric, or we
> change a compute node into a router node by adding some 10 GigE
> hardware.
>
> The IB cluster is o2ib0.
> The Lustre cluster is using tcp0.
> The IB cluster routers have connections on o2ib0 and tcp0.
>
> Assuming you have an existing setup in place using either ip2nets or
> networks in your modprobe.conf, and that you have existing routes
> listed in the modprobe.conf, adding a router should be simple. On
> the client side, add the routes in the modprobe.conf, and on the
> Lustre servers add the routes to the modprobe.conf. On the new
> router, make sure it has the same modprobe.conf as the existing
> routers. This will ensure that the configuration works after a
> reboot. Since these are production clusters, we don't want to reboot
> any of them, so we need to add the routes dynamically. Let's say that
> the new router has IP address 172.16.1.100 on tcp0 and
> 192.168.120.100 on o2ib0; you would need to run the following
> commands:
>
> On Lustre servers:
> lctl --net o2ib0 add_route [EMAIL PROTECTED]
>
> On IB clients:
> lctl --net tcp0 add_route [EMAIL PROTECTED]
>
> The clients/servers will add the routes, and they will be down until
> you start LNET on the new router:
> service lnet start
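Reading the gateway NIDs back out of the addresses above, the dynamic
sequence would look roughly like this (the server NID in the final ping
is hypothetical, and lctl ping is only a suggested sanity check):

   # On the tcp0 Lustre servers: route to o2ib0 via the router's tcp NID
   lctl --net o2ib0 add_route 172.16.1.100@tcp0

   # On the o2ib0 clients: route to tcp0 via the router's o2ib NID
   lctl --net tcp0 add_route 192.168.120.100@o2ib0

   # On the new router itself: start LNET so the new route comes up
   service lnet start

   # From an IB client, check that a tcp0 server NID is now reachable
   lctl ping 172.16.1.1@tcp0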
> At LLNL, we have one large tcp0 that all the server clusters belong
> to, and LNET is smart enough to use all the routers equally, so once
> we add a new router, it just becomes part of the router pool for that
> cluster, thereby increasing the bandwidth of that cluster.
>
> In reality, we rarely add new routers. We typically spec out what we
> call scalable units, so when we add onto a compute cluster, we add a
> known chunk of compute servers, with a known number of routers. For
> example, if a scalable unit is 144 compute nodes and 4 IB/10 GigE
> routers, then we may buy 4 scalable units, ending up with 576 compute
> nodes and 16 Lustre routers.
>
> Hope that helps answer your question.
>
> -Marc
>
> ----
> D. Marc Stearman
> LC Lustre Administration Lead
> [EMAIL PROTECTED]
> 925.423.9670
> Pager: 1.888.203.0641
>
>
> On Apr 10, 2008, at 9:20 AM, Sébastien Buisson wrote:
>> Hello Marc,
>>
>> Thank you for this feedback. This is a very thorough description of
>> how you set up routers at LLNL.
>> Just one question, however: according to you, a simple way to
>> increase routing bandwidth is to add more Lustre routers, so that
>> they are not the bottleneck in the cluster. But at LLNL, how do you
>> deal with the Lustre routing configuration when you add new routers?
>> I mean, how is the network load balanced between all the routers? Is
>> it done in a dynamic way that supports adding or removing routers?
>>
>> Sebastien.
>>
>>
>> D. Marc Stearman wrote:
>>> Sebastien,
>>>
>>> For the most part we try to match the bandwidth of the disks to the
>>> network to the number of routers needed. I will be at the Lustre
>>> User Group meeting in Sonoma, CA at the end of this month giving a
>>> talk about Lustre at LLNL, including our network design and router
>>> usage, but here is a quick description.
>>>
>>> We have a large federated Ethernet core. We then have edge switches
>>> for each of our clusters that have links up to the core, and back
>>> down to the routers or tcp-only clients. In a typical situation, if
>>> we think one file system can achieve 20 GB/s based on disk
>>> bandwidth, we try to make sure that the filesystem cluster has
>>> 20 GB/s of network bandwidth (1 GigE, 10 GigE, etc.), and that the
>>> routers for the compute cluster total up to 20 GB/s as well. So we
>>> may have a server cluster with servers having dual GigE links, and
>>> routers with 10 GigE links, and we just try to match them up so the
>>> numbers are even. Typically, the routers in a cluster are the same
>>> node type as the compute cluster, just populated with additional
>>> network hardware.
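As a rough worked example of that matching, using the figures above
(the usable-rate assumptions are mine, not LLNL's):

   # Target: 20 GB/s of disk bandwidth, matched on both sides
   #   Routers: 20 GB/s / ~1.25 GB/s per 10 GigE link     => ~16 routers
   #   Servers: 20 GB/s / ~0.25 GB/s per dual-GigE server => ~80 servers
   # Note: 4 scalable units x 4 IB/10 GigE routers gives the same
   # 16 routers mentioned earlier in the thread.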
>>>
>>> In the future, we will likely be building a router cluster that
>>> will bridge our existing federated Ethernet core to a large
>>> InfiniBand network, but that is at least one year away.
>>>
>>> Most of our routers are rather simple: they have one high-speed
>>> interconnect HCA (Quadrics, Mellanox IB) and one network card (dual
>>> GigE, or single 10 GigE). I don't think we've hit any bus bandwidth
>>> limitation, and I haven't seen any of them really pressed for CPU
>>> or memory. We do make sure to turn off irq_affinity when we have a
>>> single network interface (the 10 GigE routers), and we've had to
>>> tune the buffers and credits on the routers to get better
>>> throughput. We have noticed a problem with serialization of
>>> checksum processing on a single core (bz #14690).
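The buffer and credit tuning mentioned here is done through module
parameters; a rough sketch of the kind of knobs involved (the values
are placeholders, not LLNL's settings, and the parameter names should
be verified against your Lustre version):

   # modprobe.conf on a router: enlarge the LNET router buffer pools
   options lnet tiny_router_buffers=1024 small_router_buffers=8192 large_router_buffers=512

   # Per-LND credit tuning, e.g. for the socket LND on the tcp0 side
   options ksocklnd credits=256 peer_credits=8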
>>>
>>> The beauty of routers, though, is that if you find that they are
>>> all running at capacity, you can always add a couple more, and move
>>> the bottleneck to the network or disks. We find we are mostly
>>> slowed down by the disks.
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> LC Lustre Administration Lead
>>> [EMAIL PROTECTED]
>>> 925.423.9670
>>> Pager: 1.888.203.0641
>>>
>>>
>>> On Apr 10, 2008, at 1:06 AM, Sébastien Buisson wrote:
>>>> Let's consider that the internal bus of the machine is big enough
>>>> that it will not be saturated. In that case, what will be the
>>>> limiting factor? Memory? CPU?
>>>> I know that it depends on how many IB cards are plugged into the
>>>> machine, but generally speaking, is the routing activity CPU or
>>>> memory hungry?
>>>>
>>>> By the way, are there people on this list who have feedback about
>>>> sizing Lustre routers? For instance, I know that Lustre routers
>>>> have been set up at LLNL. What is the throughput obtained via the
>>>> routers, compared to the raw bandwidth of the interconnect?
>>>>
>>>> Thanks,
>>>> Sebastien.
>>>>
>>>>
>>>> Brian J. Murrell wrote:
>>>>> On Wed, 2008-04-09 at 19:07 +0200, Sébastien Buisson wrote:
>>>>>> I mean, if I have an available bandwidth of 100 on each side of
>>>>>> a router, what will be the max reachable bandwidth from clients
>>>>>> on one side of the router to servers on the other side of the
>>>>>> router? Is it 50? 80? 99? Is the routing process CPU or memory
>>>>>> hungry?
>>>>> While I can't answer these things specifically, another important
>>>>> consideration is the bus architecture involved. How many IB cards
>>>>> can you put on a bus before you saturate the bus?
>>>>>
>>>>> b.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss