Oh, so don’t even tell the client about tcp! That seems to have immediately kicked things into place! I owe you a beverage of your choice if we ever meet up!
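For the archives, here is a minimal sketch of the client-side setting this implies. It assumes the o2ib2-side routers are still reachable at 10.6.0.[250-251]@o2ib2, as in my earlier attempt quoted below; the file name is only an example. The client declares a single route to the remote o2ib0 net through a gateway on its own local net, and nothing about tcp:

```
# /etc/modprobe.d/lnet.conf on an o2ib2 client (example file name)
# Assumption: 10.6.0.[250-251]@o2ib2 are the o2ib2-side NIDs of the routers.
# The gateway must sit on a net the client has a local interface on,
# so no tcp0 entries appear here.
options lnet routes="o2ib0 10.6.0.[250-251]@o2ib2"
```

Running `lctl show_route` on the client afterwards should list a route to o2ib with an @o2ib2 gateway, along the lines of the client entry in Chris's output.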
Seriously, the imposter syndrome was getting _bad_ the last few days here.

> On Mar 5, 2025, at 12:05 PM, Horn, Chris <[email protected]> wrote:
>
> You need LNet routes configured on all nodes. It should look something like
> this:
>
> # pdsh -w n0[0-3] 'lctl list_nids; lctl show_route' | dshbak -c
> ----------------
> server
> ----------------
> 172.18.2.5@o2ib
> net o2ib2 hops 2 gw 172.18.2.6@o2ib up pri 0
> ----------------
> router1
> ----------------
> 172.18.2.6@o2ib
> 172.18.2.2@tcp
> net o2ib2 hops 1 gw 172.18.2.3@tcp up pri 0
> ----------------
> router2
> ----------------
> 172.18.2.7@o2ib2
> 172.18.2.3@tcp
> net o2ib hops 1 gw 172.18.2.2@tcp up pri 0
> ----------------
> client
> ----------------
> 172.18.2.8@o2ib2
> net o2ib hops 2 gw 172.18.2.7@o2ib2 up pri 0
> #
>
> Chris Horn
>
> From: lustre-discuss <[email protected]> on behalf of
> John White via lustre-discuss <[email protected]>
> Date: Wednesday, March 5, 2025 at 1:17 PM
> To: [email protected] <[email protected]>
> Subject: [lustre-discuss] multi-hop routing
>
> Hello folks. I have a rare situation that I’m told some centers are
> successfully pulling off and am looking for guidance - multi-hop lnet routing.
> In short, I have 2 distinct o2ib fabrics at disparate geo sites joined by a
> routed ethernet fabric. I’m looking to use a 2-lnet-router chain to plumb
> the two o2ib fabrics together.
>
> servers on the left, clients on the right
> o2ib0(10.5.0.0/16) <-> router(o2ib0,tcp0) <-> routed eth (10.37.0.0/16,
> 10.38.0.0/16) <-> router(tcp0,o2ib2) <-> o2ib2(10.6.0.0/16)
>
> I have both sets of routers up but traffic absolutely fails the 2nd hop in
> either direction (I can `lctl ping` tcp0 from o2ib2 and o2ib0 but no further).
>
> I’ve tried adding a route ON the routers, that didn’t help.
>
> I’ve tried defining the 2nd hop on the client:
> options lnet routes="tcp0 10.6.0.[250-251]@o2ib2;\
> o2ib0 10.37.250.[162-163]@tcp0"
>
> but that failed with the following kern message on lnet load:
> 74067:0:(router.c:644:lnet_add_route()) Cannot add route with gateway
> 10.37.250.162@tcp. There is no local interface configured on LNet tcp
>
> Does anyone have any hints here? It feels like I’m a syntax change or a
> routing hint away from getting this working.
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> https://urldefense.com/v3/__http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!NpxR!keuGPb7MHd7CQc6Zi_uwIvFahK68FJfbq9MNIXgHpd0W8bi5vOYFHf-IixYY5DiOnJKx0z9-Ht8VqH1ew82XWtaTRaoq$
