Just a quick follow-up for posterity: I did seem to need to add a route for tcp on the server side. lctl ping was working, but MGS communication was failing, with the server saying it had no route back to the router:

[Wed Mar 5 15:28:26 2025] LNetError: 28576:0:(lib-move.c:2078:lnet_handle_find_routed_path()) no route to 10.38.0.250@tcp from 10.5.250.22@o2ib
[Wed Mar 5 15:28:26 2025] LNetError: 28576:0:(lib-move.c:3991:lnet_parse_get()) 10.5.250.22@o2ib: Unable to send REPLY for GET from 12345-10.38.0.250@tcp: -113

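A rough sketch of what the resulting module options might look like, using the `options lnet routes=` form from the original post quoted below. This is illustrative only: `<router-addr>` is a placeholder for the geo-local router's o2ib address, which is not given in this thread; hop counts and priorities are left out; and the client line is simply the earlier attempt with the tcp entries dropped, which is one reading of "don't even tell the client about tcp".

```
# Server side (o2ib0 fabric), e.g. in a modprobe.d file for lnet.
# <router-addr> is a placeholder, not an address from this thread.
# First entry: reach the remote o2ib2 fabric via the local router.
# Second entry: reach the intermediate tcp net via the same router,
# so replies to the router's tcp NID have a path.
options lnet routes="o2ib2 <router-addr>@o2ib; tcp <router-addr>@o2ib"

# Client side (o2ib2 fabric): only the remote o2ib net is routed,
# via the local routers; no tcp entries (assumed from the thread).
options lnet routes="o2ib0 10.6.0.[250-251]@o2ib2"
```

After loading lnet, `lctl show_route` (as used in the quoted reply) should list the configured routes on each node.
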
Adding a route to tcp via its geo-local router fixed that, and we've got mounts passing IO. Didn't seem to need to do the same for clients at all.

> On Mar 5, 2025, at 2:29 PM, John White <[email protected]> wrote:
>
> Oh, so don’t even tell the client about tcp! That seems to have immediately
> kicked things into place!
> I owe you a beverage of your choice if we ever meet up!
>
> Seriously, the imposter syndrome was getting _bad_ the last few days here.
>
>> On Mar 5, 2025, at 12:05 PM, Horn, Chris <[email protected]> wrote:
>>
>> You need LNet routes configured on all nodes. It should look something like
>> this:
>>
>> # pdsh -w n0[0-3] 'lctl list_nids; lctl show_route' | dshbak -c
>> ----------------
>> server
>> ----------------
>> 172.18.2.5@o2ib
>> net o2ib2 hops 2 gw 172.18.2.6@o2ib up pri 0
>> ----------------
>> router1
>> ----------------
>> 172.18.2.6@o2ib
>> 172.18.2.2@tcp
>> net o2ib2 hops 1 gw 172.18.2.3@tcp up pri 0
>> ----------------
>> router2
>> ----------------
>> 172.18.2.7@o2ib2
>> 172.18.2.3@tcp
>> net o2ib hops 1 gw 172.18.2.2@tcp up pri 0
>> ----------------
>> client
>> ----------------
>> 172.18.2.8@o2ib2
>> net o2ib hops 2 gw 172.18.2.7@o2ib2 up pri 0
>> #
>> Chris Horn
>> From: lustre-discuss <[email protected]> on behalf of
>> John White via lustre-discuss <[email protected]>
>> Date: Wednesday, March 5, 2025 at 1:17 PM
>> To: [email protected] <[email protected]>
>> Subject: [lustre-discuss] multi-hop routing
>> Hello folks. I have a rare situation that I’m told some centers are
>> successfully pulling off and am looking for guidance - multi-hop lnet
>> routing.
>> In short, I have 2 distinct o2ib fabrics at disparate geo sites joined by a
>> routed ethernet fabric. I’m looking to use a 2-lnet-router chain to plumb
>> the two o2ib fabrics together.
>>
>> servers on the left, clients on the right
>> o2ib0(10.5.0.0/16) <-> router(o2ib0,tcp0) <-> routed eth (10.37.0.0/16,
>> 10.38.0.0/16) <-> router(tcp0,o2ib2) <-> o2ib2(10.6.0.0/16)
>>
>> I have both sets of routers up but traffic absolutely fails the 2nd hop in
>> either direction (I can `lctl ping` tcp0 from o2ib2 and o2ib0 but no
>> further).
>>
>> I’ve tried adding a route ON the routers, that didn’t help.
>>
>> I’ve tried defining the 2nd hop on the client:
>> options lnet routes="tcp0 10.6.0.[250-251]@o2ib2;\
>> o2ib0 10.37.250.[162-163]@tcp0"
>>
>> but that failed with the following kern message on lnet load:
>> 74067:0:(router.c:644:lnet_add_route()) Cannot add route with gateway
>> 10.37.250.162@tcp. There is no local interface configured on LNet tcp
>>
>> Does anyone have any hints here? It feels like I’m a syntax change or a
>> routing hint away from getting this working.
>> _______________________________________________
>> lustre-discuss mailing list
>> [email protected]
>> https://urldefense.com/v3/__http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org__;!!NpxR!keuGPb7MHd7CQc6Zi_uwIvFahK68FJfbq9MNIXgHpd0W8bi5vOYFHf-IixYY5DiOnJKx0z9-Ht8VqH1ew82XWtaTRaoq$
>
>
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
