Hi Steve,

You're welcome for the suggestion. I offered it because you mentioned adding a couple of new OSS servers and then noticing those entries in the logs. It would help to know where you are seeing the errors - on the new nodes only, or on the existing ones as well?

Generally, a network with existing problems seems to work OK at low bandwidths, but the problems start to appear as the load increases - hence the suggestion to check the network for problems. A quick check could be made with LNet selftest between two different sets of nodes - set 1 being nodes that show the problem, and set 2 being nodes that do not. A rough sketch of such a session is below.
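Something like the following - a minimal sketch only, and the NIDs are placeholders (10.128.10.29@tcp1 is the client from your log; 10.128.10.20@tcp1 is a stand-in for one of the new OSS nodes):

    # load the selftest module on the console node (and on every test node)
    modprobe lnet_selftest
    # lst needs a session identifier in the environment
    export LST_SESSION=$$
    lst new_session read_bw
    # set 1: nodes that show the problem (placeholder NID)
    lst add_group set1 10.128.10.29@tcp1
    # set 2: nodes that do not (placeholder NID)
    lst add_group set2 10.128.10.20@tcp1
    lst add_batch bulk_read
    lst add_test --batch bulk_read --from set1 --to set2 brw read size=1M
    lst run bulk_read
    lst stat set1 set2    # watch the rates and error counts; Ctrl-C to stop
    lst end_session

If the set1-to-set2 run shows errors or poor rates while the same test between two set 2 nodes does not, that points at the problem nodes or their links.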
Best,

On Dec 11, 2016 6:05 PM, "Steve Barnet" <[email protected]> wrote:

> Hi Brett,
>
> On 12/11/16 4:46 PM, Brett Lee wrote:
>
>> Steve, It might be the network that LNet is running on. Have you run
>> some bandwidth tests without LNet to check for network problems?
>
> It's running over a 10Gb/s Ethernet network that is carrying
> other OSS traffic successfully. No routers or other fancy LNET
> features in play. However, it is quite possible that there are
> issues with the networking on the host side. Definitely on my
> list of things to test out.
>
> At this point, I'm just trying to narrow the search space.
> I didn't find anything particularly revealing when I searched
> around, so I'm hoping some expert eyes can shine a bit of
> light on the situation.
>
> Thanks for the tip!
>
> Best,
>
> ---Steve
>
>> On Dec 11, 2016 3:37 PM, "Steve Barnet" <[email protected]> wrote:
>>
>> Hi all,
>>
>> Seeing something very strange. I recently added two OSSes
>> and 10 OSTs to one of our filesystems. Things look OK under
>> light loads, but when we load them up, we start seeing lots
>> of LNet errors.
>>
>> OS: Scientific Linux 6.7
>> Lustre - Server: 2.8.0 Community version
>> Lustre - Client: 2.5.3
>>
>> The errors are below. Do these narrow the range of possible
>> problems?
>>
>> Dec 11 11:17:39 lfs-ex-oss-20 kernel: LNetError:
>> 7732:0:(socklnd_cb.c:2509:ksocknal_check_peer_timeouts()) Total 4
>> stale ZC_REQs for peer 10.128.10.29@tcp1 detected; the
>> oldest(ffff880f6a90e000) timed out 7 secs ago, resid: 0, wmem: 0
>> Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>> 7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>> -5, desc ffff8805379f8000
>> Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>> 7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>> -5, desc ffff880f375dc000
>> Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>> 8234:0:(ldlm_lib.c:3175:target_bulk_io()) @@@ network error on bulk
>> READ req@ffff880e506263c0 x1551187318090340/t0(0)
>> o3->[email protected]@tcp1:587/0
>> lens 488/432 e 3 to 0 dl 1481476687 ref 1 fl Interpret:/0/0 rc 0/0
>> Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>> 8234:0:(ldlm_lib.c:3175:target_bulk_io()) Skipped 1 previous similar
>> message
>> Dec 11 11:17:39 lfs-ex-oss-20 kernel: Lustre: lfs2-OST0024: Bulk IO
>> read error with 092e941d-272a-09e3-502b-9338dbf387d3 (at
>> 10.128.10.29@tcp1), client will retry: rc -110
>> Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>> 7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>> -5, desc ffff8804db0ce000
>> Dec 11 11:17:39 lfs-ex-oss-20 kernel: LustreError:
>> 7732:0:(events.c:447:server_bulk_callback()) event type 5, status
>> -5, desc ffff880aa4374000
>>
>> Thanks much!
>>
>> Best,
>>
>> ---Steve
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
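Re the bandwidth test without LNet mentioned up-thread: a plain TCP check between the same pair of hosts would do. A sketch, assuming iperf3 is installed on both ends (the hostname is from your log):

    # on the OSS
    iperf3 -s

    # on the client; run for 30 seconds with 4 parallel streams
    iperf3 -c lfs-ex-oss-20 -t 30 -P 4

If that cannot hold close to 10Gb/s line rate, the Ethernet side is worth a look before digging further into Lustre.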
