Hi Colin,

I’ve done some more digging and found that on the affected nodes the messages 
repeat at ~10 min intervals.
I can also see a lot of these errors in the MDS log:

Nov 25 10:56:02 mds01 kernel: LustreError: 
10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) 
lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11
Nov 25 10:56:02 mds01 kernel: LustreError: 
10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) Skipped 4 
previous similar messages
Nov 25 11:08:39 mds01 kernel: LustreError: 
10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) 
lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11
Nov 25 11:08:39 mds01 kernel: LustreError: 
10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) Skipped 4 
previous similar messages
Nov 25 11:21:16 mds01 kernel: LustreError: 
10370:0:(osp_precreate.c:964:osp_precreate_cleanup_orphans()) 
lustre01-OST000c-osc-MDT0000: cannot cleanup orphans: rc = -11

As you can see, these refer to another OST, and they are repeated roughly every 
12-13 mins (12 min 37 s between the entries above).
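
To check the repeat interval, the gaps between the three MDS timestamps quoted above can be computed with a short script. This is only a sketch: the year (2021) is assumed from the date of this thread, since syslog does not record it, and the helper name is made up for illustration. Note that the kernel also reports "Skipped 4 previous similar messages", so the logged interval may reflect message rate limiting rather than the actual retry period.

```python
from datetime import datetime

# Timestamps copied from the MDS log lines quoted above. Syslog does not
# record the year, so 2021 is an assumption taken from the date of this thread.
stamps = ["Nov 25 10:56:02", "Nov 25 11:08:39", "Nov 25 11:21:16"]

def gaps_in_seconds(stamps, year=2021):
    """Return the gaps, in seconds, between consecutive syslog timestamps."""
    times = [datetime.strptime(f"{year} {s}", "%Y %b %d %H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

gaps = gaps_in_seconds(stamps)
for g in gaps:
    print(f"{g // 60} min {g % 60} s")
```

Both gaps come out identical, which suggests a fixed timer rather than load-dependent behaviour.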

On oss03 (serving ost000a – ost000e), no errors are logged after rebooting the 
clients, but I can see these messages:

Nov 25 19:08:02 oss03 kernel: Lustre: 
19728:0:(service.c:1372:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time 
(5/-150), not sending early reply#012  req@ffff9c6ec5550850 
x1713320906932288/t0(0) 
>[email protected]@o2ib:662/0
 lens 432/0 e 0 to 0 dl 1637863687 ref 2 fl New:/0/ffffffff rc 0/-1
Nov 25 19:08:02 oss03 kernel: Lustre: 
19728:0:(service.c:1372:ptlrpc_at_send_early_reply()) Skipped 4 previous 
similar messages
Nov 25 19:11:23 oss03 kernel: Lustre: lustre01-OST000b: Export ffff9c42c996fc00 
already connecting from 192.168.1.13@o2ib
Nov 25 19:11:23 oss03 kernel: Lustre: lustre01-OST000a: Export ffff9c4f43fb3c00 
already connecting from 192.168.1.13@o2ib
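
As a rough aid to reading the early-reply message above: the "dl" field appears to be the request deadline in epoch seconds, and the first number in "(5/-150)" looks like the time left before that deadline. The sketch below checks this against the log timestamp. It assumes the server clock is on UTC+1 (CET), which is not stated anywhere in the logs, and the interpretation of the fields is my reading rather than something the log itself confirms.

```python
from datetime import datetime, timedelta, timezone

DL_EPOCH = 1637863687                     # "dl" field from the oss03 log line above
LOGGED = (19, 8, 2)                       # "Nov 25 19:08:02" from the same line
LOCAL_TZ = timezone(timedelta(hours=1))   # assumption: server clock is UTC+1 (CET)

# Convert the deadline to local time and compare it with the log timestamp.
deadline_local = datetime.fromtimestamp(DL_EPOCH, tz=LOCAL_TZ)
logged_local = datetime(2021, 11, 25, *LOGGED, tzinfo=LOCAL_TZ)
remaining = int((deadline_local - logged_local).total_seconds())

print(deadline_local.isoformat())         # deadline in local time
print(remaining)                          # seconds left when the message was logged
```

Under that timezone assumption the deadline falls 5 seconds after the log entry, which matches the 5 in "(5/-150)".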

I also checked the InfiniBand network and found no errors.
Servers are running CentOS 7.9 with Lustre 2.12.6 / zfs 3.10.0
Clients are running CentOS 7.2 with Lustre 2.8.0

Does this look like a problem on oss03?

Regards, Hallstein




From: Colin Faber <[email protected]>
Sent: Thursday, 25 November 2021 18:11
To: Hallstein Løhre <[email protected]>
Cc: [email protected]
Subject: Re: [lustre-discuss] ost_connect to node failed

-114 == EALREADY (operation already in progress). What does the logging look 
like on both sides of the connection?

-cf
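
For reference, the rc values in this thread are negated Linux errno codes, and a small lookup makes them easier to read. The table below is hard-coded from the generic Linux errno numbering so that it gives the same answer on any platform; the function name `decode_rc` is made up for this sketch and is not a Lustre tool.

```python
# Linux errno numbers (generic numbering), hard-coded so the lookup does not
# depend on the platform this runs on. Lustre rc values are negated errnos.
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    114: ("EALREADY", "Operation already in progress"),
    115: ("EINPROGRESS", "Operation now in progress"),
}

def decode_rc(rc):
    """Map a negative Lustre return code to its errno name and message."""
    name, text = LINUX_ERRNO.get(-rc, ("UNKNOWN", "not in this table"))
    return f"rc = {rc}: {name} ({text})"

print(decode_rc(-114))  # the client-side ost_connect failure
print(decode_rc(-11))   # the rc seen in the MDS log elsewhere in this thread
```

On a Linux host, `os.strerror(114)` should give the same text without the table.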


On Thu, Nov 25, 2021 at 5:18 AM Hallstein Løhre 
<[email protected]> wrote:

Hi,

After some trouble with runaway processes yesterday, I had to reboot several 
Lustre clients. Now some of these show the following entries in 
/var/log/messages:

Nov 25 11:09:51 nodexx kernel: LustreError: 11-0: 
lustre01-OST000a-osc-ffff887ee3207800: operation ost_connect to node 
192.168.1.xxx@o2ib failed: rc = -114

The filesystem seems OK, but the stuck processes might have accessed file(s) on 
OST000a. There seems to be no hardware problem; the OSTs are all ZFS volumes 
with status OK.
I suspended writing to OST000a, but after rebooting the clients and checking 
for hardware problems, I have re-enabled writing.
Any explanation for rc = -114?


Best Regards

Hallstein Løhre

ALPHA SYSTEM AS


_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org