I can't comment on the LNet peer discovery part, but I would definitely not
recommend leaving lnet_transaction_timeout that low for normal usage. This
can cause messages to be dropped while the server is still processing them,
introducing failures needlessly.

Cheers, Andreas
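
A minimal sketch of one way to combine the advice above with the workaround
quoted below: lower the timeout only around a single command, then put the
normal value back. It is not from either poster. It assumes the low value is
wanted only while that command runs, and it has not been shown that this
avoids the dropped-message risk during that window. The function name and
the LNETCTL override variable are hypothetical; the only real command used is
"lnetctl set transaction_timeout", as in the quoted message.

```shell
#!/bin/sh
# Sketch only: temporarily lower the LNet transaction timeout, run one
# command, then restore the normal value.
# LNETCTL is a hypothetical override; set LNETCTL=echo to preview the
# sequence without changing anything.
LNETCTL="${LNETCTL:-lnetctl}"

with_transaction_timeout() {
    low="$1"
    normal="$2"
    shift 2
    # Apply the temporary value; give up if that fails.
    $LNETCTL set transaction_timeout "$low" || return 1
    # Run the wrapped command and remember its exit status.
    "$@"
    status=$?
    # Restore the normal value even if the wrapped command failed.
    $LNETCTL set transaction_timeout "$normal"
    return "$status"
}
```

With the values from the quoted message this would be called as
"with_transaction_timeout 5 50" followed by the site's usual mount command.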

> On Oct 26, 2023, at 09:48, Bertschinger, Thomas Andrew Hjorth via 
> lustre-discuss <[email protected]> wrote:
> 
> Hello,
> 
> Recently we had an OSS node down for an extended period with hardware 
> problems. While the node was down, mounting lustre on a client took an 
> extremely long time to complete (20-30 minutes). Once the fs is mounted, all 
> operations are normal and there isn't any noticeable impact from the absent 
> node.
> 
> While the client is mounting, the client's debug log shows entries like this 
> slowly going by:
> 
> 00000020:00000080:87.0:1698333195.993098:0:3801046:0:(obd_config.c:1384:class_process_config())
>  processing cmd: cf005
> 00000020:00000080:87.0:1698333195.993099:0:3801046:0:(obd_config.c:1396:class_process_config())
>  adding mapping from uuid 10.1.2.3@o2ib to nid 0x500000abcd123 (10.1.2.4@o2ib)
> 
> and there is a "llog_process_th" kernel thread hanging in 
> lnet_discover_peer_locked().
> 
> We have peer discovery enabled on our clients, but disabling peer discovery 
> on a client causes the mount to complete quickly. Also, once the down OSS was 
> fixed and powered back on, mounting completed normally again.
> 
> We also found that reducing the following timeout sped up the mount by a 
> factor of ~10:
> 
> $ lnetctl set transaction_timeout 5    # was 50 originally
> 
> Is such a dramatic slowdown normal in this situation? Is there any fix (aside 
> from disabling peer discovery or tuning down the timeout) that could speed up 
> mounts in case we have another OSS down in the future?
> 
> Lustre version (server and client): 2.15.3
> 
> Thanks, 
> Thomas Bertschinger
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org