Andreas,
Thanks for the quick reply. The client version is 2.14.0_ddn173. The
server also reports target_version: 2.14.0.173. This originally
started as the result of a user input error that requested an OST that
does not exist. For my simple test case I request an OST that does not
exist, and probably never will exist. This issue is on Pleiades at
NAS/NASA, which doesn't change very much. I doubt that this is related
to an OST or MDT that may have been recently added.
The admins are checking on LU-17334.
The admins also noticed thousands of error messages:
[root@r593i4n16 ~]# dmesg -T |grep LustreError
[Wed Apr 9 15:36:22 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:23 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:23 2025] LustreError: Skipped 1709 previous similar
messages
[Wed Apr 9 15:36:24 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:24 2025] LustreError: Skipped 3491 previous similar
messages
[Wed Apr 9 15:36:26 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:26 2025] LustreError: Skipped 7803 previous similar
messages
[Wed Apr 9 15:36:30 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:30 2025] LustreError: Skipped 14891 previous similar
messages
[Wed Apr 9 15:36:38 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:38 2025] LustreError: Skipped 29887 previous similar
messages
[Wed Apr 9 15:36:54 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:54 2025] LustreError: Skipped 63032 previous similar
messages
[Wed Apr 9 15:37:26 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:37:26 2025] LustreError: Skipped 120772 previous similar
messages
[Wed Apr 9 15:38:30 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:38:30 2025] LustreError: Skipped 238498 previous similar
messages
[Wed Apr 9 15:40:38 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:40:38 2025] LustreError: Skipped 515538 previous similar
messages
[Wed Apr 9 15:44:54 2025] LustreError: 11-0:
nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node
10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:44:54 2025] LustreError: Skipped 1040417 previous similar
messages
[root@r593i4n16 ~]#
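For anyone who wants to quantify the log above, here is a minimal Python sketch (not part of the reproducer). It decodes the rc value with the standard errno table and estimates the failure rate from the "Skipped N previous similar messages" lines. It assumes each skipped count covers the interval since the previous "Skipped" line, which matches the doubling intervals in the output but is my reading of the log, not something confirmed by the Lustre code. The names LOG, decode_rc and skipped_rates are made up for this illustration, and the sample lines are copied from the output above with the mail wrapping undone.

```python
import errno
import re
from datetime import datetime

# Sample lines copied from the dmesg output above (mail line-wrapping undone).
LOG = """\
[Wed Apr 9 15:38:30 2025] LustreError: Skipped 238498 previous similar messages
[Wed Apr 9 15:40:38 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:40:38 2025] LustreError: Skipped 515538 previous similar messages
[Wed Apr 9 15:44:54 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:44:54 2025] LustreError: Skipped 1040417 previous similar messages
"""

RC_RE = re.compile(r"rc = (-?\d+)")
SKIPPED_RE = re.compile(r"\[(.+?)\] LustreError: Skipped (\d+) previous similar messages")
TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # format printed by "dmesg -T"


def decode_rc(log):
    """Return the set of symbolic errno names for every 'rc = N' in the log."""
    return {errno.errorcode.get(abs(int(rc)), "UNKNOWN") for rc in RC_RE.findall(log)}


def skipped_rates(log):
    """Estimate suppressed messages per second between consecutive 'Skipped' lines.

    Assumption: each count covers the time since the previous 'Skipped' line.
    """
    events = [(datetime.strptime(ts, TS_FORMAT), int(n))
              for ts, n in SKIPPED_RE.findall(log)]
    rates = []
    for (t_prev, _), (t_cur, count) in zip(events, events[1:]):
        seconds = (t_cur - t_prev).total_seconds()
        if seconds > 0:
            rates.append((t_cur, count / seconds))
    return rates


if __name__ == "__main__":
    print("errno names:", decode_rc(LOG))
    for when, rate in skipped_rates(LOG):
        print(f"{when:%H:%M:%S}  ~{rate:.0f} suppressed messages/s")
```

On these sample lines, rc = -19 decodes to ENODEV, and the estimate works out to roughly 4,000 failed ldlm_enqueue attempts per second. That is consistent with a tight retry loop rather than a retry that sleeps between attempts.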
John
On 4/9/2025 4:58 PM, Andreas Dilger wrote:
On Apr 9, 2025, at 14:28, John Bauer via lustre-discuss
<[email protected]> wrote:
I have created a small reproducer program (81 lines of code) that
results in a process that appears to hang in the kernel, accumulating
cpu time. The process is unresponsive to kill commands. From gdb
backtrace, it appears the call is stuck somewhere in fsetxattr()
which is called by llapi_layout_file_open(). The problem happens
only when a non-existent ost is added to the layout with a call to
llapi_layout_ost_index_set(). The call to llapi_layout_sanity(),
just before calling llapi_layout_file_open(), returns 0. Is this a
known issue?
Hard to say for sure.
I suspect this is related to LU-17334, which relates to newly-added
MDTs and OSTs in the filesystem. There were a few patches which
recently landed in 2.16.0 (and backported) that will sleep and retry
for a short time to handle the case where a client accesses a file or
directory layout that references an OST or MDT that it doesn't know
about. The assumption is that the OST/MDT is newly added and the
configuration update hasn't quite made it to the client yet. The
client should retry to contact the new server for some time before
giving up and returning an error (in case the layout is actually bad).
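To illustrate the "retry for some time before giving up" behaviour described above, here is a generic Python sketch of a bounded retry loop. It is not the Lustre implementation and does not reflect the actual LU-17334 patches. The function name, the timeout and interval values, and the choice of ENODEV as the retryable error (borrowed from the rc = -19 seen in this thread) are all assumptions made for illustration.

```python
import errno
import time


def call_with_bounded_retry(op, timeout_s=5.0, interval_s=0.5,
                            clock=time.monotonic, sleep=time.sleep):
    """Call op() until it succeeds or timeout_s has elapsed.

    Only ENODEV is treated as "target not known yet, try again" (an
    assumption for this sketch). Any other error, or ENODEV after the
    deadline, is re-raised to the caller.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            return op()
        except OSError as exc:
            if exc.errno != errno.ENODEV or clock() >= deadline:
                raise  # give up: the layout is probably bad
            sleep(interval_s)  # back off instead of spinning


if __name__ == "__main__":
    def missing_target():
        # Hypothetical operation that always refers to a nonexistent target.
        raise OSError(errno.ENODEV, "No such device")

    try:
        call_with_bounded_retry(missing_target, timeout_s=2.0)
    except OSError as exc:
        print("gave up with errno", errno.errorcode[exc.errno])
```

The two properties that matter here are the sleep between attempts and the hard deadline: together they ensure a layout that names a target which never appears ends in an error returned to the caller, not an unbounded loop.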
Whether this is fixed in your version depends on what the version is
(not mentioned in your email). It may also be important what the
server version is, which can be seen from "lctl get_param mdc.*.import
| grep target_version", if you can access this parameter. If your
client & server versions have the LU-17334 fixes, then this would be
unexpected, and if older versions then I'd say it is something I'd
rather not revisit until the known fixes are in place.
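As a rough aid for the version check described above, here is a small Python sketch that compares the base release number of a version string against 2.16.0. The helper names are hypothetical, and the parsing only looks at the leading major.minor.patch digits. Because the fixes were also backported, a base version below 2.16.0 does not prove the fixes are absent; that still has to be checked against the vendor's patch list.

```python
import re

# Release named in this thread as the one where the retry patches landed.
FIX_RELEASE = (2, 16, 0)


def base_version(text):
    """Extract the first major.minor.patch triple from a version string.

    Vendor suffixes such as "_ddn173" or a fourth ".173" component are
    ignored; this is a simplification for illustration.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def predates_fix_release(version):
    """True if the base version is older than FIX_RELEASE (ignores backports)."""
    return base_version(version) < FIX_RELEASE


if __name__ == "__main__":
    for v in ("2.14.0_ddn173", "target_version: 2.14.0.173"):
        print(v, "->", base_version(v), "predates 2.16.0:", predates_fix_release(v))
```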
Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org