On Apr 9, 2025, at 14:28, John Bauer via lustre-discuss 
<[email protected]> wrote:

> I have created a small reproducer program (81 lines of code) that results in a
> process that appears to hang in the kernel, accumulating cpu time.  The process
> is unresponsive to kill commands.  From gdb backtrace, it appears the call is
> stuck somewhere in fsetxattr() which is called by llapi_layout_file_open().
> The problem happens only when a non-existent ost is added to the layout with a
> call to llapi_layout_ost_index_set().  The call to llapi_layout_sanity(), just
> before calling llapi_layout_file_open(), returns 0.  Is this a known issue?

Hard to say for sure.

I suspect this is related to LU-17334, which relates to newly-added MDTs and 
OSTs in the filesystem. There were a few patches which recently landed in 
2.16.0 (and backported) that will sleep and retry for a short time to handle 
the case where a client accesses a file or directory layout that references an 
OST or MDT that it doesn't know about.  The assumption is that the OST/MDT is 
newly added and the configuration update hasn't quite made it to the client 
yet.  The client should keep retrying the new server for some time before 
giving up and returning an error (in case the layout is actually bad).
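
To make the sleep-and-retry idea concrete, here is a minimal sketch in Python. It is not the Lustre client code and does not reflect the actual patches. The function name, the `lookup` callback, the timeout values, and the choice of ENODEV as the final error are all assumptions made up for illustration.

```python
import errno
import time


def lookup_with_retry(lookup, index, timeout=5.0, interval=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Retry lookup(index) until it returns a target or the timeout elapses.

    `lookup` is a hypothetical callback that returns None while the
    target index is still unknown to the client.
    """
    deadline = clock() + timeout
    while True:
        target = lookup(index)
        if target is not None:
            # The configuration update has arrived; the target is known now.
            return target
        if clock() >= deadline:
            # Still unknown after the grace period: treat the layout as bad.
            # ENODEV is an illustrative choice, not the real client's errno.
            raise OSError(errno.ENODEV, f"target index {index} not found")
        sleep(interval)
```

The point of the bounded deadline is that a layout naming an index that never shows up ends in an error rather than an endless wait.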

Whether this is fixed in your version depends on which version you are running 
(not mentioned in your email).  The server version may also matter; it can be 
seen from "lctl get_param mdc.*.import | grep target_version", if you can 
access this parameter.  If your client and server versions have the LU-17334 
fixes, then this would be unexpected.  If they are older versions, then I'd 
rather not revisit this until the known fixes are in place.
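
If it helps to collect the server versions from several clients, the grep above could be scripted roughly as follows. This is only a sketch: it assumes the import file contains lines of the form "target_version: a.b.c.d", and the sample text in the comment is made up, not captured from a real system.

```python
import re


def parse_target_versions(import_text):
    """Return each target_version found in the text as a tuple of ints."""
    pattern = re.compile(r"^\s*target_version:\s*(\d+(?:\.\d+)*)", re.MULTILINE)
    return [tuple(int(part) for part in match.group(1).split("."))
            for match in pattern.finditer(import_text)]


# Made-up sample resembling one import entry (an assumption, not real output).
sample = (
    "import:\n"
    "    name: testfs-MDT0000-mdc-ffff8800deadbeef\n"
    "    target_version: 2.15.4.0\n"
)
versions = parse_target_versions(sample)
```

The text would normally come from running the lctl command quoted above and capturing its output. Since the fixes were also backported, a version number below 2.16.0 does not by itself show that they are missing.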

Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN




_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
