Hello,
I'm tracking down a vexing issue: somehow users are creating files that
cause any attempt to examine or manipulate them to hang. The files can't
be removed or examined; even an ls command hangs.
An strace of an ls command run against some of these files produced the
following:
getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stdout.030408_124.out",
"system.posix_acl_default", NULL, 0) = -1 ENODATA (No data available)
lstat("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err",
{st_mode=S_IFREG|0644, st_size=16979, ...}) = 0
getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err",
"system.posix_acl_access", NULL, 0) = -1 ENODATA (No data available)
getxattr("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stderr.030408_118.err",
"system.posix_acl_default", NULL, 0) = -1 ENODATA (No data available)
lstat("/volatile/halld/home/haoli/RunPeriod-2017-01/analysis/ver36_Mar27/log/030408/stdout.030408_000.out",
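For anyone who wants to reproduce the capture, this is roughly how I gather the trace; the path is a placeholder for one of the affected files, and the timeout wrapper keeps a hung call from wedging the debugging shell as well:

```shell
# Placeholder path -- substitute one of the affected files.
f=/path/to/suspect/file

# timeout(1) kills the probe if a syscall never returns, so the
# trace simply ends at the call that hung.
timeout 30 strace -f -tt -o /tmp/ls-hang.strace ls -l "$f"
echo "ls exit status: $?"    # 124 means the timeout fired, i.e. the operation hung
tail -n 5 /tmp/ls-hang.strace
```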
On one client that was being used to debug this problem (running CentOS 6.5
and Lustre 2.5.42) I see the following:
Lustre: 2753:0:(client.c:1920:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1585677903/real 1585677903]
req@ffff880238fdd000 x1657754295856608/t0(0)
o101->[email protected]@o2ib:28/4 lens 328/400
e 1 to 1 dl 1585678504 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
Lustre: lustre19-OST0028-osc-ffff88105fecd000: Connection to lustre19-OST0028
(at 172.17.0.99@o2ib) was lost; in progress operations using this service will
wait for recovery to complete
Lustre: lustre19-OST0028-osc-ffff88105fecd000: Connection restored to
lustre19-OST0028 (at 172.17.0.99@o2ib)
Lustre: 2733:0:(client.c:304:ptlrpc_at_adj_net_latency()) Reported service time
465 > total measured time 96
LustreError: 2733:0:(layout.c:2005:__req_capsule_get()) @@@ Wrong buffer for
field `dlm_rep' (1 of 1) in format `LDLM_ENQUEUE_LVB': 0 vs. 112 (server)
req@ffff8802149c5800 x1657754295862208/t0(0)
o101->[email protected]@o2ib:28/4 lens 328/192
e 0 to 0 dl 1585679147 ref 1 fl Interpret:R/2/0 rc 0/0
LustreError: 2733:0:(layout.c:2005:__req_capsule_get()) Skipped 1 previous
similar message
On newer systems (RHEL 7.7 running Lustre 2.10.8-1) the problem also occurs,
and I see the following in dmesg:
[5437818.727792] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection
restored to 172.17.0.99@o2ib (at 172.17.0.99@o2ib)
[5438419.769959] Lustre: 2747:0:(client.c:2116:ptlrpc_expire_one_request()) @@@
Request sent has timed out for slow reply: [sent 1585678345/real 1585678345]
req@ffff884834ec0600 x1657002677904272/t0(0)
o101->[email protected]@o2ib:28/4 lens 344/400
e 24 to 1 dl 1585678946 ref 1 fl Rpc:X/2/ffffffff rc -11/-1
[5438419.769973] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection to
lustre19-OST0028 (at 172.17.0.99@o2ib) was lost; in progress operations using
this service will wait for recovery to complete
[5438419.770435] Lustre: lustre19-OST0028-osc-ffff886229244000: Connection
restored to 172.17.0.99@o2ib (at 172.17.0.99@o2ib)
df and lfs df commands on these systems aren't hanging.
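Since df and lfs df still respond while per-file operations hang, the problem looks confined to objects on the one OST named in the logs. A sketch of what I've been checking from a client, assuming the affected files stripe onto OST0028 (the file path is a placeholder, and parameter globs may need adjusting per version):

```shell
# Placeholder path -- substitute one of the affected files.
f=/path/to/suspect/file

# Which OST objects back this file?  A hang confined to one OST
# (OST0028 in the logs above) should show up as stripes on that index.
lfs getstripe "$f"

# The client's view of its connection to that OSC; a FULL state means
# the import thinks it is healthy even while individual RPCs time out.
lctl get_param osc.*OST0028*.state
```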
On the OSS (running Lustre 2.12.1-1 and CentOS 7.6) I'm seeing the
following:
[Mar31 14:26] LNet: Service thread pid 309349 was inactive for 1121.43s. The
thread might be hung, or it might only be slow and will resume later. Dumping
the stack trace for debugging purposes:
[ +0.000007] Pid: 309349, comm: ll_ost00_088 3.10.0-957.10.1.el7_lustre.x86_64
#1 SMP Tue Apr 30 22:18:15 UTC 2019
[ +0.000002] Call Trace:
[ +0.000014] [<ffffffffc130d560>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
[ +0.000074] [<ffffffffc12cafe5>] ldlm_run_ast_work+0xd5/0x3a0 [ptlrpc]
[ +0.000039] [<ffffffffc12ec54b>] ldlm_glimpse_locks+0x3b/0x100 [ptlrpc]
[ +0.000045] [<ffffffffc16a63fb>] ofd_intent_policy+0x69b/0x920 [ofd]
[ +0.000015] [<ffffffffc12cbd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[ +0.000038] [<ffffffffc12f4587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[ +0.000046] [<ffffffffc137c502>] tgt_enqueue+0x62/0x210 [ptlrpc]
[ +0.000072] [<ffffffffc1380e5a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[ +0.000065] [<ffffffffc132680b>] ptlrpc_server_handle_request+0x24b/0xab0
[ptlrpc]
[ +0.000056] [<ffffffffc132a13c>] ptlrpc_main+0xafc/0x1fc0 [ptlrpc]
[ +0.000054] [<ffffffff8dac1c71>] kthread+0xd1/0xe0
[ +0.000007] [<ffffffff8e175c1d>] ret_from_fork_nospec_begin+0x7/0x21
[ +0.000009] [<ffffffffffffffff>] 0xffffffffffffffff
[ +0.000036] LustreError: dumping log to /tmp/lustre-log.1585679216.309349
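My reading of the stack trace is that the ll_ost thread is stuck in ldlm_glimpse_locks, i.e. the OST is waiting on a glimpse callback that some client never answers. If that's right, evicting the unresponsive client should release the stuck locks. A sketch of what I'd try on the OSS, hedged because the exact parameter names and eviction syntax may differ across Lustre versions, and eviction is disruptive to the evicted client:

```shell
# Decode the binary debug dump the OSS wrote when the thread stalled.
lctl debug_file /tmp/lustre-log.1585679216.309349 /tmp/lustre-log.txt

# List the clients connected to the suspect OST ...
lctl get_param obdfilter.lustre19-OST0028.exports.*.uuid

# ... and, once the unresponsive client is identified, evict it so its
# pending lock callbacks are cancelled.  <client_nid> is a placeholder.
lctl set_param obdfilter.lustre19-OST0028.evict_client=<client_nid>@o2ib
```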
Thank you,
Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org