All,

We are seeing what looks like the same MDS LBUG with increasing
frequency; a sample stack trace is below. It seems to affect only one
client at a time, and that client recovers after some time (usually
minutes, sometimes longer) and continues to work without requiring an
immediate MDS reboot.

In the recent past it seems to have affected one specific client more
often than others. This client is mainly an NFS exporter for the Lustre
file system. All attempts to trigger the LBUG with known actions have
been unsuccessful so far. Attempts to trigger it on the test file system
have also failed, but we are still working on this.

As far as I can see, this could be bug 17764
(https://bugzilla.lustre.org/show_bug.cgi?id=17764), but there has been
no recent activity on it and I'm not entirely sure it is the same bug.

As far as I can see, the log dumps don't contain any useful information,
but I'm happy to provide a sample file if someone offers to look at it.

I'm also looking for suggestions on how to go about debugging this
problem, ideally starting with as little performance impact as possible
so that we can apply it on the production system until we can reproduce
the problem on a test file system. Once we can reproduce it there,
debugging with performance implications should be possible as well.

The MDS and clients are currently running Lustre 1.8.3.ddn3.3 on Red Hat
Enterprise Linux 5.

> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 8ad94b2:0cae8d46 (ffff8101995b0300) inode ffff81041d4e8548/145593522/212766022
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) LBUG
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Lustre: 4037:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 4037
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ll_mdt_49     R  running task       0  4037      1          4038  4036 (L-TLB)
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  ffff810226da0d00 ffff810247120000 0000000000000286 0000000000000082
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  0000008100001400 ffff8101db219ef8 0000000000000001 0000000000000001
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  ffff8101ead74db8 0000000000000000 ffff810423223e10 ffffffff8008aee7
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Call Trace:
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8008aee7>] __wake_up_common+0x3e/0x68
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff887acee8>] :ptlrpc:ptlrpc_main+0x1258/0x1420
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8008cabd>] default_wake_function+0x0/0xe
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff800b7310>] audit_syscall_exit+0x336/0x362
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff887abc90>] :ptlrpc:ptlrpc_main+0x0/0x1420
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: 
> Jul  6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: dumping log to /tmp/lustre-log.1309949645.4037
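
One zero-impact way to quantify the "increasing frequency" might be to
tally the LBUG lines offline from a copy of the MDS syslog. The following
is only a rough sketch under some assumptions: the syslog lines are
unwrapped (one event per line, unlike the mail-wrapped quote above), the
helper names count_lbugs and parse_dump_name are made up for illustration,
and the dump file naming lustre-log.<epoch seconds>.<pid> is inferred from
the single sample above rather than taken from any documentation.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed shape of an unwrapped syslog line, e.g.:
# Jul  6 11:54:05 host kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) LBUG
LBUG_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"kernel:\s+LustreError:\s+\d+:\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):\w+\(\)\)\s+LBUG\s*$"
)

# Assumed dump file naming: lustre-log.<epoch seconds>.<pid>
DUMP_RE = re.compile(r"lustre-log\.(?P<epoch>\d+)\.(?P<pid>\d+)$")


def count_lbugs(lines):
    """Count LBUG events per (month, day) and per (source file, line)."""
    per_day = Counter()
    per_site = Counter()
    for line in lines:
        m = LBUG_RE.match(line)
        if m is None:
            continue  # not an LBUG line
        per_day[(m["month"], int(m["day"]))] += 1
        per_site[(m["file"], int(m["line"]))] += 1
    return per_day, per_site


def parse_dump_name(path):
    """Split a dump file path into (UTC timestamp, pid of dumping thread)."""
    m = DUMP_RE.search(path)
    if m is None:
        raise ValueError("unexpected dump file name: %r" % path)
    when = datetime.fromtimestamp(int(m["epoch"]), tz=timezone.utc)
    return when, int(m["pid"])


if __name__ == "__main__":
    import sys

    # Usage: python lbug_tally.py /path/to/copy/of/messages
    with open(sys.argv[1], errors="replace") as fh:
        days, sites = count_lbugs(fh)
    for key, n in sorted(days.items()):
        print(key, n)
    for key, n in sites.most_common():
        print(key, n)
```

Running this against a copy of the log on another machine keeps the MDS
itself untouched; the per-day counts could then be compared with the NFS
exporter's activity to look for a pattern.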

Kind regards,
Frederik
-- 
Frederik Ferner
Computer Systems Administrator          phone: +44 1235 77 8624
Diamond Light Source Ltd.               mob:   +44 7917 08 5110
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss