All,

we are experiencing what looks like the same MDS LBUG with increasing frequency; see below for a sample stack trace. It seems to affect only one client at a time, and even that client recovers after some time (usually minutes, sometimes longer) and continues to work, without requiring an immediate MDS reboot.

Recently it has affected one specific client more often than others. This client is mainly an NFS exporter for the Lustre file system. All attempts to trigger the LBUG with known actions have been unsuccessful so far. Attempts to trigger it on the test file system have not succeeded either, but we are still working on this.

As far as I can see, this could be https://bugzilla.lustre.org/show_bug.cgi?id=17764, but that bug has seen no recent activity and I'm not entirely sure it is the same one. The log dumps don't seem to contain any useful information, but I'm happy to provide a sample file if someone offers to look at it.

I'm also looking for suggestions on how to debug this problem, ideally starting with as little performance impact as possible so that we can apply it on the production system until we can reproduce the problem on a test file system. Once we can reproduce it there, debugging with a performance impact will be possible as well.

The MDS and clients are currently running Lustre 1.8.3.ddn3.3 on Red Hat Enterprise 5.

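To put numbers on "increasing frequency" without touching the MDS itself, the syslog can be tallied offline. The following is only a rough sketch: the function name `count_lbugs`, the regular expression, and the assumption that every LBUG entry looks like the one in the sample trace in this message are illustrative, not part of Lustre.

```python
import re
import sys
from collections import Counter

# Assumed entry shape, taken from the sample trace in this message:
#   "Jul 6 11:54:05 <host> kernel: LustreError: <pid>:0:(<file>:<line>:<func>()) LBUG"
LBUG_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+"
    r"(?P<host>\S+)\s+kernel:\s+LustreError:\s+"
    r"(?P<pid>\d+):\d+:\((?P<src>[\w.-]+):(?P<line>\d+):\w+\(\)\)\s+LBUG\s*$"
)


def count_lbugs(lines):
    """Count LBUG entries per (month, day, source file, source line)."""
    counts = Counter()
    for entry in lines:
        match = LBUG_RE.match(entry)
        if match:
            key = (
                match.group("month"),
                int(match.group("day")),
                match.group("src"),
                int(match.group("line")),
            )
            counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_lbugs.py /path/to/syslog-file
    with open(sys.argv[1], errors="replace") as handle:
        # Counter preserves insertion order, so output follows the log order.
        for key, hits in count_lbugs(handle).items():
            print(key, hits)
```

Since this only reads a copy of the log, it has no performance impact on the production MDS.
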
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 8ad94b2:0cae8d46 (ffff8101995b0300) inode ffff81041d4e8548/145593522/21276602 2
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: 4037:0:(mds_open.c:1295:mds_open()) LBUG
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Lustre: 4037:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 4037
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ll_mdt_49 R running task 0 4037 1 4038 4036 (L-TLB)
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ffff810226da0d00 ffff810247120000 0000000000000286 0000000000000082
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: 0000008100001400 ffff8101db219ef8 0000000000000001 0000000000000001
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: ffff8101ead74db8 0000000000000000 ffff810423223e10 ffffffff8008aee7
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: Call Trace:
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff8008aee7>] __wake_up_common+0x3e/0x68
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff887acee8>] :ptlrpc:ptlrpc_main+0x1258/0x1420
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff8008cabd>] default_wake_function+0x0/0xe
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff800b7310>] audit_syscall_exit+0x336/0x362
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff887abc90>] :ptlrpc:ptlrpc_main+0x0/0x1420
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel:
> Jul 6 11:54:05 cs04r-sc-mds01-01-10ge kernel: LustreError: dumping log to /tmp/lustre-log.1309949645.4037

Kind regards,
Frederik

--
Frederik Ferner
Computer Systems Administrator
phone: +44 1235 77 8624
Diamond Light Source Ltd.
mob: +44 7917 08 5110
