For the record:
This was triggered by one job script using a working directory in a directory tree under MDT0 while
redirecting the stderr of gzip commands to a directory in a tree under MDT1.
Once the user moved everything into one tree or the other, the problem
disappeared.
Thomas
On 03/12/2018 09:54 AM, Thomas Roth wrote:
Hi all,
our production system running Lustre 2.5.3 has broken down, and I'm quite
clueless.
The second (of two) MDTs crashed and after reboot + recovery LBUGs again with:
Mar 11 20:02:37 lxmds15 kernel: Lustre: nyx-MDT0001: Recovery over after 1:36, of 720 clients 720 recovered and 0 were evicted.
Mar 11 20:02:37 lxmds15 kernel: LustreError: 6705:0:(osp_precreate.c:719:osp_precreate_cleanup_orphans()) nyx-OST0001-osc-MDT0001: cannot cleanup orphans: rc = -108
Mar 11 20:02:37 lxmds15 kernel: LustreError: 6705:0:(osp_precreate.c:719:osp_precreate_cleanup_orphans()) Skipped 74 previous similar messages
Mar 11 20:02:37 lxmds15 kernel: LustreError: 6574:0:(mdt_handler.c:2706:mdt_object_lock0()) ASSERTION( !(ibits & (MDS_INODELOCK_UPDATE | MDS_INODELOCK_PERM)) ) failed: nyx-MDT0001: wrong bit 0x2 for remote obj [0x5100027c70:0x17484:0x0]
Mar 11 20:02:37 lxmds15 kernel: LustreError: 6574:0:(mdt_handler.c:2706:mdt_object_lock0()) LBUG
This seems to be LU-6071, but I am wondering what actually triggers it - there should be no ongoing
attempts from any client to create a directory on the second MDT.
After running e2fsck on the MDT, it mounts and then crashes with a different FID each time. (If
mounted without running fsck first, the crashing FID remains the same.)
Is there any way we can find out more about the cause?
If the problem is limited to a finite set of troublesome inodes, is there a trick to
identify and manipulate/clear them?
Regards,
Thomas
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org