For the record:

This was triggered by one job script that used a working directory in a directory tree under MDT0 while redirecting the stderr of gzip commands to a directory in a tree under MDT1.
Once the user moved everything into one tree or the other, the problem
disappeared.
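For illustration, the triggering pattern looked roughly like the sketch below. The real paths were subtrees served by the two different MDTs; here two throwaway directories stand in for them, and all names are made up:

```shell
# Stand-ins for the two directory trees; on the real system these were
# subtrees served by MDT0 and MDT1 respectively (paths hypothetical).
MDT0_TREE=$(mktemp -d)
MDT1_TREE=$(mktemp -d)
mkdir -p "$MDT0_TREE/job" "$MDT1_TREE/logs"

# The job ran with its working directory in the MDT0 tree...
cd "$MDT0_TREE/job"
printf 'some data\n' > data.txt

# ...but sent gzip's stderr into a file in the MDT1 tree,
# so a single command touched both MDTs.
gzip data.txt 2> "$MDT1_TREE/logs/gzip.err"
```

Keeping both the working directory and the stderr target within one tree avoided the cross-MDT operation and made the crashes stop.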

Thomas

On 03/12/2018 09:54 AM, Thomas Roth wrote:
Hi all,

our production system running Lustre 2.5.3 has broken down, and I'm quite 
clueless.

The second of our two MDTs crashed, and after reboot and recovery it LBUGs again with:


Mar 11 20:02:37 lxmds15 kernel: Lustre: nyx-MDT0001: Recovery over after 1:36, of 720 clients 720 recovered and 0 were evicted.

Mar 11 20:02:37 lxmds15 kernel: LustreError: 6705:0:(osp_precreate.c:719:osp_precreate_cleanup_orphans()) nyx-OST0001-osc-MDT0001: cannot cleanup orphans: rc = -108

Mar 11 20:02:37 lxmds15 kernel: LustreError: 6705:0:(osp_precreate.c:719:osp_precreate_cleanup_orphans()) Skipped 74 previous similar messages

Mar 11 20:02:37 lxmds15 kernel: LustreError: 6574:0:(mdt_handler.c:2706:mdt_object_lock0()) ASSERTION( !(ibits & (MDS_INODELOCK_UPDATE | MDS_INODELOCK_PERM)) ) failed: nyx-MDT0001: wrong bit 0x2 for remote obj [0x5100027c70:0x17484:0x0]

Mar 11 20:02:37 lxmds15 kernel: LustreError: 6574:0:(mdt_handler.c:2706:mdt_object_lock0()) LBUG



This looks like LU-6071, but I am wondering what actually triggers it; as far as I know, no client should currently be attempting to create a directory on the second MDT.


After running e2fsck on the MDT, it mounts and then crashes with a different FID each time. (If mounted without the fsck, the crashing FID stays the same.)


Is there any way we can find out more about the cause?

If the trouble comes from a finite number of inodes, is there a trick to
manipulate or clear them?
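For what it's worth, mapping the FID from the assertion message back to a pathname might narrow things down. Lustre provides lfs fid2path for this; the mount point below is an assumed example:

```shell
# Resolve the FID from the LBUG to a pathname on a client.
# /lustre/nyx is a hypothetical client mount point for the nyx filesystem.
lfs fid2path /lustre/nyx [0x5100027c70:0x17484:0x0]
```

This requires a mounted client and only works while the MDT stays up, so it may not help if the crash follows immediately after recovery.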


Regards,
Thomas


_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org