On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
> On 2010-01-04, at 03:02, David Cohen wrote:
> > I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a
> > problem with qlogic drivers and rolled back to 1.6.6).
> > My MDS get unresponsive each day at 4-5 am local time, no kernel
> > panic or error messages before.

It was indeed the *locate update: after a simple edit of /etc/updatedb.conf
on the clients, the system is stable again.
Many thanks.

> Judging by the time, I'd guess this is "slocate" or "mlocate" running
> on all of your clients at the same time. This used to be a source of
> extremely high load back in the old days, but I thought that Lustre
> was in the exclude list in newer versions of *locate. Looking at the
> installed mlocate on my system, that doesn't seem to be the case...
> strange.
>
> > Some errors and an LBUG appear in the log after force booting the
> > MDS and mounting the MDT and then the log is clear until next morning:
> >
> > Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
> > Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
> > Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
> > Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 1 6340 (L-TLB)
> > Jan 4 06:33:31 tech-mds kernel: Call Trace:
> > Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
> > Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
> > Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
> > Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
> > Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
> > Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
> > Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
> > Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
>
> It shouldn't LBUG during recovery, however.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
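
The message above only says that /etc/updatedb.conf was edited on the clients; the exact change is not given. One plausible form of it, assuming mlocate-style syntax, is to add the "lustre" filesystem type to the PRUNEFS list so that updatedb skips Lustre mounts. The Python sketch below illustrates that under those assumptions. The helper name add_prunefs is hypothetical, and the sample file contents in the usage note are made up for illustration.

```python
import re

# Matches a line such as:  PRUNEFS = "nfs proc"
# (mlocate-style syntax is an assumption; check the local updatedb.conf)
_PRUNEFS_RE = re.compile(r'^([ \t]*PRUNEFS[ \t]*=[ \t]*")([^"]*)(")', re.MULTILINE)


def add_prunefs(conf_text: str, fstype: str = "lustre") -> str:
    """Return conf_text with fstype present in the PRUNEFS list.

    Hypothetical helper: it only transforms text and never touches
    /etc/updatedb.conf itself.
    """
    match = _PRUNEFS_RE.search(conf_text)
    if match is None:
        # No PRUNEFS line found: append one at the end of the file.
        sep = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
        return f'{conf_text}{sep}PRUNEFS = "{fstype}"\n'

    entries = match.group(2).split()
    # Compare case-insensitively so "Lustre" and "lustre" are not duplicated.
    if fstype.lower() in (entry.lower() for entry in entries):
        return conf_text

    entries.append(fstype)
    joined = " ".join(entries)
    new_line = f"{match.group(1)}{joined}{match.group(3)}"
    return conf_text[:match.start()] + new_line + conf_text[match.end():]
```

For example, passing the text 'PRUNEFS = "nfs proc"' through add_prunefs yields 'PRUNEFS = "nfs proc lustre"', and running it a second time leaves the text unchanged. Excluding the Lustre mount point via PRUNEPATHS would be an alternative; which one fits depends on the site.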

--
David Cohen
Grid Computing
Physics Department
Technion - Israel Institute of Technology