Greetings, I have a Lustre OSS with twelve (0-11) OSTs. Every once in a while the OSS hosting the OSTs fails with a kernel panic. The system runs CentOS 5.1 using Lustre kernel 2.6.18-53.1.13.el5_lustre.1.6.4.3smp. There are no error messages in the /var/log/messages file for March 4 prior to the messages printed below. The last line in the /var/log/messages file before them was a routine stamp from March 2.
How do I understand the "lock callback timer expired" message below? After the dump the system shows "kernel panic" on the console and requires a manual reboot. Any tips and insight greatly appreciated.

megan

------------------------------------------------------------------------

Mar 4 09:42:57 oss4 kernel: LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 9e1d3bc1-201b-4d0b-cc1a-2d52d619c...@net_0x50000c0a840d6_uuid nid 192.168.64....@o2ib ns: filter-crew8-OST0004_UUID lock: ffff81039e9bdd80/0x99e7393d0850f39f lrc: 2/0,0 mode: PR/PR res: 1267155/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0x8a3b31c963e264e8 expref: 830 pid: 4250
Mar 4 09:42:59 oss4 kernel: LustreError: 4989:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) r...@ffff8102fdae7400 x4386790/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
Mar 4 09:42:59 oss4 kernel: LustreError: 4989:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 3 previous similar messages
Mar 4 09:43:47 oss4 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 4206: it was inactive for 100s
Mar 4 09:43:47 oss4 kernel: Lustre: 0:0:(linux-debug.c:168:libcfs_debug_dumpstack()) showing stack for process 4206
Mar 4 09:43:47 oss4 kernel: ll_ost_66 D 0000000000000580 0 4206 1 4207 4205 (L-TLB)
Mar 4 09:43:47 oss4 kernel: ffff810424c99618 0000000000000046 0000000000000001 0000000000000080
Mar 4 09:43:47 oss4 kernel: 000000000000000a ffff810425134860 ffffffff802dcae0 0003a25c7b1d51cd
Mar 4 09:43:47 oss4 kernel: 000000000000052d ffff810425134a48 ffffffff00000000 ffff81042ed6f7e0
Mar 4 09:43:47 oss4 kernel: Call Trace:
Mar 4 09:43:47 oss4 kernel: [<ffffffff80061bb1>] __mutex_lock_slowpath+0x55/0x90
Mar 4 09:43:47 oss4 kernel: [<ffffffff80061bf1>] .text.lock.mutex+0x5/0x14
Mar 4 09:43:47 oss4 kernel: [<ffffffff8002d201>] shrink_icache_memory+0x40/0x1e6
Mar 4 09:43:47 oss4 kernel: [<ffffffff8003e778>] shrink_slab+0xdc/0x153
Mar 4 09:43:47 oss4 kernel: [<ffffffff800c2cd7>] try_to_free_pages+0x189/0x275
Mar 4 09:43:47 oss4 kernel: [<ffffffff8000efd1>] __alloc_pages+0x1a8/0x2ab
Mar 4 09:43:47 oss4 kernel: [<ffffffff80017026>] cache_grow+0x137/0x395
(...etc to end of kernel panic dump)
Mar 4 09:43:48 oss4 kernel: LustreError: dumping log to /tmp/lustre-log.1236177827.4206

A "lctl dl" after rebooting the computer:

[r...@oss4 log]# lctl
lctl > dl
  0 UP mgc mgc192.168.64....@o2ib 5df96fa8-528f-de53-c2e1-d4db598b057d 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter crew8-OST0000 crew8-OST0000_UUID 11
  3 UP obdfilter crew8-OST0001 crew8-OST0001_UUID 11
  4 UP obdfilter crew8-OST0002 crew8-OST0002_UUID 11
  5 UP obdfilter crew8-OST0003 crew8-OST0003_UUID 11
  6 UP obdfilter crew8-OST0004 crew8-OST0004_UUID 11
  7 UP obdfilter crew8-OST0005 crew8-OST0005_UUID 11
  8 UP obdfilter crew8-OST0006 crew8-OST0006_UUID 11
  9 UP obdfilter crew8-OST0007 crew8-OST0007_UUID 11
 10 UP obdfilter crew8-OST0008 crew8-OST0008_UUID 11
 11 UP obdfilter crew8-OST0009 crew8-OST0009_UUID 11
 12 UP obdfilter crew8-OST000a crew8-OST000a_UUID 11
 13 UP obdfilter crew8-OST000b crew8-OST000b_UUID 11

The computer comes up normally without errors.
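
As a reading aid for the "lock callback timer expired" line in the log, here is a minimal Python sketch that splits such a line into labeled fields (evicted client, NID, namespace, lock mode, resource, extent type, pid). The regular expression and the field labels are assumptions made for illustration, based only on the single line posted here; they are not taken from Lustre documentation, and other Lustre versions may format the message differently. The sketch only separates the fields and does not say anything about why the eviction or the panic happened.

```python
import re

# Sample line copied from the posted log. Identifiers that were truncated
# in the post ("...") are left exactly as they appear there.
SAMPLE = (
    "Mar 4 09:42:57 oss4 kernel: LustreError: "
    "0:0:(ldlm_lockd.c:210:waiting_locks_callback()) "
    "### lock callback timer expired: evicting client "
    "9e1d3bc1-201b-4d0b-cc1a-2d52d619c...@net_0x50000c0a840d6_uuid "
    "nid 192.168.64....@o2ib ns: filter-crew8-OST0004_UUID "
    "lock: ffff81039e9bdd80/0x99e7393d0850f39f lrc: 2/0,0 mode: PR/PR "
    "res: 1267155/0 rrc: 3 type: EXT [0->18446744073709551615] "
    "(req 0->18446744073709551615) flags: 20 "
    "remote: 0x8a3b31c963e264e8 expref: 830 pid: 4250"
)

# Hypothetical pattern: the group names are illustrative labels,
# not official Lustre terminology.
PATTERN = re.compile(
    r"lock callback timer expired: evicting client (?P<client>\S+) "
    r"nid (?P<nid>\S+) ns: (?P<ns>\S+) .*?mode: (?P<mode>\S+) "
    r"res: (?P<res>\S+) .*?type: (?P<type>\S+) .*?pid: (?P<pid>\d+)"
)


def parse_eviction(line):
    """Return a dict of fields from an eviction message, or None if no match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    fields = parse_eviction(SAMPLE)
    for name, value in fields.items():
        print(f"{name:>6}: {value}")
```

Run over /var/log/messages on the OSS, a loop calling parse_eviction() on each line would show which client NIDs and which OST namespaces appear in evictions, which may help correlate them with the times of the panics.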
