We're seeing problems which seem to be lock-related and result in data loss. This is a fairly low-probability event: it's happening on a ~1000-core compute farm which is heavily loaded, and the frequency is of the order of tens of failures an hour.
The clients are running kernel 2.6.22 and Lustre 1.6.6 from Sun. The servers are HP SFS, running Lustre 1.4.11. The mixed versions are deliberate: we're in the process of adding servers running 1.6.6, but none are currently in use; all servers are 1.4.11. 1.4.11 clients don't see problems, and there are some old 1.4.11 clients left still using these filesystems at the same time as the new 1.6.6 ones.

On the server (OST), an event typically looks like this:

2009/02/09 14:05:10 kern i sf2-sfs33: LustreError: 0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer expired: evicting client 4cdd29e6-8b9e-7e81-5d72-2a055f0e2...@net_0x20000ac1f6eee_uuid nid 172.31.110....@tcp ns: filter-sf2-sfs-ost495_UUID lock: 0000010048ae0100/0x5fd2cf129ee3e5bb lrc: 1/0 , 0 mode: PW/PW res: 1542868/0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0xc8fbbbbcb3ab0d8d expref: 386 pid: 16626
2009/02/09 14:05:30 kern i sf2-sfs33: LustreError: 16577:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error (-107) r...@0000010073bbce00 x2138818/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret:/0/0 rc -107/0

and on the client at the same time:

Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: 11-0: an error occurred while communicating with 172.31.96...@tcp. The obd_ping operation failed with -107
Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: Skipped 12 previous similar messages
Feb 9 14:05:30 sf-2-3-10 kernel: Lustre: OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost495_MNT_client_tcp-ffff81021f897800: Connection to service sf2-sfs-ost495 via nid 172.31.96...@tcp was lost; in progress operations using this service will wait for recovery to complete.
Feb 9 14:05:30 sf-2-3-10 kernel: Lustre: Skipped 4 previous similar messages
Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: 167-0: This client was evicted by sf2-sfs-ost495; in progress operations using this service will fail.

It gets worse: the same client a few minutes later saw this on another OST, same filesystem:
2009/02/09 14:08:30 kern i sf2-sfs34: LustreError: 0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer expired: evicting client 4cdd29e6-8b9e-7e81-5d72-2a055f0e2...@net_0x20000ac1f6eee_uuid nid 172.31.110....@tcp ns: filter-sf2-sfs-ost496_UUID lock: 000001006b8bfa40/0x900da6275f064529 lrc: 1/0 , 0 mode: PW/PW res: 1542328/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0xc8fbbbbcb3ab1478 expref: 172 pid: 16089
2009/02/09 14:08:30 kern i sf2-sfs33: LustreError: 0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer expired: evicting client f3275d1e-1360-e139-e71c-44b283d7e...@net_0x20000ac1f665d_uuid nid 172.31.102...@tcp ns: filter-sf2-sfs-ost495_UUID lock: 000001005a810040/0x5fd2cf129ee3e9f1 lrc: 1/0 , 0 mode: PW/PW res: 1542868/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0xa899b6e20a06a83e expref: 316 pid: 16454
2009/02/09 14:08:30 kern i sf2-sfs33: LustreError: 16335:0:(ldlm_lockd.c:606:ldlm_server_completion_ast()) ### enqueue wait took 399491082us from 1234188111 ns: filter-sf2-sfs-ost495_UUID lock: 0000010064a638c0/0x5fd2cf129ee3ea68 lrc: 2/0 , 0 mode: PW/PW res: 1542868/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0xc8423e042b195e36 expref: 501 pid: 16493
2009/02/09 14:08:32 storage w sf2-sfs1 : Warning: event log (/var/hpls/evlog/eventlog) size is 77760824
2009/02/09 14:08:50 kern i sf2-sfs34: LustreError: 16253:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error (-107) r...@00000100b3ea4a00 x2150055/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret:/0/0 rc -107/0
2009/02/09 14:09:02 kern i sf2-sfs33: LustreError: 16566:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error (-107) r...@000001006b7b3200 x5180156/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret:/0/0 rc -107/0

whilst the client suffered an assertion failure and LBUG:

Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 11-0: an error occurred while communicating with 172.31.96...@tcp. The obd_ping operation failed with -107
Feb 9 14:08:50 sf-2-3-10 kernel: Lustre: OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost496_MNT_client_tcp-ffff81021f897800: Connection to service sf2-sfs-ost496 via nid 172.31.96...@tcp was lost; in progress operations using this service will wait for recovery to complete.
Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 167-0: This client was evicted by sf2-sfs-ost496; in progress operations using this service will fail.
Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 29564:0:(import.c:241:ptlrpc_invalidate_import()) sf2-sfs-ost495_UUID: rc = -110 waiting for callback (1 != 0)
Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 29564:0:(import.c:252:ptlrpc_invalidate_import()) ASSERTION(atomic_read(&imp->imp_inflight) == 0) failed
Feb 9 14:08:50 sf-2-3-10 kernel: Lustre: OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost496_MNT_client_tcp-ffff81021f897800: Connection restored to service sf2-sfs-ost496 using nid 172.31.96...@tcp.
Feb 9 14:08:50 sf-2-3-10 kernel: Lustre: Skipped 4 previous similar messages
Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 29564:0:(tracefile.c:450:libcfs_assertion_failed()) LBUG
Feb 9 14:08:50 sf-2-3-10 kernel: Lustre: 29564:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 29564
Feb 9 14:08:50 sf-2-3-10 kernel: ll_imp_inval R running task 0 29564 2 (L-TLB)
Feb 9 14:08:50 sf-2-3-10 kernel: ffffffff804b1d28 ffffffff804b1d28 ffffffff804b1d28 0000000000000000
Feb 9 14:08:50 sf-2-3-10 kernel: ffffffff8856cf5f ffff8101a231fcc0 0000003000000030 ffff8101b8079e90
Feb 9 14:08:50 sf-2-3-10 kernel: ffff8101b8079d70 ffffffff8025d587 0000000000000286 ffffffff8025d5a9
Feb 9 14:08:50 sf-2-3-10 kernel: Call Trace:
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020b2f0>] show_trace+0x34/0x47
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020b3fc>] _show_stack+0xe2/0xf1
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff882f7b7c>] :libcfs:lbug_with_loc+0x6c/0xd0
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff882ffbc4>] :libcfs:libcfs_assertion_failed+0x54/0x60
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff88451785>] :ptlrpc:ptlrpc_invalidate_import+0x4d5/0x5c0
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8022c500>] default_wake_function+0x0/0xe
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff80292e8c>] dput+0x1c/0x10b
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff88453ce9>] :ptlrpc:ptlrpc_invalidate_import_thread+0x49/0x2e0
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020aba8>] child_rip+0xa/0x12
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff88453ca0>] :ptlrpc:ptlrpc_invalidate_import_thread+0x0/0x2e0
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12

Userspace file operations terminate with an IO error when these events happen. The failed operations seem to be writes to files which have been held open for a long time (spool files taking job output); a rough sketch of that access pattern is in the P.S. below. We are also seeing some userspace file operations fail with the error "No locks available". These don't generate any logging on the client, so I don't have exact timing; it's possible that they are associated with further "### lock callback timer expired" server logs.

OSTs have the Lustre timeout set to 200 seconds. System load is not excessive and no greater than has been supported for more than a year with 1.4 clients.

I'd be very happy to have any clues on fixing this.

Cheers,

Simon
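P.S. In case it helps, the failing jobs look roughly like the sketch below. The path, interval, and messages are made up (this is not our actual job code), but the pattern is the same: a spool file held open for hours with occasional appends, whose write() starts returning EIO, or in some cases ENOLCK ("No locks available"), after one of these evictions.

/* Hypothetical sketch of a long-lived spool-file writer; the path and
 * sleep interval are invented for illustration. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Spool file on the Lustre mount, held open for the life of the job. */
    int fd = open("/lustre/scratch/spool/job.out",
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char line[] = "job output line\n";
    for (;;) {
        if (write(fd, line, sizeof line - 1) < 0) {
            /* After an eviction we see EIO here; the other failure mode
             * is ENOLCK ("No locks available"). */
            fprintf(stderr, "write failed: %s (errno %d)\n",
                    strerror(errno), errno);
            break;
        }
        sleep(60);  /* file stays open between writes */
    }
    close(fd);
    return 0;
}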
