We had an LBUG on our MDS (on 15th Feb) and so attempted a failover to
the 2nd MGS/MDS server. This mounted the MGT fine but hung while
mounting the MDT (longer than 5 minutes).
To resolve the problem I unmounted the MGT and the MDT on a freshly
booted MDS/MGS and mounted the MDT as ldiskfs. I then moved aside the
CATALOGS, OBJECTS and last_rcvd files/dirs, unmounted the MDT and
restarted Lustre (mount -t lustre ....).
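
The move-aside step described above can be sketched roughly as follows.
This is only an illustration: the device path, the mount point and the
".moved" suffix are placeholders, not the values used on the real
system, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: MDT_DEV, MDT_MNT and SUFFIX are hypothetical placeholders.
MDT_DEV="${MDT_DEV:-/dev/MDT_DEVICE}"
MDT_MNT="${MDT_MNT:-/mnt/mdt_ldiskfs}"
SUFFIX="${SUFFIX:-.moved}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

move_aside_recovery_files() {
    # Mount the MDT directly as ldiskfs (it must not be mounted as Lustre).
    run mount -t ldiskfs "$MDT_DEV" "$MDT_MNT"
    # Rename rather than delete, so the originals stay available.
    for f in CATALOGS OBJECTS last_rcvd; do
        run mv "$MDT_MNT/$f" "$MDT_MNT/$f$SUFFIX"
    done
    run umount "$MDT_MNT"
    # Afterwards the target is mounted again as Lustre (mount -t lustre ....).
}

move_aside_recovery_files
```

Renaming instead of deleting is what keeps the original files available
for later inspection.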
This brought the file system back OK, but one of our scientists appears
to have lost an entire directory of data from the time the file system
was taken down. The MDS was initially taken out at 1400 (16 Feb) and the
file system was fully back around 1500. The directory still contains the
scientist's files from 1400 onwards.
Approximately 4000 small files dating from the start of January are
missing. We are running Lustre 1.6.6 with a patched kernel
2.6.18-92.1.10.el5 on the servers; the client is running an unreleased
patchless RH kernel 2.6.18-171.el5 and 1.6.7.2 Lustre modules.
We should have good backups of our metadata and we also have access to
the removed ldiskfs files, which were simply renamed. The missing files
have fairly predictable names, which might help in tracking down the
content.
Is there any hope of recovering the missing files/directory?
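
Because the names are predictable, one way to check whether the entries
still exist anywhere is to search a directory tree by name, for example
a restored copy of the metadata backup. A minimal sketch follows; the
search root and the glob pattern in the example are placeholders, not
real paths or file names from this system.

```shell
#!/bin/sh
# Sketch only: the search root and name pattern are hypothetical.

# List regular files under a tree whose names match a shell glob.
find_by_name() {
    root="$1"
    pattern="$2"
    find "$root" -type f -name "$pattern" -print
}

# Example (placeholder values):
#   find_by_name /restore/mdt_backup 'scan_*.dat'
```

A match would only show that the name is still recorded in that tree; it
says nothing about whether the file contents are recoverable.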
GREG
--
Greg Matthews 01235 778658
Senior Computer Systems Administrator
Diamond Light Source, Oxfordshire, UK
Feb 15 09:43:16 cs04r-sc-mds01-01 kernel: Lustre:
15622:0:(ldlm_lib.c:538:target_handle_reconnect()) lustre01-MDT0000:
0395500b-8976-1c4a-3ae6-893f3704cbd2 reconnecting
Feb 15 09:43:16 cs04r-sc-mds01-01 kernel: Lustre:
15622:0:(ldlm_lib.c:773:target_handle_connect()) lustre01-MDT0000: refuse
reconnection from [email protected]@tcp to
0xffff8103a8780000; still busy with 2 active RPCs
Feb 15 09:43:16 cs04r-sc-mds01-01 kernel: LustreError:
15622:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)
r...@ffff81042408ae00 x1324926638566767/t0
o38->0395500b-8976-1c4a-3ae6-893f3704c...@net_0x20000ac177ccd_uuid:0/0 lens
368/200 e 0 to 0 dl 1266227096 ref 1 fl Interpret:/0/0 rc -16/0
Feb 15 09:43:16 cs04r-sc-mds01-01 kernel: LustreError:
15622:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 5 previous similar
messages
Feb 15 09:43:23 cs04r-sc-mds01-01 kernel: Lustre:
29164:0:(ldlm_lib.c:538:target_handle_reconnect()) lustre01-MDT0000:
0395500b-8976-1c4a-3ae6-893f3704cbd2 reconnecting
Feb 15 09:43:23 cs04r-sc-mds01-01 kernel: Lustre:
29164:0:(ldlm_lib.c:773:target_handle_connect()) lustre01-MDT0000: refuse
reconnection from [email protected]@tcp to
0xffff8103a8780000; still busy with 2 active RPCs
Feb 15 09:43:23 cs04r-sc-mds01-01 kernel: Lustre:
15690:0:(service.c:1088:ptlrpc_server_handle_request()) @@@ Request
x1324926638566755 took longer than estimated (8+8s); client may timeout.
r...@ffff810424a4cc00 x1324926638566755/t3070769330
o101->0395500b-8976-1c4a-3ae6-893f3704c...@net_0x20000ac177ccd_uuid:0/0 lens
976/608 e 0 to 0 dl 1266226995 ref 1 fl Complete:/0/0 rc 301/301
Feb 15 09:43:31 cs04r-sc-mds01-01 kernel: Lustre:
28857:0:(ldlm_lib.c:538:target_handle_reconnect()) lustre01-MDT0000:
0395500b-8976-1c4a-3ae6-893f3704cbd2 reconnecting
Feb 15 09:44:45 cs04r-sc-mds01-01 kernel: Lustre:
15690:0:(ldlm_lib.c:538:target_handle_reconnect()) lustre01-MDT0000:
79d8374e-19a0-66b9-8cb1-a059d55cd9f9 reconnecting
Feb 15 09:44:45 cs04r-sc-mds01-01 kernel: Lustre:
15690:0:(ldlm_lib.c:773:target_handle_connect()) lustre01-MDT0000: refuse
reconnection from [email protected]@tcp to
0xffff8103874bc000; still busy with 2 active RPCs
Feb 15 09:44:45 cs04r-sc-mds01-01 kernel: LustreError:
15690:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-16)
r...@ffff810213753a00 x16107810/t0
o38->79d8374e-19a0-66b9-8cb1-a059d55cd...@net_0x20000ac17622d_uuid:0/0 lens
304/200 e 0 to 0 dl 1266227185 ref 1 fl Interpret:/0/0 rc -16/0
Feb 15 09:44:45 cs04r-sc-mds01-01 kernel: LustreError:
15690:0:(ldlm_lib.c:1619:target_send_reply_msg()) Skipped 1 previous similar
message
Feb 15 09:45:10 cs04r-sc-mds01-01 kernel: Lustre:
26895:0:(ldlm_lib.c:538:target_handle_reconnect()) lustre01-MDT0000:
79d8374e-19a0-66b9-8cb1-a059d55cd9f9 reconnecting
Feb 15 09:45:10 cs04r-sc-mds01-01 kernel: Lustre:
26895:0:(ldlm_lib.c:773:target_handle_connect()) lustre01-MDT0000: refuse
reconnection from [email protected]@tcp to
0xffff8103874bc000; still busy with 2 active RPCs
Feb 15 09:45:18 cs04r-sc-mds01-01 kernel: Lustre:
26885:0:(service.c:1088:ptlrpc_server_handle_request()) @@@ Request x16106827
took longer than estimated (100+33s); client may timeout.
r...@ffff8101f5f8e800 x16106827/t3070786830
o101->79d8374e-19a0-66b9-8cb1-a059d55cd...@net_0x20000ac17622d_uuid:0/0 lens
512/536 e 0 to 0 dl 1266227085 ref 1 fl Complete:/0/0 rc 0/0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: Lustre:
29148:0:(ldlm_lib.c:538:target_handle_reconnect()) lustre01-MDT0000:
79d8374e-19a0-66b9-8cb1-a059d55cd9f9 reconnecting
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: Lustre: Child 108536598/2706510079
lookup error -13. Evicting client 79d8374e-19a0-66b9-8cb1-a059d55cd9f9 with
export 172.23.98...@tcp.
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: LustreError:
15884:0:(handler.c:1590:mds_handle()) operation 35 on unconnected MDS from
12345-172.23.98...@tcp
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: LustreError:
27093:0:(handler.c:2590:mds_intent_policy()) ASSERTION(new_lock != NULL)
failed:op 0x1 lockh 0x9ad0e27cd64f91f4
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: LustreError:
27093:0:(handler.c:2590:mds_intent_policy()) LBUG
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: Lustre:
27093:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process
27093
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: ll_mdt_104 R running task 0
27093 1 27132 27092 (L-TLB)
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: ffff8104258b7430 0000000000000046
0000000000000046 ffff8104258b7490
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: 0000000000000000 0000000000000009
ffff8103a1ef07e0 ffff810429843860
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: 00147a26e5df7873 000000000002e2f2
ffff8103a1ef09d0 00000007472e8140
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: Call Trace:
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff800d3a89>]
kmem_freepages+0xe6/0x110
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8001a0ba>]
vsnprintf+0x559/0x59e
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff80019e9c>]
vsnprintf+0x33b/0x59e
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff88697548>]
:libcfs:libcfs_debug_vmsg2+0x6c8/0x970
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff801437a4>]
__next_cpu+0x19/0x28
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff800756b4>]
smp_send_reschedule+0x4e/0x53
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff800468bf>]
try_to_wake_up+0x407/0x418
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8008f793>]
__call_console_drivers+0x5b/0x69
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8008f793>]
__call_console_drivers+0x5b/0x69
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff80016c91>]
release_console_sem+0x1ba/0x20e
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8008ffa3>] printk+0x52/0xbd
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff88697548>]
:libcfs:libcfs_debug_vmsg2+0x6c8/0x970
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff88697548>]
:libcfs:libcfs_debug_vmsg2+0x6c8/0x970
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff800a57c2>]
kallsyms_lookup+0x18a/0x1ae
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8006b7f1>]
printk_address+0x9f/0xab
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff800a3281>]
module_text_address+0x33/0x3c
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8009c526>]
kernel_text_address+0x1a/0x26
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8006b4d7>]
dump_trace+0x211/0x23a
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8006b534>]
show_trace+0x34/0x47
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8006b639>]
_show_stack+0xdb/0xea
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8868fc2a>]
:libcfs:lbug_with_loc+0x7a/0xc0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff88a86dd5>]
:mds:mds_intent_policy+0x8e5/0xc30
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff886dd50c>]
:lnet:LNetMDBind+0x2ac/0x400
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887a6156>]
:ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887a3916>]
:ptlrpc:ldlm_lock_enqueue+0x186/0x990
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887a073d>]
:ptlrpc:ldlm_lock_create+0x9ad/0x9e0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887c54d0>]
:ptlrpc:ldlm_server_completion_ast+0x0/0x5c0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887c2de5>]
:ptlrpc:ldlm_handle_enqueue+0xca5/0x12a0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887b0833>]
:ptlrpc:target_send_reply+0x3b3/0x3f0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887c5a90>]
:ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff88a8b155>]
:mds:mds_handle+0x4035/0x4cf0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff801437a4>]
__next_cpu+0x19/0x28
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff801437a4>]
__next_cpu+0x19/0x28
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff80089aa8>]
find_busiest_group+0x20d/0x621
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8873e031>]
:obdclass:class_handle2object+0xd1/0x160
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887dd705>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887e70da>]
:ptlrpc:ptlrpc_check_req+0x1a/0x110
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887e92c2>]
:ptlrpc:ptlrpc_server_handle_request+0x992/0x1040
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff80062f4b>]
thread_return+0x0/0xdf
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8006d940>]
do_gettimeofday+0x50/0x92
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff88698476>]
:libcfs:lcw_update_time+0x16/0x100
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff800893bb>]
__wake_up_common+0x3e/0x68
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887ec22c>]
:ptlrpc:ptlrpc_main+0xe0c/0xf90
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8008ad7e>]
default_wake_function+0x0/0xe
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff800b4610>]
audit_syscall_exit+0x31b/0x336
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8005dfb1>]
child_rip+0xa/0x11
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff887eb420>]
:ptlrpc:ptlrpc_main+0x0/0xf90
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: [<ffffffff8005dfa7>]
child_rip+0x0/0x11
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel:
Feb 15 09:45:35 cs04r-sc-mds01-01 kernel: LustreError: dumping log to
/tmp/lustre-log.1266227135.27093
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: Lustre: 0:0:(watchdog.c:148:lcw_cb())
Watchdog triggered for pid 27093: it was inactive for 200s
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: Lustre:
0:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 27093
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: ll_mdt_104 D ffff81026a166c00
0 27093 1 27132 27092 (L-TLB)
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: ffff8104258b7a10 0000000000000046
0000000000000000 ffffffff80450560
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: ffff8104258b79d0 000000000000000a
ffff8103a1ef07e0 ffff8102472e9040
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: 00147a26f0f9c54c 000000000000263c
ffff8103a1ef09c8 0000000700000a1e
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: Call Trace:
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff8008ad7e>]
default_wake_function+0x0/0xe
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff8868fc6b>]
:libcfs:lbug_with_loc+0xbb/0xc0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff88a86dd5>]
:mds:mds_intent_policy+0x8e5/0xc30
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff886dd50c>]
:lnet:LNetMDBind+0x2ac/0x400
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887a6156>]
:ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887a3916>]
:ptlrpc:ldlm_lock_enqueue+0x186/0x990
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887a073d>]
:ptlrpc:ldlm_lock_create+0x9ad/0x9e0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887c54d0>]
:ptlrpc:ldlm_server_completion_ast+0x0/0x5c0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887c2de5>]
:ptlrpc:ldlm_handle_enqueue+0xca5/0x12a0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887b0833>]
:ptlrpc:target_send_reply+0x3b3/0x3f0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887c5a90>]
:ptlrpc:ldlm_server_blocking_ast+0x0/0x6b0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff88a8b155>]
:mds:mds_handle+0x4035/0x4cf0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff801437a4>]
__next_cpu+0x19/0x28
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff801437a4>]
__next_cpu+0x19/0x28
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff80089aa8>]
find_busiest_group+0x20d/0x621
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff8873e031>]
:obdclass:class_handle2object+0xd1/0x160
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887dd705>]
:ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887e70da>]
:ptlrpc:ptlrpc_check_req+0x1a/0x110
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887e92c2>]
:ptlrpc:ptlrpc_server_handle_request+0x992/0x1040
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff80062f4b>]
thread_return+0x0/0xdf
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff8006d940>]
do_gettimeofday+0x50/0x92
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff88698476>]
:libcfs:lcw_update_time+0x16/0x100
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff800893bb>]
__wake_up_common+0x3e/0x68
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887ec22c>]
:ptlrpc:ptlrpc_main+0xe0c/0xf90
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff8008ad7e>]
default_wake_function+0x0/0xe
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff800b4610>]
audit_syscall_exit+0x31b/0x336
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff8005dfb1>]
child_rip+0xa/0x11
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff887eb420>]
:ptlrpc:ptlrpc_main+0x0/0xf90
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: [<ffffffff8005dfa7>]
child_rip+0x0/0x11
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel:
Feb 15 09:48:55 cs04r-sc-mds01-01 kernel: LustreError: dumping log to
/tmp/lustre-log.1266227335.27093