We are having a problem with an MDS server; the same box also hosts one OST.

When the server boots up, we notice an ll_mdt process running at 100% CPU, and we end up waiting close to 10-15 minutes even though we only have 8 clients. (I assume this is the normal recovery process.) However, if I manually mount the MDT without any recovery, everything is fine:

mount -t lustre /dev/foo -o abort_recov /mnt/lustre

BUT the server crashes again after 18-24 hours. I am trying to get to the bottom of this crash. I am not sure what's causing the problem, and hopefully I am just doing something foolish.

There are 2 OSTs connecting to this MDS.

MDS Server Version: Redhat 5.1-1.2
Running: 2.6.18-92.1.17.el5_lustre.1.6.7smp

cat /proc/fs/lustre/version
lustre: 1.6.7
kernel: patchless_client
build: 1.6.7-19691231170000-PRISTINE-.cache.build.BUILD.lustre-kernel-2.6.18.lustre.linux-2.6.18-92.1.17.el5_lustre.1.6.7smp

client# lfs check mds
lfs002-MDT0000-mdc-ffff81102ac40000 active.
lfs002-MDT0000-mdc-ffff810fd264bc00 active.

client# lfs check osts
lfs002-OST0000-osc-ffff81102ac40000 active.
lfs002-OST0001-osc-ffff81102ac40000 active.
lfs002-OST0000-osc-ffff810fd264bc00 active.
lfs002-OST0001-osc-ffff810fd264bc00 active.
lfs002-OST0002-osc-ffff810fd264bc00 active.
lfs002-OST0003-osc-ffff810fd264bc00 active.
lfs002-OST0004-osc-ffff810fd264bc00 active.
lfs002-OST0005-osc-ffff810fd264bc00 active.

mds# lctl dl
0 UP mgs MGS MGS 25
1 UP mgc mgc141.128.90....@tcp b6d875c0-6b30-5a2d-92d3-600ef3324c50 5
2 UP mdt MDS MDS_uuid 3
3 UP lov lfs002-mdtlov lfs002-mdtlov_UUID 4
4 UP mds lfs002-MDT0000 lfs002-MDT0000_UUID 21
5 UP osc lfs002-OST0000-osc lfs002-mdtlov_UUID 5
6 UP osc lfs002-OST0001-osc lfs002-mdtlov_UUID 5
7 UP ost OSS OSS_uuid 3
8 UP obdfilter lfs002-OST0001 lfs002-OST0001_UUID 23

The clients are running:
Redhat 5.2
2.6.18-92.1.10.el5

cat /proc/fs/lustre/version
lustre: 1.6.6
kernel: patchless
build: 1.6.6-19691231190000-PRISTINE-.usr.src.linux-2.6.18-92.1.10.el5

Mar 12 10:11:02 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:11:02 protected_host_01 kernel: RIP: 0010:[<ffffffff888ed8df>] [<ffffffff888ed8df>] :ldiskfs:do_split+0x3ef/0x560
Mar 12 10:11:02 protected_host_01 kernel: RSP: 0018:ffff8103d2a5f460 EFLAGS: 00000216
Mar 12 10:11:02 protected_host_01 kernel: RAX: 0000000000000000 RBX: 0000000000000080 RCX: 0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: RDX: 0000000000000080 RSI: ffff8103cd52177c RDI: ffff8103cd52176c
Mar 12 10:11:02 protected_host_01 kernel: RBP: ffffffff8000b071 R08: ffff8103cd5216ec R09: 00000000010a0014
Mar 12 10:11:02 protected_host_01 kernel: R10: 00007a6700000008 R11: 00007a672e767363 R12: 000000000064dc69
Mar 12 10:11:02 protected_host_01 kernel: R13: ffffffff80019496 R14: ffff81040ed0f4c0 R15: 0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: FS: 00002b7545c3b220(0000) GS:ffff81042fea79c0(0000) knlGS:0000000000000000
Mar 12 10:11:02 protected_host_01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 12 10:11:02 protected_host_01 kernel: CR2: 0000003d222c5cb0 CR3: 0000000000201000 CR4: 00000000000006e0
Mar 12 10:11:02 protected_host_01 kernel:
Mar 12 10:11:02 protected_host_01 kernel: Call Trace:
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff888ee3b5>] :ldiskfs:ldiskfs_add_entry+0x4f5/0x980
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff88034f74>] :jbd:journal_dirty_metadata+0x1b5/0x1e3
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff889a6840>] :mds:mds_get_parent_child_locked+0x750/0x8e0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff888eee56>] :ldiskfs:ldiskfs_add_nondir+0x26/0x90
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff888ef776>] :ldiskfs:ldiskfs_create+0xf6/0x140
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8896f412>] :fsfilt_ldiskfs:fsfilt_ldiskfs_start+0x562/0x630
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8003a075>] vfs_create+0xe6/0x158
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff889c7140>] :mds:mds_open+0x14b0/0x317e
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8002e15a>] __wake_up+0x38/0x4f
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8876c241>] :ksocklnd:ksocknal_queue_tx_locked+0x4f1/0x550
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8876d47f>] :ksocklnd:ksocknal_launch_packet+0x2df/0x3d0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff889a1f49>] :mds:mds_reint_rec+0x1d9/0x2b0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff889cad82>] :mds:mds_open_unpack+0x312/0x430
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff88994d4a>] :mds:mds_reint+0x35a/0x420
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff889934db>] :mds:fixup_handle_for_resent_req+0x25b/0x2c0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff88998dfc>] :mds:mds_intent_policy+0x48c/0xc30
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886ab526>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886a8d18>] :ptlrpc:ldlm_lock_enqueue+0x188/0x990
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886c36ff>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8862c688>] :obdclass:lustre_hash_add+0x208/0x2d0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886cc2a0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886ca3f9>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8899d615>] :mds:mds_handle+0x4075/0x4d30
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff800d40d5>] cache_flusharray+0x2f/0xa3
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886e65a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886eecfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886f0b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886f3103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff885967c6>] :libcfs:lcw_update_time+0x16/0x100
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886f65f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff886f53e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Mar 12 10:11:02 protected_host_01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Mar 12 10:11:02 protected_host_01 kernel:
Mar 12 10:17:06 protected_host_01 kernel: BUG: soft lockup - CPU#6 stuck for 10s! [ll_mdt_10:10375]
Mar 12 10:17:06 protected_host_01 kernel: CPU 6:
Mar 12 10:17:06 protected_host_01 kernel: Modules linked in: obdfilter(U) ost(U) mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) nfsd(U) exportfs(U) auth_rpcgss(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) autofs4(U) sunrpc(U) bonding(U) dm_round_robin(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sg(U) pata_acpi(U) lpfc(U) ide_cd(U) bnx2(U) e1000e(U) cdrom(U) shpchp(U) scsi_transport_fc(U) hpwdt(U) i5000_edac(U) edac_mc(U) pcspkr(U) serio_raw(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) ata_piix(U) sata_nv(U) libata(U) mptsas(U) scsi_transport_sas(U) mptspi(U) mptscsih(U) scsi_transport_spi(U) mptbase(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Mar 12 10:17:06 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:17:06 protected_host_01 kernel: RIP: 0010:[<ffffffff888ed8f0>] [<ffffffff888ed8f0>] :ldiskfs:do_split+0x400/0x560
Mar 12 10:17:06 protected_host_01 kernel: RSP: 0018:ffff8103d2a5f460 EFLAGS: 00000246
Mar 12 10:17:06 protected_host_01 kernel: RAX: 0000000000000000 RBX: 0000000000000080 RCX: 0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: RDX: 0000000000000080 RSI: ffff8103cd52177c RDI: ffff8103cd52176c
Mar 12 10:17:06 protected_host_01 kernel: RBP: ffffffff8000b071 R08: ffff8103cd5216ec R09: 00000000010a0014
Mar 12 10:17:06 protected_host_01 kernel: R10: 00007a6700000008 R11: 00007a672e767363 R12: 000000000064dc69
Mar 12 10:17:06 protected_host_01 kernel: R13: ffffffff80019496 R14: ffff81040ed0f4c0 R15: 0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: FS: 00002b7545c3b220(0000) GS:ffff81042fea79c0(0000) knlGS:0000000000000000
Mar 12 10:17:06 protected_host_01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 12 10:17:06 protected_host_01 kernel: CR2: 0000003d222c5cb0 CR3: 0000000000201000 CR4: 00000000000006e0
Mar 12 10:17:06 protected_host_01 kernel:
Mar 12 10:17:06 protected_host_01 kernel: Call Trace:
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff888ee3b5>] :ldiskfs:ldiskfs_add_entry+0x4f5/0x980
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff88034f74>] :jbd:journal_dirty_metadata+0x1b5/0x1e3
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff889a6840>] :mds:mds_get_parent_child_locked+0x750/0x8e0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff888eee56>] :ldiskfs:ldiskfs_add_nondir+0x26/0x90
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff888ef776>] :ldiskfs:ldiskfs_create+0xf6/0x140
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8896f412>] :fsfilt_ldiskfs:fsfilt_ldiskfs_start+0x562/0x630
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8003a075>] vfs_create+0xe6/0x158
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff889c7140>] :mds:mds_open+0x14b0/0x317e
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8002e15a>] __wake_up+0x38/0x4f
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8876c241>] :ksocklnd:ksocknal_queue_tx_locked+0x4f1/0x550
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8876d47f>] :ksocklnd:ksocknal_launch_packet+0x2df/0x3d0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff889a1f49>] :mds:mds_reint_rec+0x1d9/0x2b0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff889cad82>] :mds:mds_open_unpack+0x312/0x430
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff88994d4a>] :mds:mds_reint+0x35a/0x420
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff889934db>] :mds:fixup_handle_for_resent_req+0x25b/0x2c0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff88998dfc>] :mds:mds_intent_policy+0x48c/0xc30
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886ab526>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886a8d18>] :ptlrpc:ldlm_lock_enqueue+0x188/0x990
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886c36ff>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8862c688>] :obdclass:lustre_hash_add+0x208/0x2d0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886cc2a0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x833
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886ca3f9>] :ptlrpc:ldlm_handle_enqueue+0xc09/0x1200
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8899d615>] :mds:mds_handle+0x4075/0x4d30
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff800d40d5>] cache_flusharray+0x2f/0xa3
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff80143809>] __next_cpu+0x19/0x28
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff800898e3>] find_busiest_group+0x20d/0x621
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886e65a5>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886eecfa>] :ptlrpc:ptlrpc_server_request_get+0x6a/0x150
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886f0b7d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886f3103>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1150
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff80062f4b>] thread_return+0x0/0xdf
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8006d8a2>] do_gettimeofday+0x40/0x8f
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff885967c6>] :libcfs:lcw_update_time+0x16/0x100
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff800891f6>] __wake_up_common+0x3e/0x68
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886f65f8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff800b4382>] audit_syscall_exit+0x31b/0x336
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff886f53e0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Mar 12 10:17:06 protected_host_01 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11

Any thoughts?

TIA

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
