Hi Michael,

This looks like the problem we had some time ago after upgrading to 1.8.3:
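For reference, since you mention having e2fsck'd and re-initialized the external journals: recreating an external ldiskfs/ext3 journal from scratch usually amounts to something like the sketch below. The device names are placeholders, not your actual LUNs, and the OST must be unmounted (and failover resources stopped) first.

```shell
# Placeholders: /dev/mapper/ost0 = OST data device,
#               /dev/mapper/ost0_jrnl = its external journal partition.
# Run only with the OST unmounted and heartbeat resources stopped.

# Detach the old external journal from the OST filesystem
tune2fs -O ^has_journal /dev/mapper/ost0

# Recreate the journal device; the block size must match the OST filesystem
mke2fs -O journal_dev -b 4096 /dev/mapper/ost0_jrnl

# Reattach the new journal to the OST filesystem
tune2fs -j -J device=/dev/mapper/ost0_jrnl /dev/mapper/ost0

# Check the result, pointing e2fsck at the external journal, before remounting
e2fsck -f -j /dev/mapper/ost0_jrnl /dev/mapper/ost0
```

If the journal superblock itself is corrupt, only detaching and recreating it (rather than e2fsck alone) will get rid of it, so it may be worth confirming you did the full sequence rather than just the fsck.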
https://bugzilla.lustre.org/show_bug.cgi?id=22889

Best regards,
Wojciech

On 20 July 2010 00:00, Michael Sternberg <[email protected]> wrote:
> Hello,
>
> I use OSSs with external journal partitions, and since lustre-1.8.1 I get
> frustrating panics on the OSSs about once or twice a week, as follows:
>
>   :libcfs:cfs_alloc ...
>   :lvfs:lprocfs_counter_add ...
>   ...
>
>   RIP [<ffffffff88031e64>] :jbd:journal_dirty_metadata+0x7f/0x1e3
>   RSP <ffff8101f99c3670>
>   <0>Kernel panic - not syncing: Fatal exception
>
> (full graphical screenshot attached, hoping it passes through)
>
> Clients sometimes report:
>
>   Message from syslogd@ at Mon Jul 19 04:11:46 2010 ...
>   login4 kernel: journal commit I/O error
>
> I have recently updated to 1.8.3, where I e2fsck'd and re-initialized the
> external journals, but I still get those panics. I use two OSSs with
> heartbeat failover; each one "owns" and normally serves 2 OSTs (4 OSTs
> total), coming from 4 LUNs on a RAID unit with dual controllers. All OSTs
> use ldiskfs (pre-ext4 proper, if I understand correctly), with the
> journals located on partitions of 2 further LUNs. I use a variant of the
> script at bug 20807 to account for the different device numbers of the
> external journals on the two OSSs. Failover usually works, eventually,
> after a load peak of up to 100 on the OSS taking over, and messages about
> hung threads (see below).
>
> Is there anything I could do besides giving up on external journals? My
> data stores are RAID1, and the journal disks are a single pair of disks,
> also in RAID1.
>
> I had difficulty finding further information when googling
> "journal_dirty_metadata" as it pertains to lustre/ldiskfs specifically.
> There are old discussions at:
>
>   https://bugzilla.redhat.com/show_bug.cgi?id=183119 (2007/2008 -
>   kernel 2.4.7)
>
>   http://oss.oracle.com/pipermail/ocfs2-users/2010-January/004113.html
>   (ahem)
>
> With best regards,
> Michael
>
>
> [r...@mds01 ~]# cat /proc/fs/lustre/version
> lustre: 1.8.3
> kernel: patchless_client
> build: 1.8.3-20100409182943-PRISTINE-2.6.18-164.11.1.el5_lustre.1.8.3
>
> [r...@mds01 ~]# uname -r
> 2.6.18-164.11.1.el5_lustre.1.8.3
>
> [r...@mds01 ~]# lctl dl
>   0 UP mgs MGS MGS 725
>   1 UP mgc mgc172.17.12...@o2ib 9642cdcd-4955-ca05-4e85-8a9f6d10c027 5
>   2 UP mdt MDS MDS_uuid 3
>   7 UP lov sandbox-mdtlov sandbox-mdtlov_UUID 4
>   8 UP mds sandbox-MDT0000 sandbox-MDT0000_UUID 719
>   9 UP osc sandbox-OST0000-osc sandbox-mdtlov_UUID 5
>  10 UP osc sandbox-OST0001-osc sandbox-mdtlov_UUID 5
>
> [r...@mds01 ~]# ssh mds02 lctl dl
>   0 UP mgc mgc172.17.12...@o2ib 12c07f8c-f1e7-f739-9983-3c3aa2ec492a 5
>   1 UP mdt MDS MDS_uuid 3
>   2 UP lov carbonfs-mdtlov carbonfs-mdtlov_UUID 4
>   3 UP mds carbonfs-MDT0000 carbonfs-MDT0000_UUID 719
>   4 UP osc carbonfs-OST0001-osc carbonfs-mdtlov_UUID 5
>   5 UP osc carbonfs-OST0000-osc carbonfs-mdtlov_UUID 5
>
> [r...@oss01 ~]# lctl dl
>   0 UP mgc mgc172.17.12...@o2ib a89ba7f9-2a12-ff77-0321-151a1addf043 5
>   1 UP ost OSS OSS_uuid 3
>   2 UP obdfilter sandbox-OST0000 sandbox-OST0000_UUID 721
>   3 UP obdfilter carbonfs-OST0000 carbonfs-OST0000_UUID 721
>   4 UP obdfilter sandbox-OST0001 sandbox-OST0001_UUID 721
>   5 UP obdfilter carbonfs-OST0001 carbonfs-OST0001_UUID 721
>
> [client]# lctl dl
>   0 UP mgc mgc172.17.12...@o2ib dadd88bf-fbad-d933-b02a-a539fd8abfea 5
>   1 UP lov sandbox-clilov-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 4
>   2 UP mdc sandbox-MDT0000-mdc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
>   3 UP osc sandbox-OST0000-osc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
>   4 UP osc sandbox-OST0001-osc-ffff8101da93c400 30094b3a-b246-e667-ef8a-f6690e4d051c 5
>   5 UP lov carbonfs-clilov-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 4
>   6 UP mdc carbonfs-MDT0000-mdc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
>   7 UP osc carbonfs-OST0001-osc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
>   8 UP osc carbonfs-OST0000-osc-ffff81041dc0f800 4aaa2ddd-eee8-564f-ea48-b9c36a428eb9 5
>
>
> --------------------------------------------------
> "thread hung" messages:
> --------------------------------------------------
> Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: Recovery period over after 1:05, of 359 clients 358 recovered and 0 were evicted.
> Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: sending delayed replies to recovered clients
> Jul 19 04:01:05 oss01 kernel: Lustre: carbonfs-OST0001: received MDS connection from 172.17.12...@o2ib
> Jul 19 04:03:36 oss01 kernel: LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 172.17.0....@o2ib ns: filter-carbonfs-OST0001_UUID lock: ffff810178b8d600/0x80a4d28c4aff67ec lrc: 3/0,0 mode: PW/PW res: 152472/0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x10020 remote: 0x4701b34288b619f7 expref: 12 pid: 6806 timeout 4356934523
> Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) carbonfs-OST0001: exp ffff8101ecf94e00 already connecting
> Jul 19 04:04:16 oss01 kernel: Lustre: 6757:0:(ldlm_lib.c:804:target_handle_connect()) Skipped 38 previous similar messages
> Jul 19 04:04:25 oss01 kernel: Lustre: Service thread pid 6145 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
> Jul 19 04:04:25 oss01 kernel: Lustre: Skipped 2 previous similar messages
> Jul 19 04:04:25 oss01 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for process 6145
> Jul 19 04:04:25 oss01 kernel: Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) Skipped 2 previous similar messages
> Jul 19 04:04:25 oss01 kernel: ll_ost_io_38  D 0000000000000000  0  6145  1  6146  6144 (L-TLB)
> Jul 19 04:04:25 oss01 kernel:  ffff810214dcb730 0000000000000046 ffff81021158b220 ffff81017fe44cc0
> Jul 19 04:04:25 oss01 kernel:  ffff81017fe44cc0 000000000000000a ffff810214d4a7e0 ffff810214c0f7e0
> Jul 19 04:04:25 oss01 kernel:  000038708ddb5ba3 0000000000001a95 ffff810214d4a9c8 000000042fe56000
> Jul 19 04:04:25 oss01 kernel: Call Trace:
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff88036769>] :jbd:log_wait_commit+0xa3/0xf5
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff800a00be>] autoremove_wake_function+0x0/0x2e
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff88cc900b>] :fsfilt_ldiskfs:fsfilt_ldiskfs_commit_wait+0xab/0xd0
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff88d0798f>] :obdfilter:filter_commitrw_write+0x19ef/0x2980
> Jul 19 04:04:25 oss01 kernel:  [<ffffffff88ca4ed5>] :ost:ost_checksum_bulk+0x385/0x5b0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca4c38>] :ost:ost_checksum_bulk+0xe8/0x5b0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88cab8da>] :ost:ost_brw_write+0x1c1a/0x23b0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889ec238>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889b7b10>] :ptlrpc:target_committed_to_req+0x40/0x120
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca781d>] :ost:ost_brw_read+0x189d/0x1a70
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889f06c5>] :ptlrpc:lustre_msg_get_version+0x35/0xf0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8008c86b>] default_wake_function+0x0/0xe
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889f0788>] :ptlrpc:lustre_msg_check_version_v2+0x8/0x20
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88caec1e>] :ost:ost_handle+0x2bae/0x53d0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889cde89>] :ptlrpc:__ldlm_add_waiting_lock+0x189/0x1c0
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889cf10e>] :ptlrpc:ldlm_refresh_waiting_lock+0xbe/0x110
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88ca2571>] :ost:ost_prolong_locks_iter+0x151/0x180
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff889fd9b5>] :ptlrpc:ptlrpc_server_handle_request+0xaa5/0x1140
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8003dafe>] lock_timer_base+0x1b/0x3c
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8001ca74>] __mod_timer+0xb0/0xbe
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88a01408>] :ptlrpc:ptlrpc_main+0x1258/0x1420
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8008c86b>] default_wake_function+0x0/0xe
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff800b7076>] audit_syscall_exit+0x336/0x362
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff88a001b0>] :ptlrpc:ptlrpc_main+0x0/0x1420
> Jul 19 04:04:26 oss01 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
> Jul 19 04:04:26 oss01 kernel:
> --------------------------------------------------
>
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517
