Hi list,

One of our OCFS2 servers crashed with this message:
Aug 26 11:33:11 soap01 kernel: Assertion failure in journal_dirty_metadata() at fs/jbd/transaction.c:1114: "handle->h_buffer_credits > 0"
Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Aug 26 11:33:11 soap01 kernel: Kernel BUG at fs/jbd/transaction.c:1114
Aug 26 11:33:11 soap01 kernel: invalid opcode: 0000 [1] SMP
Aug 26 11:33:11 soap01 kernel: last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
Aug 26 11:33:11 soap01 kernel: CPU 0
Aug 26 11:33:11 soap01 kernel: Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6 button battery ac netconsole xt_comment xt_tcpudp xt_state iptable_filter iptable_mangle iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid usb_storage hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom pci_hotplug ehci_hcd scsi_transport_spi usbcore bnx2 reiserfs ata_piix ahci libata dm_snapshot qla2xxx firmware_class qla2xxx_conf intermodule edd dm_mod fan thermal processor sg megaraid_sas piix sd_mod scsi_mod ide_disk ide_core
Aug 26 11:33:11 soap01 kernel: Pid: 4874, comm: nfsd Tainted: G U 2.6.16.60-0.21-smp #1
Aug 26 11:33:11 soap01 kernel: RIP: 0010:[<ffffffff885e21e0>] <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200}
Aug 26 11:33:11 soap01 kernel: RSP: 0018:ffff81021e9f1c18 EFLAGS: 00010292
Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX: ffff8101decf30c0 RCX: 0000000000000292
Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI: 0000000000000296 RDI: ffffffff80359960
Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08: ffffffff80359968 R09: ffff810221d3ad80
Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11: 0000000000000070 R12: ffff8101dda21588
Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14: ffff8102277ab400 R15: ffff8100a4dd394c
Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000) GS:ffffffff803d3000(0000) knlGS:0000000000000000
Aug 26 11:33:11 soap01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3: 000000015e180000 CR4: 00000000000006e0
Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo ffff81021e9f0000, task ffff81021f92e860)
Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870 ffff8101dda21588 0000000000000000 0000000000000003
Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f 0000000000000000 ffff81018ba52040
Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a
Aug 26 11:33:11 soap01 kernel: Call Trace: <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106}
Aug 26 11:33:11 soap01 kernel: <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745} <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710}
Aug 26 11:33:11 soap01 kernel: <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127} <ffffffff80192b48>{vfs_mkdir+346}
Aug 26 11:33:11 soap01 kernel: <ffffffff88522f05>{:nfsd:nfsd_create+753} <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217}
Aug 26 11:33:11 soap01 kernel: <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216} <ffffffff884d549a>{:sunrpc:svc_process+982}
Aug 26 11:33:11 soap01 kernel: <ffffffff802ea247>{__down_read+21} <ffffffff8851e46e>{:nfsd:nfsd+0}
Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463} <ffffffff8010bed2>{child_rip+8}
Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0} <ffffffff8851e46e>{:nfsd:nfsd+0}
Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0}
Aug 26 11:33:11 soap01 kernel:
Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a
5e 88 c2 5a 04 41 ff 4c 24 08 49 39 5d 28 75
Aug 26 11:33:11 soap01 kernel: RIP <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP <ffff81021e9f1c18>

Operating system: SuSE SLES 10 SP1
Kernel: 2.6.16.60-0.21-smp
OCFS2: 1.4.0-SLES

Environment:
* 2 FreeBSD 7.1-RELEASE-p2 NFS clients
* 2 SLES 10 SP1 servers exporting the filesystem

The FreeBSD clients are our email servers, so the traffic is mostly many small email files. NFS is mounted with protocol version 3, readdirplus disabled, and 32k read and write buffers (rough sketches of the mount line, the checks I ran during the hang, and the o2cb setup are appended after my signature).

Pre-crash symptoms:
* The OCFS2 filesystem hung for a while or got very slow;
* Low or no device traffic on both nodes (checked with `iostat`);
* The server load climbed 5 to 6 points;
* It seems that something in the kernel deadlocked, as other processes (doing I/O, but on other mount points with reiserfs) hogged a CPU at 100% usage. E.g.: there is a MySQL database on a reiserfs mount point, and mysqld hogged the CPU when I called `rcmysql stop`;
* Calling `reboot` or `shutdown -r now` blocked the console (I didn't try running it under strace to find the locking point, but I will if it happens again);
* imapd on the clients blocked on NFS requests.

One of the imapd processes was blocked in the (FreeBSD kernel) state bo_wwa. From what I found in some discussion groups on the net, this state means the process is blocked by a stale NFS server. Attaching to the process with `gdb`, it is always blocked in the libc close() call.

imapd process backtrace:
#0  0x282a5da3 in close () from /lib/libc.so.7
#1  0x282a5711 in memcpy () from /lib/libc.so.7
#2  0xbfbf9378 in ?? ()
#3  0x2828d58d in fclose () from /lib/libc.so.7

Could it be related to the o2cb configuration? Current configuration:

O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=60000

The heartbeat network is a gigabit Ethernet link.

Regards,

--
Sérgio Surkamp | Network Manager
ser...@gruposinternet.com.br

Grupos Internet S.A.
R. Lauro Linhares, 2123 Torre B - Sala 201
Trindade - Florianópolis - SC
+55 48 3234-4109
http://www.gruposinternet.com.br
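P.S. A few concrete details appended here in case they help. The client mount options described above look roughly like the following /etc/fstab line on the FreeBSD boxes; the server name and paths are placeholders, not our real ones:

# server name and paths are placeholders
# NFSv3, readdirplus left off (no rdirplus option), 32k read/write buffers
nfsserver:/export/mail  /var/spool/mail  nfs  rw,nfsv3,rsize=32768,wsize=32768  0  0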
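The checks I ran on the servers while the filesystem was hung were along these lines; the exact options are just examples, and the D-state listing is my generic way of spotting tasks stuck in uninterruptible sleep, nothing OCFS2-specific:

iostat -x 5                      # device traffic, which was near zero on both nodes during the hang
top                              # load 5-6 points above normal, I/O-bound tasks (e.g. mysqld) pinned at 100% CPU
ps axo pid,stat,wchan:32,comm    # look for processes stuck in D state (uninterruptible sleep)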
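The imapd backtrace above came from attaching gdb to one of the blocked processes on a FreeBSD client, roughly like this (1234 stands in for the real PID):

ps axl | grep imapd   # the wait-channel column shows the kernel state (bo_wwa here)
gdb -p 1234           # 1234 is a placeholder PID; attach to the blocked process
(gdb) bt              # user-space backtrace, always ending in close()
(gdb) detach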
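Finally, the o2cb values quoted above live in /etc/sysconfig/o2cb on the SLES nodes. If the consensus is that they should be changed, I would re-run the interactive setup and restart the stack, something like:

grep -E 'O2CB_(HEARTBEAT_THRESHOLD|IDLE_TIMEOUT_MS)' /etc/sysconfig/o2cb
/etc/init.d/o2cb configure    # prompts for heartbeat dead threshold, idle timeout, etc.
/etc/init.d/o2cb restart      # needs the OCFS2 volumes unmounted first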