Hello Mr. Sunil and list, Last weekend we have upgraded our servers to SLES10SP3 and with it the OCFS2 version 1.4.1-1.
During the upgrade we have run the fsck in the volume (using the new OCFS2 version) and it just died in "Pass 1c" with the error: fsck.ocfs2: Unknown code while performing pass 1 Checking the dmesg there was this messages repeated many times: (5713,1):ocfs2_orphan_del:1869 ERROR: status = -2 (5713,1):ocfs2_remove_inode:614 ERROR: status = -2 (5713,1):ocfs2_wipe_inode:740 ERROR: status = -2 (5713,1):ocfs2_delete_inode:974 ERROR: status = -2 Could this mean that our filesystem is corrupted? Could it be the cause of this bug to happen? Current versions: OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611) OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611) OCFS2 User DLM kernel interface loaded bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready bnx2: eth1 NIC Copper Link is Up, 100 Mbps full duplex, receive & transmit flow control ON ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready o2net: connected to node soap02 (num 0) at [--cut--]:7777 OCFS2 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build f922955d99ef972235bd0c1fc236c5ddbb368611) Latest crash dmesg message: Assertion failure in journal_dirty_metadata() at fs/jbd/transaction.c:1114: "handle->h_buffer_credits > 0" ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at fs/jbd/transaction.c:1114 invalid opcode: 0000 [1] SMP last sysfs file: /block/sdb/sdb1/stat CPU 2 Modules linked in: nls_iso8859_1 nls_cp437 vfat fat nls_utf8 nfsd exportfs lockd nfs_acl sunrpc iptable_nat ip_nat iptable_mangle xt_comment xt_tcpudp xt_state ip_conntrack nfnetlink iptable_filter ip_tables x_tables joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ipv6 dock button battery ac apparmor sr_mod loop usbhid usb_storage uhci_hcd ide_cd shpchp cdrom ehci_hcd hw_random usbcore pci_hotplug bnx2 reiserfs ata_piix ahci libata dm_snapshot edd dm_mod fan thermal processor qla2xxx firmware_class scsi_transport_fc st aic7xxx scsi_transport_spi sg megaraid_sas piix sd_mod scsi_mod ide_disk ide_core Pid: 29322, comm: nfsd Not tainted 2.6.16.60-0.58.1-smp #1 RIP: 0010:[<ffffffff883f11e0>] <ffffffff883f11e0>{:jbd:journal_dirty_metadata+200} RSP: 0018:ffff8101ca763c18 EFLAGS: 00010292 RAX: 000000000000006e RBX: ffff8100cc920a80 RCX: ffffffff80361c68 RDX: ffffffff80361c68 RSI: 0000000000000296 RDI: ffffffff80361c60 RBP: ffff81012e3978c8 R08: ffffffff80361c68 R09: ffff8102244cd540 R10: 0000000000000292 R11: 0000000000000070 R12: ffff8101f00274f8 R13: ffff810074c031f0 R14: ffff810226624a00 R15: ffff81012352653c FS: 00002ae6d8b156d0(0000) GS:ffff81022856a2c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00002aaaafb2d000 CR3: 000000009e42a000 CR4: 00000000000006e0 Process nfsd (pid: 29322, threadinfo ffff8101ca762000, task ffff8102284de810) Stack: ffff81012e3978c8 ffff8101f00274f8 0000000000000000 0000000000000005 ffff81019ea88000 ffffffff8843264a 0000000000000000 ffff81019ea88050 ffff81000b55cc20 ffffffff8841b445 Call Trace: <ffffffff8843264a>{:ocfs2:ocfs2_journal_dirty+106} <ffffffff8841b445>{:ocfs2:__ocfs2_add_entry+745} <ffffffff884396a9>{:ocfs2:ocfs2_mknod+1710} <ffffffff88439988>{:ocfs2:ocfs2_mkdir+127} <ffffffff80194075>{vfs_mkdir+346} <ffffffff88526ef5>{:nfsd:nfsd_create+753} <ffffffff8852dc63>{:nfsd:nfsd3_proc_mkdir+217} <ffffffff885220ea>{:nfsd:nfsd_dispatch+216} <ffffffff884d975b>{:sunrpc:svc_process+982} <ffffffff802f0c59>{__down_read+21} <ffffffff88522454>{:nfsd:nfsd+0} <ffffffff88522623>{:nfsd:nfsd+463} <ffffffff8010bea6>{child_rip+8} <ffffffff88522454>{:nfsd:nfsd+0} <ffffffff88522454>{:nfsd:nfsd+0} <ffffffff8010be9e>{child_rip+0} Code: 0f 0b 68 d9 7a 3f 88 c2 5a 04 41 ff 4c 24 08 49 39 5d 28 75 RIP <ffffffff883f11e0>{:jbd:journal_dirty_metadata+200} RSP <ffff8101ca763c18> Bottom line: We are contacting Novell by phone support to check why OCFS2 version isn't the latest in SLES10SP3. Regards, Sérgio Em Tue, 01 Sep 2009 11:14:37 -0700 Sunil Mushran <sunil.mush...@oracle.com> escreveu: > For issues on sles, please file a bug/sr with novell. > > The issue here is insufficient journal credits. It _could_ be that > this version > is missing mainline git commit > e051fda4fd14fe878e6d2183b3a4640febe9e9a8. But I don't know. Novell > Support will be better placed to track down the issue. > > Sérgio Surkamp wrote: > > Hi list, > > > > One of our OCFS2 servers crashed with this message: > > > > Aug 26 11:33:11 soap01 kernel: Assertion failure in > > journal_dirty_metadata() at fs/jbd/transaction.c:1114: > > "handle->h_buffer_credits > 0" > > Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] --------- > > [please bite here ] --------- > > Aug 26 11:33:11 soap01 kernel: Kernel BUG at > > fs/jbd/transaction.c:1114 Aug 26 11:33:11 soap01 kernel: invalid > > opcode: 0000 [1] SMP Aug 26 11:33:11 soap01 kernel: last sysfs > > file: /devices/pci0000:00/0000:00:00.0/irq > > Aug 26 11:33:11 soap01 kernel: CPU 0 Aug 26 11:33:11 soap01 kernel: > > Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm > > ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6 > > button battery ac netconsole xt_comment xt_tcpudp xt_state > > iptable_filter iptable_mangle iptable_nat ip_nat ip_ conntrack > > nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid > > usb_storage hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom > > pci_hotplug ehci_hcd scsi_transport_spi usbcore bnx2 reiserfs > > ata_piix ahci libata dm_snapshot qla2xxx firmware_class > > qla2xxx_conf intermodule edd d m_mod fan thermal processor sg > > megaraid_sas piix sd_mod scsi_mod ide_disk ide_core Aug 26 11:33:11 > > soap01 kernel: Pid: 4874, comm: nfsd Tainted: G U > > 2.6.16.60-0.21-smp #1 Aug 26 11:33:11 soap01 kernel: RIP: > > 0010:[<ffffffff885e21e0>] > > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} Aug 26 11:33:11 > > soap01 kernel: RSP: 0018:ffff81021e9f1c18 EFLAGS: 00010292 > > Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX: > > ffff8101decf30c0 RCX: 0000000000000292 > > Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI: > > 0000000000000296 RDI: ffffffff80359960 > > Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08: > > ffffffff80359968 R09: ffff810221d3ad80 > > Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11: > > 0000000000000070 R12: ffff8101dda21588 > > Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14: > > ffff8102277ab400 R15: ffff8100a4dd394c > > Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000) > > GS:ffffffff803d3000(0000) knlGS:0000000000000000 > > Aug 26 11:33:11 soap01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: > > 000000008005003b > > Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3: > > 000000015e180000 CR4: 00000000000006e0 > > Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo > > ffff81021e9f0000, task ffff81021f92e860) > > Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870 > > ffff8101dda21588 0000000000000000 0000000000000003 > > Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f > > 0000000000000000 ffff81018ba52040 > > Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a > > Aug 26 11:33:11 soap01 kernel: Call Trace: > > <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106} > > Aug 26 11:33:11 soap01 kernel: > > <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745} > > <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710} > > Aug 26 11:33:11 soap01 kernel: > > <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127} > > <ffffffff80192b48>{vfs_mkdir+346} > > Aug 26 11:33:11 soap01 kernel: > > <ffffffff88522f05>{:nfsd:nfsd_create+753} > > <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217} Aug 26 11:33:11 > > soap01 kernel: <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216} > > <ffffffff884d549a>{:sunrpc:svc_process+982} > > Aug 26 11:33:11 soap01 kernel: > > <ffffffff802ea247>{__down_read+21} <ffffffff8851e46e>{:nfsd:nfsd+0} > > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463} > > <ffffffff8010bed2>{child_rip+8} > > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0} > > <ffffffff8851e46e>{:nfsd:nfsd+0} > > Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0} > > Aug 26 11:33:11 soap01 kernel: > > Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04 > > 41 ff 4c 24 08 49 39 5d 28 75 > > Aug 26 11:33:11 soap01 kernel: RIP > > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP > > <ffff81021e9f1c18> > > > > Operating system: SuSE SLES 10SP1 > > Kernel: 2.6.16.60-0.21-smp > > OCFS2: 1.4.0-SLES > > > > Environment: > > > > * 2 FreeBSD 7.1-RELEASE-p2 NFS Clients > > * 2 SLES 10SP1 exporting the filesystem > > > > The FreeBSD clients are our email servers, so the main traffic is > > many small email files. > > > > NFS mounted with protocol version 3, readdirplus disabled, read and > > write buffer of 32k. > > > > Pre-crash symptoms: > > * The ocfs filesystem hung for a while or gets very slow; > > * Low or null device traffic on both nodes (checked with `iostat`); > > * The server load get 5 to 6 points higher; > > * It seems that something in kernel deadlock, as other processes > > (doing IO, but in other mount points with raiserfs) hug a CPU with > > 100% usage; > > Eg: There is a mysql database in raiserfs mount point and the > > mysqld hug the CPU when I call `rcmysql stop`; > > * Calling `reboot` or `shutdown -r now` blocks the console (didn't > > tried to run it with strace to get the locking point, but will try > > it happen again); > > * imapd on clients blocked in nfs requests; > > One of the processes was blocked in (FreeBSD kernel) state bo_wwa. > > Looking in some discussion group's over the net, this state means > > blocked by stale NFS server. Attaching to the process with `gdb`, > > its always blocked in close() libc call; > > > > imapd process backtrace: > > #0 0x282a5da3 in close () from /lib/libc.so.7 > > #1 0x282a5711 in memcpy () from /lib/libc.so.7 > > #2 0xbfbf9378 in ?? () > > #3 0x2828d58d in fclose () from /lib/libc.so.7 > > > > Could it be related to o2cb configuration? Current configuration: > > > > O2CB_HEARTBEAT_THRESHOLD=61 > > O2CB_IDLE_TIMEOUT_MS=60000 > > > > The heartbeat network is a GBit ethernet. > > > > Regards, > > -- .:''''':. .:' ` Sérgio Surkamp | Gerente de Rede :: ........ ser...@gruposinternet.com.br `:. .:' `:, ,.:' *Grupos Internet S.A.* `: :' R. Lauro Linhares, 2123 Torre B - Sala 201 : : Trindade - Florianópolis - SC :.' :: +55 48 3234-4109 : ' http://www.gruposinternet.com.br _______________________________________________ Ocfs2-users mailing list Ocfs2-users@oss.oracle.com http://oss.oracle.com/mailman/listinfo/ocfs2-users