Sunil,

After that first attempt I tried several more times and got actual oopses. I think try #3 has the most details.

Try #2:

Oops: 0000 [#1]
SMP
last sysfs file: /firmware/edd/int13_dev80/mbr_signature
Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid cpqphp i2c_piix4 ohci_hcd sworks_agp ide_cd cdrom pci_hotplug i2c_core agpgart usbcore tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
CPU:    0
EIP:    0060:[<c029723e>]    Tainted: P     X VLI
EFLAGS: 00210086   (2.6.16.21-0.8-bigsmp #1)
EIP is at do_page_fault+0x8e/0x5f6
eax: f3f64000   ebx: c02fbc00   ecx: 00000000   edx: 00000000
esi: f3f6605c   edi: c02971b0   ebp: 00000098   esp: f3f64088
ds: 007b   es: 007b   ss: 0068

Try #3:

Oops: 0000 [#1]
SMP
last sysfs file: /firmware/edd/int13_dev80/mbr_signature
Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid i2c_piix4 ide_cd cpqphp cdrom ohci_hcd i2c_core usbcore sworks_agp pci_hotplug agpgart tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
CPU:    2
EIP:    0060:[<c029723e>]    Tainted: P     X VLI
EFLAGS: 00210006   (2.6.16.21-0.8-bigsmp #1)
EIP is at do_page_fault+0x8e/0x5f6
eax: f3f2c000   ebx: 880f0133   ecx: 64656e77   edx: 64656e77
esi: f3f30058   edi: c02971b0   ebp: 64656f0f   esp: f3f2c084
ds: 007b   es: 007b   ss: 0068
Unable to handle kernel paging request at virtual address 01110954
 printing eip:
c029723e
*pde = 33dda001
Unable to handle kernel NULL pointer dereference at virtual address 00000030
 printing eip:
c015c752
*pde = 3629c001
o2net: connection to node node-02 (num 2) at 192.168.1.173:7777 has been idle for 10 seconds, shutting it down.
(10,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1309364991.767445 now 1309365001.767502 dr 1309364996.769068 adv 1309364991.767450:1309364991.767451 func (9987e679:2) 1309364870.220076:1309364870.220078)
o2net: connection to node node-05 (num 4) at 192.168.1.62:7777 has been idle for 10 seconds, shutting it down.
(10,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1309364991.769291 now 1309365001.767537 dr 1309364996.770248 adv 1309364991.769302:1309364991.769303 func (3768d12f:505) 1309364991.769291:1309364991.769296)
Unable to handle kernel paging request at virtual address 4e0b5293
 printing eip:
c024c829
*pde = 36b61001

Try #4:

Unable to handle kernel paging request at virtual address fffffffc
 printing eip:
c016e54e
*pde = 00000000
Oops: 0000 [#1]
SMP
last sysfs file: /firmware/edd/int13_dev80/mbr_signature
Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager ipv6 configfs iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid ide_cd cpqphp cdrom i2c_piix4 ohci_hcd sworks_agp i2c_core usbcore agpgart pci_hotplug tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
CPU:    3
EIP:    0060:[<c016e54e>]    Tainted: P     X VLI
EFLAGS: 00010297   (2.6.16.21-0.8-bigsmp #1)
EIP is at poll_freewait+0xd/0x3a
eax: f5ab5f90   ebx: ffffffe4   ecx: dffff040   edx: c1000000
esi: f31c4000   edi: bffa3bf4   ebp: f34b8310   esp: f5ab5f60
ds: 007b   es: 007b   ss: 0068
Process iscsid (pid: 3206, threadinfo=f5ab4000 task=f54521b0)
Stack: <0>00000000 00000000 c016e85a f5ab5fb0 bffa3bf4 bffa3bf4 00000000 f34b8310
       00000002 00000002 00000000 f34b8300 c016f12a f31c4000 00000000 bffa3be4
       00000000 b7f08ff4 f5ab4000 c016e8a8 00000000 00000000 c0103cab bffa3be4
Call Trace:
 [<c016e85a>] do_sys_poll+0x2df/0x2e9
 [<c016f12a>] __pollwait+0x0/0x95
 [<c016e8a8>] sys_poll+0x44/0x47
 [<c0103cab>] sysenter_past_esp+0x54/0x79
Code: c4 10
89 d8 5b 5e 5f 5d c3 c7 00 2a f1 16 c0 c7 40 08 00 00 00 00 c7 40 04 00 00 00 00 c3 56 53 8b 70 04 eb 2c 8b 5e 04 83 eb 1c <8b> 43 18 8d 53 04 e8 6d 3d fc ff 8b 03 e8 a8 12 ff ff 8d 46 08

----- Original Message -----
From: "B Leggett" <blegg...@ngent.com>
To: ocfs2-users@oss.oracle.com
Sent: Wednesday, June 29, 2011 3:42:42 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

For the list, I accidentally sent it direct to Sunil. My apologies for that.

Bruce

----- Original Message -----
From: "B Leggett" <blegg...@ngent.com>
To: "Sunil Mushran" <sunil.mush...@oracle.com>
Sent: Wednesday, June 29, 2011 3:40:52 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

Sunil,

I did as you requested and got one line of output.

o2net: accepted connection from node node-05 (num 4) at 192.168.1.62:7777

Bruce

----- Original Message -----
From: "Sunil Mushran" <sunil.mush...@oracle.com>
To: "B Leggett" <blegg...@ngent.com>
Cc: ocfs2-users@oss.oracle.com
Sent: Wednesday, June 29, 2011 2:42:08 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

1.2.1? That's 5 years old. We've had a few fixes since then. ;)

You have to catch the oops trace to figure out the reason. And one way to get it is by using netconsole. Check the sles10 docs to see how to configure netconsole. Or, whatever is recommended for capturing the oops log in that release.

On 06/29/2011 11:28 AM, B Leggett wrote:
> Hi,
> I am running the OCFS2 1.2.1 on SLES 10, just the stuff right out of the box.
> This is a 3 node cluster that's been running for 2 years with just about zero
> modification. The storage is a high end SAN and the transport is iscsi. We
> went two years without an issue and all of a sudden node 1 in the cluster keeps
> crashing. I have never had to troubleshoot OCFS2, so I started with what I
> could control.
>
> I checked /var/log/messages and nothing there suggests a problem. I replaced
> hardware that went as far as me popping the scsi drives out and putting them
> in another server and trying it with all new hardware. The problem still
> persists.
>
> I had the network team check the iscsi port on the private iscsi network and
> they are not seeing errors.
>
> I've checked the few OCFS2 settings in play and they all look good.
>
> My question to the group is how do I continue troubleshooting this issue? I'm
> not aware of any native logs etc to reference. I would appreciate any help
> that gets this diagnosis moving to a solution.
>
> Thanks,
> Bruce

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
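
A note on reading the o2net_idle_timer lines from try #3: the sketch below pulls the "tmr" and "now" values out of such a line and subtracts them. It assumes that "tmr" is the time the idle timer was armed and "now" is the time it fired; that interpretation and the helper name idle_seconds are assumptions made for illustration, not something stated in the log itself. Decimal is used so the microsecond digits are not lost to floating-point rounding.

```python
import re
from decimal import Decimal

# Matches the "tmr <sec.usec> now <sec.usec>" pair in an o2net_idle_timer line.
TIMER_RE = re.compile(r"tmr (\d+\.\d+) now (\d+\.\d+)")


def idle_seconds(line):
    """Return now - tmr in seconds, or None if the line has no such pair.

    Hypothetical helper: assumes tmr is when the idle timer was armed
    and now is when it fired.
    """
    match = TIMER_RE.search(line)
    if match is None:
        return None
    tmr, now = (Decimal(group) for group in match.groups())
    return now - tmr


# Fragment copied from the first o2net_idle_timer line in try #3.
sample = "(tmr 1309364991.767445 now 1309365001.767502 dr 1309364996.769068"
print(idle_seconds(sample))
```

For the node-02 line this gives a gap of just over 10 seconds, which agrees with the "has been idle for 10 seconds" message printed right before it.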