That's troubling; these are really static systems. I know anything can happen, but inheriting a kernel issue two years later seems nuts. Not that your analysis is wrong, it just blows me away is all. Is there a chance I would be better off removing this node and replacing it with a fresh build?

----- Original Message -----
From: "Sunil Mushran" <sunil.mush...@oracle.com>
To: "B Leggett" <blegg...@ngent.com>
Cc: ocfs2-users@oss.oracle.com
Sent: Wednesday, June 29, 2011 5:23:40 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

You should ping your kernel vendor. While this does not look ocfs2 related,
even if it did, you will be first asked to upgrade to a more recent kernel,
etc. And all those bits will come from the vendor.

On 06/29/2011 02:20 PM, B Leggett wrote:
> Sunil,
> After that first attempt I tried several more times and got actual oops. I
> think try #3 has the most details.
>
> Try #2:
>
> Oops: 0000 [#1]
> SMP
> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager
> configfs ipv6 iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac
> apparmor aamatch_pcre loop dm_mod netconsole usbhid cpqphp i2c_piix4 ohci_hcd
> sworks_agp ide_cd cdrom pci_hotplug i2c_core agpgart usbcore tg3 reiserfs edd
> fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
> CPU: 0
> EIP: 0060:[<c029723e>] Tainted: P X VLI
> EFLAGS: 00210086 (2.6.16.21-0.8-bigsmp #1)
> EIP is at do_page_fault+0x8e/0x5f6
> eax: f3f64000 ebx: c02fbc00 ecx: 00000000 edx: 00000000
> esi: f3f6605c edi: c02971b0 ebp: 00000098 esp: f3f64088
> ds: 007b es: 007b ss: 0068
>
>
> Try #3:
>
> Oops: 0000 [#1]
> SMP
> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager
> configfs ipv6 iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac
> apparmor aamatch_pcre loop dm_mod netconsole usbhid i2c_piix4 ide_cd cpqphp
> cdrom ohci_hcd i2c_core usbcore sworks_agp pci_hotplug agpgart tg3 reiserfs
> edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
> CPU: 2
> EIP: 0060:[<c029723e>] Tainted: P X VLI
> EFLAGS: 00210006 (2.6.16.21-0.8-bigsmp #1)
> EIP is at do_page_fault+0x8e/0x5f6
> eax: f3f2c000 ebx: 880f0133 ecx: 64656e77 edx: 64656e77
> esi: f3f30058 edi: c02971b0 ebp: 64656f0f esp: f3f2c084
> ds: 007b es: 007b ss: 0068
> Unable to handle kernel paging request at virtual address 01110954
> printing eip:
> c029723e
> *pde = 33dda001
> Unable to handle kernel NULL pointer dereference at virtual address 00000030
> printing eip:
> c015c752
> *pde = 3629c001
> o2net: connection to node node-02 (num 2) at 192.168.1.173:7777 has been idle
> for 10 seconds, shutting it down.
> (10,0):o2net_idle_timer:1309 here are some times that might help debug the
> situation: (tmr 1309364991.767445 now 1309365001.767502 dr 1309364996.769068
> adv 1309364991.767450:1309364991.767451 func (9987e679:2)
> 1309364870.220076:1309364870.220078)
> o2net: connection to node node-05 (num 4) at 192.168.1.62:7777 has been idle
> for 10 seconds, shutting it down.
> (10,0):o2net_idle_timer:1309 here are some times that might help debug the
> situation: (tmr 1309364991.769291 now 1309365001.767537 dr 1309364996.770248
> adv 1309364991.769302:1309364991.769303 func (3768d12f:505)
> 1309364991.769291:1309364991.769296)
> Unable to handle kernel paging request at virtual address 4e0b5293
> printing eip:
> c024c829
> *pde = 36b61001
>
> Try #4:
>
> Unable to handle kernel paging request at virtual address fffffffc
> printing eip:
> c016e54e
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager ipv6
> configfs iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac
> apparmor aamatch_pcre loop dm_mod netconsole usbhid ide_cd cpqphp cdrom
> i2c_piix4 ohci_hcd sworks_agp i2c_core usbcore agpgart pci_hotplug tg3
> reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk
> ide_core
> CPU: 3
> EIP: 0060:[<c016e54e>] Tainted: P X VLI
> EFLAGS: 00010297 (2.6.16.21-0.8-bigsmp #1)
> EIP is at poll_freewait+0xd/0x3a
> eax: f5ab5f90 ebx: ffffffe4 ecx: dffff040 edx: c1000000
> esi: f31c4000 edi: bffa3bf4 ebp: f34b8310 esp: f5ab5f60
> ds: 007b es: 007b ss: 0068
> Process iscsid (pid: 3206, threadinfo=f5ab4000 task=f54521b0)
> Stack:<0>00000000 00000000 c016e85a f5ab5fb0 bffa3bf4 bffa3bf4 00000000
> f34b8310
> 00000002 00000002 00000000 f34b8300 c016f12a f31c4000 00000000
> bffa3be4
> 00000000 b7f08ff4 f5ab4000 c016e8a8 00000000 00000000 c0103cab
> bffa3be4
> Call Trace:
> [<c016e85a>] do_sys_poll+0x2df/0x2e9
> [<c016f12a>] __pollwait+0x0/0x95
> [<c016e8a8>] sys_poll+0x44/0x47
> [<c0103cab>] sysenter_past_esp+0x54/0x79
> Code: c4 10 89 d8 5b 5e 5f 5d c3 c7 00 2a f1 16 c0 c7 40 08 00 00 00 00 c7 40
> 04 00 00 00 00 c3 56 53 8b 70 04 eb 2c 8b 5e 04 83 eb 1c<8b> 43 18 8d 53 04
> e8 6d 3d fc ff 8b 03 e8 a8 12 ff ff 8d 46 08
>
> ----- Original Message -----
> From: "B Leggett"<blegg...@ngent.com>
> To: ocfs2-users@oss.oracle.com
> Sent: Wednesday, June 29, 2011 3:42:42 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> For the list, I accidentally sent it direct to Sunil. My apologies for that.
>
> Bruce
> ----- Original Message -----
> From: "B Leggett"<blegg...@ngent.com>
> To: "Sunil Mushran"<sunil.mush...@oracle.com>
> Sent: Wednesday, June 29, 2011 3:40:52 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> Sunil,
> I did as you requested and got one line of output.
>
> o2net: accepted connection from node node-05 (num 4) at 192.168.1.62:7777
>
> Bruce
> ----- Original Message -----
> From: "Sunil Mushran"<sunil.mush...@oracle.com>
> To: "B Leggett"<blegg...@ngent.com>
> Cc: ocfs2-users@oss.oracle.com
> Sent: Wednesday, June 29, 2011 2:42:08 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> 1.2.1? That's 5 years old. We've had a few fixes since then. ;)
>
> You have to catch the oops trace to figure out the reason. And one
> way to get it is by using netconsole. Check the sles10 docs to see how to
> configure netconsole. Or, whatever is recommended for capturing the
> oops log in that release.
>
> On 06/29/2011 11:28 AM, B Leggett wrote:
>> Hi,
>> I am running the OCFS2 1.2.1 on SLES 10, just the stuff right out of the
>> box. This is a 3 node cluster that's been running for 2 years with just
>> about zero modification. The storage is a high end SAN and the transport is
>> iscsi. We went two years without an issue and all of a sudden node 1 in the
>> cluster keeps crashing. I have never had to troubleshoot OCFS2, so I started
>> with what I could control.
>>
>> I checked /var/log/messages and nothing there suggests a problem. I replaced
>> hardware that went as far as me popping the scsi drives out and putting them
>> in another server and trying it with all new hardware. The problem still
>> persists.
>>
>> I had the network team check the iscsi port on the private iscsi network and
>> they are not seeing errors.
>>
>> I've checked the few OCFS2 settings in play and they all look good.
>>
>> My question to the group is how do I continue troubleshooting this issue?
>> I'm not aware of any native logs etc to reference. I would appreciate any
>> help that gets this diagnosis moving to a solution.
>>
>> Thanks,
>> Bruce
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
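
A note for anyone following the netconsole suggestion quoted above: netconsole sends kernel log messages as UDP datagrams to a remote host, so the receiving side can be very small. Below is a rough sketch of such a receiver in Python, not anything shipped with SLES or OCFS2. The port 6666 is an assumption (it has to match whatever the crashing node's netconsole is configured to send to), and the names `open_listener` and `capture` are made up for illustration.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole datagrams.

    The port is an assumption: use whatever target port the sending
    node's netconsole configuration specifies.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, out, max_datagrams=None):
    """Write each received datagram to `out`, prefixed with the sender's IP.

    Flushes after every datagram so the tail of an oops is on disk even
    if the sender dies mid-trace. Runs forever unless max_datagrams is set.
    Returns the number of datagrams handled.
    """
    received = 0
    while max_datagrams is None or received < max_datagrams:
        data, (sender, _port) = sock.recvfrom(65535)
        text = data.decode("utf-8", errors="replace")
        out.write(f"[{sender}] {text}")
        if not text.endswith("\n"):
            out.write("\n")
        out.flush()
        received += 1
    return received
```

On a second machine this could be run as `capture(open_listener(), open("netconsole.log", "a"))` (the log file name is arbitrary) and left running until the node crashes again. A syslog daemon or netcat listening on the same UDP port would do the same job; the script only has the advantage of tagging each line with the sending node's address.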