On Wed, 29 Jun 2011 16:43:09 -0500 (GMT-05:00), B Leggett wrote:
> That's troubling, these are really static systems. I know anything
> can happen, but to inherit a kernel issue two years later seems nuts.
> Not that your analysis is wrong, just blows me away is all. Is there
> a chance I would be better off removing this node and replacing it
> with a fresh build?

As your oopses all look different, I'd first replace all RAM on the
node in question. I have had machines behave this strangely with
faulty RAM several times.

Just my 2c.

Best regards,
Jürgen

>
> ----- Original Message -----
> From: "Sunil Mushran" <sunil.mush...@oracle.com>
> To: "B Leggett" <blegg...@ngent.com>
> Cc: ocfs2-users@oss.oracle.com
> Sent: Wednesday, June 29, 2011 5:23:40 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> You should ping your kernel vendor. While this does not look ocfs2
> related, even if it did, you will be first asked to upgrade to a more
> recent kernel, etc. And all those bits will come from the vendor.
>
> On 06/29/2011 02:20 PM, B Leggett wrote:
>> Sunil,
>> After that first attempt I tried several more times and got actual
>> oopses. I think try #3 has the most details.
>>
>> Try #2:
>>
>> Oops: 0000 [#1]
>> SMP
>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm
>> ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi
>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop
>> dm_mod netconsole usbhid cpqphp i2c_piix4 ohci_hcd sworks_agp ide_cd
>> cdrom pci_hotplug i2c_core agpgart usbcore tg3 reiserfs edd fan
>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>> CPU: 0
>> EIP: 0060:[<c029723e>] Tainted: P X VLI
>> EFLAGS: 00210086 (2.6.16.21-0.8-bigsmp #1)
>> EIP is at do_page_fault+0x8e/0x5f6
>> eax: f3f64000 ebx: c02fbc00 ecx: 00000000 edx: 00000000
>> esi: f3f6605c edi: c02971b0 ebp: 00000098 esp: f3f64088
>> ds: 007b es: 007b ss: 0068
>>
>>
>> Try #3:
>>
>> Oops: 0000 [#1]
>> SMP
>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm
>> ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi
>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop
>> dm_mod netconsole usbhid i2c_piix4 ide_cd cpqphp cdrom ohci_hcd
>> i2c_core usbcore
>> sworks_agp pci_hotplug agpgart tg3 reiserfs edd fan
>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>> CPU: 2
>> EIP: 0060:[<c029723e>] Tainted: P X VLI
>> EFLAGS: 00210006 (2.6.16.21-0.8-bigsmp #1)
>> EIP is at do_page_fault+0x8e/0x5f6
>> eax: f3f2c000 ebx: 880f0133 ecx: 64656e77 edx: 64656e77
>> esi: f3f30058 edi: c02971b0 ebp: 64656f0f esp: f3f2c084
>> ds: 007b es: 007b ss: 0068
>> Unable to handle kernel paging request at virtual address 01110954
>> printing eip:
>> c029723e
>> *pde = 33dda001
>> Unable to handle kernel NULL pointer dereference at virtual address
>> 00000030
>> printing eip:
>> c015c752
>> *pde = 3629c001
>> o2net: connection to node node-02 (num 2) at 192.168.1.173:7777 has
>> been idle for 10 seconds, shutting it down.
>> (10,0):o2net_idle_timer:1309 here are some times that might help
>> debug the situation: (tmr 1309364991.767445 now 1309365001.767502 dr
>> 1309364996.769068 adv 1309364991.767450:1309364991.767451 func
>> (9987e679:2) 1309364870.220076:1309364870.220078)
>> o2net: connection to node node-05 (num 4) at 192.168.1.62:7777 has
>> been idle for 10 seconds, shutting it down.
>> (10,0):o2net_idle_timer:1309 here are some times that might help
>> debug the situation: (tmr 1309364991.769291 now 1309365001.767537 dr
>> 1309364996.770248 adv 1309364991.769302:1309364991.769303 func
>> (3768d12f:505) 1309364991.769291:1309364991.769296)
>> Unable to handle kernel paging request at virtual address 4e0b5293
>> printing eip:
>> c024c829
>> *pde = 36b61001
>>
>> Try #4
>>
>> Unable to handle kernel paging request at virtual address fffffffc
>> printing eip:
>> c016e54e
>> *pde = 00000000
>> Oops: 0000 [#1]
>> SMP
>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm
>> ocfs2_nodemanager ipv6 configfs iscsi_tcp libiscsi
>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop
>> dm_mod netconsole usbhid ide_cd cpqphp cdrom i2c_piix4 ohci_hcd
>> sworks_agp i2c_core usbcore agpgart pci_hotplug tg3 reiserfs edd fan
>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>> CPU: 3
>> EIP: 0060:[<c016e54e>] Tainted: P X VLI
>> EFLAGS: 00010297 (2.6.16.21-0.8-bigsmp #1)
>> EIP is at poll_freewait+0xd/0x3a
>> eax: f5ab5f90 ebx: ffffffe4 ecx: dffff040 edx: c1000000
>> esi: f31c4000 edi: bffa3bf4 ebp: f34b8310 esp: f5ab5f60
>> ds: 007b es: 007b ss: 0068
>> Process iscsid (pid: 3206, threadinfo=f5ab4000 task=f54521b0)
>> Stack:<0>00000000 00000000 c016e85a f5ab5fb0 bffa3bf4 bffa3bf4 00000000 f34b8310
>> 00000002 00000002 00000000 f34b8300 c016f12a f31c4000 00000000 bffa3be4
>> 00000000 b7f08ff4 f5ab4000 c016e8a8 00000000 00000000 c0103cab bffa3be4
>> Call Trace:
>> [<c016e85a>] do_sys_poll+0x2df/0x2e9
>> [<c016f12a>] __pollwait+0x0/0x95
>> [<c016e8a8>] sys_poll+0x44/0x47
>> [<c0103cab>] sysenter_past_esp+0x54/0x79
>> Code: c4 10 89 d8 5b 5e 5f 5d c3 c7 00 2a f1 16 c0 c7 40 08 00 00 00
>> 00 c7 40 04 00 00 00 00 c3 56 53 8b 70 04 eb 2c 8b 5e 04 83 eb 1c<8b>
>> 43 18 8d 53 04 e8 6d 3d fc ff 8b 03 e8 a8 12 ff ff 8d 46 08
>>
>> ----- Original Message -----
>> From: "B Leggett"<blegg...@ngent.com>
>> To: ocfs2-users@oss.oracle.com
>> Sent: Wednesday, June 29, 2011 3:42:42 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>
>> For the list, I accidentally sent it direct to Sunil. My apologies
>> for that.
>>
>> Bruce
>> ----- Original Message -----
>> From: "B Leggett"<blegg...@ngent.com>
>> To: "Sunil Mushran"<sunil.mush...@oracle.com>
>> Sent: Wednesday, June 29, 2011 3:40:52 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>
>> Sunil,
>> I did as you requested and got one line of output.
>>
>> o2net: accepted connection from node node-05 (num 4) at
>> 192.168.1.62:7777
>>
>> Bruce
>> ----- Original Message -----
>> From: "Sunil Mushran"<sunil.mush...@oracle.com>
>> To: "B Leggett"<blegg...@ngent.com>
>> Cc: ocfs2-users@oss.oracle.com
>> Sent: Wednesday, June 29, 2011 2:42:08 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>
>> 1.2.1? That's 5 years old. We've had a few fixes since then. ;)
>>
>> You have to catch the oops trace to figure out the reason. One way
>> to get it is by using netconsole. Check the sles10 docs to see how
>> to configure netconsole. Or, whatever is recommended for capturing
>> the oops log in that release.
>>
>> On 06/29/2011 11:28 AM, B Leggett wrote:
>>> Hi,
>>> I am running OCFS2 1.2.1 on SLES 10, just the stuff right out
>>> of the box. This is a 3 node cluster that's been running for 2 years
>>> with just about zero modification. The storage is a high end SAN and
>>> the transport is iscsi. We went two years without an issue and all
>>> of a sudden node 1 in the cluster keeps crashing. I have never had
>>> to troubleshoot OCFS2, so I started with what I could control.
>>>
>>> I checked /var/log/messages and nothing there suggests a problem. I
>>> replaced hardware, going as far as popping the scsi drives out
>>> and putting them in another server and trying it with all new
>>> hardware.
>>> The problem still persists.
>>>
>>> I had the network team check the iscsi port on the private iscsi
>>> network and they are not seeing errors.
>>>
>>> I've checked the few OCFS2 settings in play and they all look good.
>>>
>>> My question to the group is how do I continue troubleshooting this
>>> issue? I'm not aware of any native logs etc. to reference. I would
>>> appreciate any help that gets this diagnosis moving to a solution.
>>>
>>> Thanks,
>>> Bruce
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users@oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users

--
>> XLhost.de ® - Web hosting from supersmall to eXtra Large <<

XLhost.de GmbH
Jürgen Herrmann, Managing Director
Boelckestrasse 21, 93051 Regensburg, Germany

Managing Director: Jürgen Herrmann
Registered under: HRB9918
VAT identification number: DE245931218

Phone: +49 (0)800 XLHOSTDE [0800 95467833]
Fax: +49 (0)800 95467830
Web: http://www.XLhost.de
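
The observation in the reply that the oopses all look different can be checked mechanically once the netconsole capture is saved as text. The following is only a minimal sketch under that assumption: it strips mail quote markers, then collects the faulting symbol from each "EIP is at" line and the address from each "Unable to handle kernel ..." line. The name `summarize_oops` and both regular expressions are hypothetical helpers written for this thread, not part of any OCFS2 or kernel tooling. Scattered, unrelated fault addresses are at most a hint toward bad RAM; a memory test is still the real check.

```python
import re

# Faulting function name from lines such as "EIP is at poll_freewait+0xd/0x3a".
EIP_RE = re.compile(r"EIP is at (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Fault address; \s+ lets the address sit on the next (wrapped) line.
FAULT_RE = re.compile(
    r"Unable to handle kernel .+? at virtual address\s+([0-9a-f]{8})"
)
# Leading mail quote markers such as ">> " or ">>> ".
QUOTE_RE = re.compile(r"^(?:>\s?)+")


def summarize_oops(text):
    """Return (faulting symbols, fault addresses) in order of appearance."""
    cleaned = "\n".join(QUOTE_RE.sub("", line) for line in text.splitlines())
    return EIP_RE.findall(cleaned), FAULT_RE.findall(cleaned)


# Lines copied from the quoted tries #2 to #4 in this thread.
sample = """\
>> EIP is at do_page_fault+0x8e/0x5f6
>> Unable to handle kernel paging request at virtual address 01110954
>> Unable to handle kernel NULL pointer dereference at virtual address
>> 00000030
>> Unable to handle kernel paging request at virtual address fffffffc
>> EIP is at poll_freewait+0xd/0x3a
"""

symbols, faults = summarize_oops(sample)
print("faulting symbols:", symbols)
print("fault addresses: ", faults)
```

Run against the full capture rather than the excerpt, this gives one list of symbols and one list of addresses to compare across crashes.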