Too early to call. Management made the call: "This hardware seems to have been stable, let's use it."
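For anyone picking this thread up later: the netconsole setup Sunil suggested is just a module load. The syntax below is the stock netconsole format; the interface, addresses, and MAC are placeholders, not our real values.

    # On the crashing node, send kernel messages over UDP.
    # Format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac
    modprobe netconsole \
        netconsole=6666@10.0.0.11/eth0,[email protected]/00:16:3e:aa:bb:cc

    # On the logging host, capture the stream to a file
    # (option spelling differs between netcat flavors).
    nc -u -l -p 6666 | tee node1-console.log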
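The serial console logging I mention below is nothing special either: the usual console= kernel arguments, with conserver watching the terminal-server port on the other side. The tty and speed here are examples from a typical setup, not necessarily ours.

    # Appended to the kernel line in grub.conf on each node:
    console=tty0 console=ttyS0,115200n8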
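And since the only trace on the surviving nodes is the missed ocfs2 heartbeat: if these resets turn out to be o2cb self-fencing rather than hardware, the dead threshold lives in /etc/sysconfig/o2cb. A sketch, assuming the stock init script; 7 is the default mentioned below.

    # /etc/sysconfig/o2cb
    O2CB_ENABLED=true
    O2CB_BOOTCLUSTER=ocfs2
    # A node self-fences after (threshold - 1) * 2 seconds of missed
    # disk heartbeats, i.e. 12 seconds at the default of 7.
    O2CB_HEARTBEAT_THRESHOLD=7

    # Apply with the ocfs2 volumes unmounted on that node:
    service o2cb restart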
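For completeness, the multipath side has the shape of the stanza below. The vendor/product strings are what a 3Par array reports; the tunable values are illustrative only, and the real ones should come from the 3Par recommended config I refer to.

    # /etc/multipath.conf (illustrative values, not 3Par's official ones)
    devices {
        device {
            vendor                "3PARdata"
            product               "VV"
            path_grouping_policy  multibus    # spread I/O across all 4 paths
            path_checker          tur
            no_path_retry         12          # retries before failing I/O upward
            failback              immediate
        }
    }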
> -----Original Message-----
> From: Sunil Mushran [mailto:[EMAIL PROTECTED]]
> Sent: Monday, July 30, 2007 11:07
> To: Ulf Zimmermann
> Cc: [email protected]
> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
>
> So are you suggesting the reason was bad hardware?
> Or, is it too early to call?
>
> Ulf Zimmermann wrote:
>> I have serial console setup with logging via conserver, but so far no
>> further crash. We also swapped hardware around a bit (another 4 node
>> cluster with DL360g5 had been working without a crash for several
>> weeks; we swapped those 4 nodes in for the first 4 in the 6 node
>> cluster).
>>
>>> -----Original Message-----
>>> From: Sunil Mushran [mailto:[EMAIL PROTECTED]]
>>> Sent: Monday, July 30, 2007 10:21
>>> To: Ulf Zimmermann
>>> Cc: [email protected]
>>> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
>>>
>>> Do you have a netconsole setup? If not, set it up. That will capture
>>> the real reason for the reset. Well, it typically does.
>>>
>>> Ulf Zimmermann wrote:
>>>> We just installed a new cluster with 6 HP DL380g5, dual single-port
>>>> Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to
>>>> a 3Par S400. We are using the 3Par recommended config for the
>>>> Qlogic driver and device-mapper-multipath, giving us 4 paths to the
>>>> SAN. We do see some SCSI errors where DM-MP is failing a path after
>>>> getting a 0x2000 error from the SAN controller, but the path gets
>>>> put back in service in less than 10 seconds.
>>>>
>>>> This needs to be fixed, but I don't think it is what is causing our
>>>> reboots. 2 of the nodes rebooted once while idle (ocfs2 and
>>>> clusterware were running, no db), and one node rebooted once while
>>>> idle (another node was copying our 9i db with fscat from ocfs1 to
>>>> the ocfs2 data volume) and once while some load was put on it via
>>>> the upgraded 10g database. In all cases it is as if someone pressed
>>>> a hardware reset button: no kernel panic (at least not one leading
>>>> to a stop with a visible message), and we do get a dirty write
>>>> cache on the internal cciss controller.
>>>>
>>>> The only messages we get on the nodes are from when the crashed
>>>> node is already in reset and has missed its ocfs2 heartbeat (set to
>>>> the default of 7), followed later by crs moving the vip.
>>>>
>>>> Any hints on troubleshooting this would be appreciated.
>>>>
>>>> Regards, Ulf.
>>>>
>>>> --------------------------
>>>> Sent from my BlackBerry Wireless Handheld

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
