On Wed, 2006-09-27 at 15:18 -0400, Peter Sjoberg wrote:
> On Wed, 2006-09-27 at 08:46 -0400, Carl Hartung wrote:
> > On Wednesday 27 September 2006 04:56, Peter Sjoberg wrote:
> > > Normally I would agree, but this happens to be my primary server, so I
> > > put some extra $$ on it and have matched pairs of Kingston DDR
> > > PC3200/ECC/REG (KVR400D8R3AK2/1G) modules.
> >
> > I hope you didn't pay /too/ much of a premium, Peter. ;-)
>
> At the time I got them, no-name 512MB DDR3200 (no ECC or REG) was around
> $65-$75 or $260-$300. I got two kits (2x 2x512 = 2G) for $320, so it wasn't
> that much extra.
>
> > The 'VR' in that part number stands for "Value RAM", which is Kingston's
> > "industry standard" product line... meaning essentially 'generic' designs
> > built using chips purchased on the open market so they can sell
> > competitively 'down market'... as compared to their more expensive
> > premium product line.
>
> I didn't know the whole story, but I guessed that "Value" wasn't the same
> as Premium.
>
> > In any event, even buying premium parts from a well-known and established
> > manufacturer only *improves* the likelihood of a successful outcome. It
> > *doesn't* guarantee that every part will operate perfectly fresh from the
> > factory. There are more reasons for this than I have room or time to
> > elaborate on here.
> >
> > > Since ECC is enabled I would
> > > expect it to complain somewhere if it discovered ECC errors.
> >
> > ECC only covers selected regions of a much larger spectrum of fault
> > possibilities. I've been in the industry for almost 20 years and actually
> > worked for a high-end memory manufacturer in Silicon Valley before
> > production moved offshore... so I know a little bit about how these
> > things work. ;-)
>
> One possibility is that all these BIOS ECC parameters are not optimal.
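For what it's worth, on 2.6 kernels the OS side can report ECC events through the EDAC sysfs interface, provided a driver for the memory controller is loaded (for an Opteron board like this one that would be the k8/amd64 EDAC module; whether this particular kernel ships it is an assumption). A minimal sketch for checking the counters:

```shell
#!/bin/sh
# Sketch: read corrected/uncorrected ECC error counts from the standard
# EDAC sysfs layout.  mc0 is the first memory controller; whether an
# EDAC driver is available for a given kernel is an assumption.
MC=/sys/devices/system/edac/mc/mc0
if [ -d "$MC" ]; then
    echo "corrected ECC errors:   $(cat "$MC/ce_count")"
    echo "uncorrected ECC errors: $(cat "$MC/ue_count")"
else
    echo "no EDAC memory controller found (driver not loaded?)"
fi
```

A steadily climbing ce_count would point at a marginal module even when nothing ever reaches the logs.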
> I don't know enough about ECC scrub, direction, 4-bit, DCACHE etc., so I
> left it all at its defaults and just verified that ECC was enabled.
> The mobo is a Tyan K8SE/S2892 and I have installed the latest BIOS
> version.
>
> > > Also, as a test, I tried to provoke the system to hang by compiling
> > > the kernel in a loop; it worked fine for 35h.
> >
> > This is a compelling factor and you could be 100% right. However, given
> > the classic nature of the symptoms, if /I/ were managing this problem
> > I'd wholesale swap the modules out with a premium set from another
> > established manufacturer... even a set borrowed from another machine,
> > just as a test.
>
> The nature of it is that it can go anywhere from 12h to 18 days between
> hangs, making it hard to declare it fixed. One downside is that it seems
> to happen more often lately; I haven't had uptime over 3 days for a while,
> which of course points towards bad hw (since sw hasn't changed).
> Swapping I could do, but I have nothing to swap with, and buying more
> memory as a test is not an option (but if it gets approved by the wife
> department I might add 4x1G to it).
>
> > Your time and this system's downtime *must* be costing a lot more than
> > the delta in price.
>
> Nope, this is a home server. I have my mail, nfs, /home, ldap, samba etc.
> there, but when it's down the only impact is on the rest of the family
> and that the mails get queued on a different server.
>
> > If you try this and the problem goes away, Kingston might even credit
> > your existing purchase towards an upgrade to their premium line so they
> > don't lose the sale. This is particularly true if they think you'll end
> > up returning the parts for cause... which could happen if another brand
> > solves the problem.
>
> I'm planning on letting it run memcheck for a while but don't expect to
> find anything there.
>
> > Just some things to think about... YMMV and all that.
>
> Thanks anyway.
>
> What I'm looking for is other ways to diagnose a system hang.
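The "compile the kernel in a loop" stress test mentioned above can be scripted along these lines (the source path and pass count are placeholders; any large source tree will do):

```shell
#!/bin/sh
# Sketch of a rebuild-in-a-loop stress test: each pass rebuilds from
# clean, loading CPU, RAM and disk at the same time.  KSRC and the
# pass count are placeholders, not values from the original test.
KSRC=/usr/src/linux
passes=10
i=0
while [ "$i" -lt "$passes" ]; do
    i=$((i + 1))
    echo "=== stress pass $i ==="
    # stop at the first failed build (or missing tree)
    ( cd "$KSRC" && make clean >/dev/null && make -j2 >/dev/null ) || break
done
echo "stopped after pass $i"
```

Left running overnight this exercises roughly the same load mix as the 35h test, though as the thread shows, a stress test passing doesn't rule much out when the hang only strikes every few days.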
> If it was some kind of memory or hw error, I would expect the way it goes
> down to be a bit different each time, but at least the last 5 times it has
> been just about exactly the same: a dead hang, and the only sysrq that
> works is "b". To me this points a little towards some OS issue. It could
> be the raid drivers (running sw raid5), the NIC bonding drivers, or one of
> the many other things that are running, and I would like to know at least
> what area to look in. The only thing I have excluded right now is vmware
> server. I was running that, but at one point I removed it from the
> startup, and after the next crash it didn't start (leaving the kernel
> untainted) but the server still died.
>
> With the latest BIOS update came a watchdog function, but I didn't find
> any driver for it, so at the moment I have to turn it off or it will
> reboot after the given watchdog time.

Just an update: I found a watchdog program and installed it. I didn't have
to wait long before it got tested, and it worked as designed: it rebooted
the system when it hung.

Looking at the timing of the hangs, I started to see that they often
happened during heavy disk activity (backups, some mirror scripts etc.), so
I started to suspect something in the disk drivers (running sw raid1 &
raid5).

I was running the latest kernel, but then a new one came out
(2.6.16.21-0.25-smp) so I upgraded, especially after seeing in the
changelog that some raid fixes were included. Since I upgraded the kernel
the system has now been running for 7 days, while previously it froze 3-5
times/week. I even pushed a few extra backups, mirror scripts etc. and it
all seems to stay up.
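For anyone curious what such a watchdog program boils down to: the core is just a process that keeps writing to /dev/watchdog. A minimal sketch, assuming a watchdog driver (the board's hardware one, or the generic softdog module) provides the device:

```shell
#!/bin/sh
# Sketch of a minimal watchdog feeder.  Assumes some watchdog driver
# (hardware or the softdog module) has created /dev/watchdog.
WDT=/dev/watchdog
if [ -c "$WDT" ]; then
    # As long as this loop keeps writing, the timer is reset.  If the
    # machine hard-hangs, the writes stop and the watchdog reboots it
    # after its timeout.  Note: on most drivers, closing the device
    # without first writing 'V' leaves the timer armed.
    while :; do
        printf '.' > "$WDT"
        sleep 10
    done
else
    echo "$WDT not present - load a watchdog driver (e.g. softdog) first"
fi
```

Real watchdog daemons add health checks (load average, disk reachability) before each feed, so the box also reboots on soft lockups, not just total hangs.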
I'll leave the watchdog stuff running, and I learned a bit about
troubleshooting, so it's not all in vain.

> > hth & regards,
> >
> > Carl
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
