On Wed, 2006-09-27 at 15:18 -0400, Peter Sjoberg wrote:
> On Wed, 2006-09-27 at 08:46 -0400, Carl Hartung wrote:
> > On Wednesday 27 September 2006 04:56, Peter Sjoberg wrote:
> > > Normally I would agree, but this happens to be my primary server, so I
> > > put some extra $$ on it and have matched pairs of Kingston DDR
> > > PC3200/ECC/REG (KVR400D8R3AK2/1G) modules.
> >
> > I hope you didn't pay /too/ much of a premium, Peter. ;-)
>
> At the time I got them, no-name 512MB DDR3200 (no ECC or REG) was around
> $65-$75 or $260-$300. I got two kits (2x 2x512 = 2G) for $320, so it wasn't
> that much extra.
>
> > The 'VR' in that part number stands for "Value RAM", which is Kingston's
> > "industry standard" product line... meaning essentially 'generic' designs
> > built using chips purchased on the open market so they can sell
> > competitively 'down market'... as compared to their more expensive
> > premium product line.
>
> I didn't know the whole story, but I guessed that "Value" wasn't the same
> as Premium.
>
> > In any event, even buying premium parts from a well-known and established
> > manufacturer only *improves* the likelihood of a successful outcome. It
> > *doesn't* guarantee that every part will operate perfectly fresh from the
> > factory. There are more reasons for this than I have room or time to
> > elaborate on here.
> >
> > > Since ECC is enabled I would
> > > expect it to complain somewhere if it discovered ECC errors.
> >
> > ECC only covers selected regions of a much larger spectrum of fault
> > possibilities. I've been in the industry for almost 20 years and actually
> > worked for a high-end memory manufacturer in Silicon Valley before
> > production moved offshore... so I know a little bit about how these
> > things work. ;-)
>
> One possibility is that all these BIOS ECC parameters are not optimal.
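For what it's worth, on 2.6 kernels the OS side can report ECC events through the EDAC sysfs interface, provided a driver for the memory controller is loaded (for an Opteron board like this one that would be the k8/amd64 EDAC module; whether this particular kernel ships it is an assumption). A minimal sketch for checking the counters:

```shell
#!/bin/sh
# Sketch: read corrected/uncorrected ECC error counts from the standard
# EDAC sysfs layout.  mc0 is the first memory controller; whether an
# EDAC driver is available for a given kernel is an assumption.
MC=/sys/devices/system/edac/mc/mc0
if [ -d "$MC" ]; then
    echo "corrected ECC errors:   $(cat "$MC/ce_count")"
    echo "uncorrected ECC errors: $(cat "$MC/ue_count")"
else
    echo "no EDAC memory controller found (driver not loaded?)"
fi
```

A steadily climbing ce_count would point at a marginal module even when nothing ever reaches the logs.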
> I don't know enough about ECC scrub, direction, 4-bit, DCACHE etc., so I
> left it all at its defaults and just verified that ECC was enabled.
> The mobo is a Tyan K8SE/S2892 and I have installed the latest BIOS
> version.
>
> > > Also, as a test, I tried to provoke the system to hang by compiling
> > > the kernel in a loop; it worked fine for 35h.
> >
> > This is a compelling factor and you could be 100% right. However, given
> > the classic nature of the symptoms, if /I/ were managing this problem
> > I'd wholesale swap the modules out with a premium set from another
> > established manufacturer... even a set borrowed from another machine,
> > just as a test.
>
> The nature of it is that it can go anywhere from 12h to 18 days between
> hangs, making it hard to declare it fixed. One downside is that it seems
> to happen more often lately; I haven't had uptime over 3 days for a while,
> which of course points towards bad hw (since sw hasn't changed).
> Swapping I could do, but I have nothing to swap with, and buying more
> memory as a test is not an option (but if it gets approved by the wife
> department I might add 4x1G to it).
>
> > Your time and this system's downtime *must* be costing a lot more than
> > the delta in price.
>
> Nope, this is a home server. I have my mail, nfs, /home, ldap, samba etc.
> there, but when it's down the only impact is on the rest of the family
> and that the mails get queued on a different server.
>
> > If you try this and the problem goes away, Kingston might even credit
> > your existing purchase towards an upgrade to their premium line so they
> > don't lose the sale. This is particularly true if they think you'll end
> > up returning the parts for cause... which could happen if another brand
> > solves the problem.
>
> I'm planning on letting it run memcheck for a while but don't expect to
> find anything there.
>
> > Just some things to think about... YMMV and all that.
>
> Thanks anyway.
>
> What I'm looking for is other ways to diagnose a system hang.
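The "compile the kernel in a loop" stress test mentioned above can be scripted along these lines (the source path and pass count are placeholders; any large source tree will do):

```shell
#!/bin/sh
# Sketch of a rebuild-in-a-loop stress test: each pass rebuilds from
# clean, loading CPU, RAM and disk at the same time.  KSRC and the
# pass count are placeholders, not values from the original test.
KSRC=/usr/src/linux
passes=10
i=0
while [ "$i" -lt "$passes" ]; do
    i=$((i + 1))
    echo "=== stress pass $i ==="
    # stop at the first failed build (or missing tree)
    ( cd "$KSRC" && make clean >/dev/null && make -j2 >/dev/null ) || break
done
echo "stopped after pass $i"
```

Left running overnight this exercises roughly the same load mix as the 35h test, though as the thread shows, a stress test passing doesn't rule much out when the hang only strikes every few days.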
> If it was some kind of memory or hw error, I would expect the way it goes
> down to be a bit different each time, but at least the last 5 times it has
> been just about exactly the same: a dead hang, and the only sysrq that
> works is "b". To me this points a little towards some OS issue. It could
> be the raid drivers (running sw raid5), the NIC bonding drivers, or one of
> the many other things that are running, and I would like to know at least
> what area to look in. The only thing I have excluded right now is vmware
> server. I was running that, but at one point I removed it from the
> startup, and after the next crash it didn't start (leaving the kernel
> untainted) but the server still died.
>
> With the latest BIOS update came a watchdog function, but I didn't find
> any driver for it, so at the moment I have to turn it off or it will
> reboot after the given watchdog time.

Just an update: I found a watchdog program and installed it. I didn't have
to wait long before it got tested, and it worked as designed: it rebooted
the system when it hung.

Looking at the timing of the hangs, I started to see that they often
happened during heavy disk activity (backups, some mirror scripts etc.), so
I started to suspect something in the disk drivers (running sw raid1 &
raid5).

I was running the latest kernel, but then a new one came out
(2.6.16.21-0.25-smp) so I upgraded, especially after seeing in the
changelog that some raid fixes were included. Since I upgraded the kernel
the system has now been running for 7 days, while previously it froze 3-5
times/week. I even pushed a few extra backups, mirror scripts etc. and it
all seems to stay up.
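For anyone curious what such a watchdog program boils down to: the core is just a process that keeps writing to /dev/watchdog. A minimal sketch, assuming a watchdog driver (the board's hardware one, or the generic softdog module) provides the device:

```shell
#!/bin/sh
# Sketch of a minimal watchdog feeder.  Assumes some watchdog driver
# (hardware or the softdog module) has created /dev/watchdog.
WDT=/dev/watchdog
if [ -c "$WDT" ]; then
    # As long as this loop keeps writing, the timer is reset.  If the
    # machine hard-hangs, the writes stop and the watchdog reboots it
    # after its timeout.  Note: on most drivers, closing the device
    # without first writing 'V' leaves the timer armed.
    while :; do
        printf '.' > "$WDT"
        sleep 10
    done
else
    echo "$WDT not present - load a watchdog driver (e.g. softdog) first"
fi
```

Real watchdog daemons add health checks (load average, disk reachability) before each feed, so the box also reboots on soft lockups, not just total hangs.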
I'll leave the watchdog stuff running, and I learned a bit about
troubleshooting, so it's not all in vain.

> > hth & regards,
> >
> > Carl
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
