Does the server have redundant power supplies? Have you tried replacing them at all? Have you tried changing the UPS that the server is on?
-----Original Message-----
From: Szlucha, Chris [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 20, 2002 9:01 AM
To: NT 2000 Discussions
Subject: Spontaneous Reboots on Dell w/Win2K with NO TRACE! Absolute Mystery!

Ok, here is something that we've been working on that has gone all the way up to Michael Dell himself that I'd like some input on from you guys.

Has anyone seen spontaneous reboots on Dell systems where there is absolutely no trace left anywhere in either the Windows environment or the hardware environment (Dell ESM logs)? Dell's "top engineers" and 5 of us here at the SEC have been working on it for literally 3 months, almost every day, to no avail. Here's the configuration:

Hardware:
Dell 2550
Dual PIII 1133 MHz (BIOS v A05)
2 GB RAM
PERC 3 PCI RAID Controller
4 x 72 GB Fujitsu Hard Drives
Intel 8255x-based Integrated Fast Ethernet NIC
DRAC-II card
External PowerVault 128T LTO Tape Library connected via Adaptec AIC-7899 PCI SCSI card

Software:
Windows 2000 Server w/ SP2
Terminal Services for remote admin
Veritas Backup Exec v8.6
Remotely Anywhere
Dell Server Agents as follows-
  Dell OpenManage Server Agent v. 4.3.0 (BLD_2922)
  DRAC-II Server Monitoring SNMP MIB Agent v. 2.0, Firmware v. 2.40
  Dell OpenManage Array Manager v. 3.0
Network Associates NetShield 4.5, current engine and DATs
Executive Software Network Undelete v. 2
WQuinn Associates Storage CeNTral v.4.1 build 461

We use these servers only for file and print serving with no other funny software installed and no "unnecessary" services running. All flash-able components have been flashed to the current level and drivers are up-to-date. And during the installation of Veritas Backup Exec, we have the Veritas drivers installed for the backup devices.

These servers reboot at random and leave no trace in the event logs, nothing in the hardware logs about any hardware issues. There is no blue screen and no Dr. Watson events, no system dumps, literally NOTHING to trace this to anything or give us any indication as to where to start looking.

We have picked apart our build process, which BTW works absolutely perfectly on a Compaq server, and Dell has even taken one of our rebooting systems back to their labs for analysis, again to no avail. The failure rate for us was somewhere around 75-80% on these machines. It seemed for a while to be hardware, as we could sometimes replace the motherboard and memory and have the systems work again. But then we had repeat performances of the reboots. Systems will reboot sometimes immediately, sometimes they run for a month and a half before rebooting. We have stress-tested these systems using 2 or 3 different stress test packages, and these reboots haven't replicated in the lab but once.

This is a real head-scratcher. Any thoughts? And remember, the easy things have more than likely already been thought of and tried, but I'm willing to entertain any ideas (and so is Dell at this point).

Thanks all!
-Chris

------
You are subscribed as [EMAIL PROTECTED]
Archives: http://www.swynk.com/sitesearch/search.asp
To unsubscribe send a blank email to %%email.unsub%%
