While I've seen my dirvish banks on reiserfs-formatted drives get corrupted and lock up a server, I had never seen it with ext2/3 drives. I THOUGHT I had just run across a case of ext2/3 file system corruption causing a server hang, but now I'm beginning to wonder.
The dirvish restore to the new server hardware went smoothly (mostly without hiccups -- there were a few drivers I had to compile for the new SCSI card and NICs) and I was looking forward to smooth sailing. Unfortunately that hasn't been the case. I have been having issues on this new hardware: attempting heavy write access to a USB drive containing one of my banks causes the server to lock up.
The file system contained errors, and since I had a backup copy of the vaults in that bank, I decided to back up my vault configs and reformat the drive fresh. I had previously reformatted reiserfs filesystems to 'fix' corruption that caused lockups, and I was surprised that I seemed to have the same issue with ext2/3. When I kicked off a reformat on the dirvish bank drive, the server wrote about 147 of its inode allocations and then just paused. At first the server was still pingable, but that quickly deteriorated. The numlock worked but the console was unresponsive. Use of the Magic SysRq commands allowed me to Sync, Unmount and reBoot the server mostly gracefully, but now I am wondering what technical situations could lead to a server hanging on USB disk access.
The 2.4.20 kernel that is running was stable on the old hardware (yes, I know... that was the OLD hardware). I fear that a kernel upgrade will be necessary on this new hardware, but I'm hoping someone else on the list has seen a similar problem and can offer suggestions. I am not looking forward to getting an upgraded kernel patched, compatible and ready to run Dead-Gateway-Detection (DGD), mppe, and uml processes, only to find that the problem is hardware related, BIOS setting related, or due to some other such cause.
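For anyone trying to narrow down a hang like this one: a process blocked on disk I/O normally shows up in uninterruptible sleep (state D), and the state letter is the third field of /proc/<pid>/stat. Below is a rough sketch of a watcher that lists such processes, so it can be left running on a second console while the USB drive is being written to. The function names (parse_stat, blocked_processes) are made up for illustration, and the script assumes a Python 3 interpreter is available on the box, which may well not be true of a 2.4-era install.

```python
import os


def parse_stat(line):
    """Return (pid, command, state) from the text of one /proc/<pid>/stat file."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    command = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat file was unreadable
        if state == "D":
            blocked.append((pid, command))
    return blocked


if __name__ == "__main__":
    for pid, command in blocked_processes():
        print(pid, command)
```

Run in a loop during a reformat or a heavy dirvish run, this would at least show whether the stuck processes are the ones touching the USB disk or something unrelated.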
Troubleshooting steps taken so far include removing the add-on USB PCI card and disabling SMP in the kernel (so that processes on the server would stop going into uninterruptible sleep, state D).
-- Richard
_______________________________________________
Dirvish mailing list
[email protected]
http://www.dirvish.org/mailman/listinfo/dirvish
