Hi,

On Thu, Oct 07, 2010 at 02:53:22PM -0400, Peter Olson via RT wrote:
> > [beuc - Wed Oct 06 15:21:47 2010]:
> > Hi,
> > 
> > On Wed, Oct 06, 2010 at 03:05:04PM -0400, Peter Olson via RT wrote:
> > > > [beuc - Wed Oct 06 14:46:46 2010]:
> > > > 
> > > > Hi,
> > > > 
> > > > Disk 'sdd' is not available anymore at colonialone.
> > > > 
> > > > Smartmontools detected an issue, and mdadm removed it from the
> > > > RAID array.
> > > > 
> > > > Can you investigate and possibly replace the failed disk?
> > > > 
> > > > Btw, did you receive the failure notifications?
> > > > 
> > > > Thanks,
> > > 
> > > We took the failed disk out of the RAID array because it appears to
> > > be a hard failure rather than a glitch (all partitions containing
> > > the disk degraded at the same time).
> > > 
> > > The array contained 4 members and now contains 3 members, all in
> > > service. We expect to replace it when we next make a trip to the
> > > colo.
> > > 
> > > colonialone:~# cat /proc/mdstat
> > > Personalities : [raid1]
> > > md3 : active raid1 sda6[0] sdb6[2] sdc6[1]
> > >       955128384 blocks [3/3] [UUU]
> > > 
> > > md2 : active raid1 sda5[0] sdb5[2] sdc5[1]
> > >       19534976 blocks [3/3] [UUU]
> > > 
> > > md1 : active raid1 sda2[0] sdb2[2] sdc2[1]
> > >       2000000 blocks [3/3] [UUU]
> > > 
> > > md0 : active raid1 sda1[0] sdb1[2] sdc1[1]
> > >       96256 blocks [3/3] [UUU]
> > > 
> > > unused devices: <none>
> > > 
> > > I'm worried that 'dmesg' shows lots of ext3 errors.
> > 
> > How can a failed disk in a RAID1x4 array cause *filesystem*-level
> > errors?
> > 
> > Do we need a fsck or something?
> 
> Here are some of the errors from dmesg:
> 
> [20930306.805714] ext3_orphan_cleanup: deleting unreferenced inode 86646
> [20930306.805714] ext3_orphan_cleanup: deleting unreferenced inode 85820
> [20930306.822520] ext3_orphan_cleanup: deleting unreferenced inode 86643
> [20930306.829335] ext3_orphan_cleanup: deleting unreferenced inode 86645
> [20930306.840398] EXT3-fs: dm-5: 30 orphan inodes deleted
> [20930306.840542] EXT3-fs: recovery complete.
> [20930307.015205] EXT3-fs: mounted filesystem with ordered data mode.
> 
> I found some discussion on the Net that says these messages are a
> normal byproduct of making an LVM snapshot. Are you doing this as part
> of your backup procedure?
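For context, a minimal snapshot-based backup of this shape (a sketch only; the volume names, mount point and rsync target are assumptions, not the contents of remote_backup.sh) shows why those messages are harmless: the snapshot captures the filesystem as if the machine had lost power at that instant, so mounting it, even read-only, replays the ext3 journal and deletes the orphan inodes recorded at snapshot time.

```shell
#!/bin/sh
# Hypothetical snapshot backup sketch -- vg0/data, /mnt/snap and the
# rsync destination are assumed names for illustration only.
set -e

VG=vg0
LV=data
SNAP="${LV}-snap"

# Copy-on-write snapshot of the live volume.
lvcreate --snapshot --size 1G --name "$SNAP" "/dev/$VG/$LV"

# Mounting triggers ext3 journal recovery on the snapshot, which is
# where the ext3_orphan_cleanup / "recovery complete" dmesg lines
# come from.
mount -o ro "/dev/$VG/$SNAP" /mnt/snap

rsync -a /mnt/snap/ savannah-backup.gnu.org:/backup/colonialone/

umount /mnt/snap
lvremove -f "/dev/$VG/$SNAP"
```

The recovery happens on the snapshot device (dm-5 in the dmesg output above), not on the live filesystem, so no fsck of the real volume is needed.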
Yes (cf. remote_backup.sh). Good to know it's not a disk error, thanks.

> I wrote a script to convert dmesg timestamps to wall clock. These
> messages are issued every morning between 07:58 and 08:15 (or
> sometimes as late as 08:27).

Yes, the backup from savannah-backup.gnu.org runs at 12:00 GMT.

Also, LVM is still looking for /dev/sdd7:

colonialone:~# lvs
  /dev/sdd7: read failed after 0 of 2048 at 0: Input/output error
  [...]

I suggest we plan a reboot this week.

-- 
Sylvain
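P.S. For the record, the timestamp conversion mentioned above can be sketched as follows (an assumed reconstruction, not Peter's actual script): a dmesg timestamp like [20930306.805714] counts seconds since boot, so wall-clock time is simply boot time plus the timestamp.

```shell
#!/bin/sh
# Sketch: convert a dmesg timestamp (seconds since boot) to wall clock.

# $1 = boot time as epoch seconds, $2 = dmesg timestamp (integer part)
dmesg_to_wallclock() {
    date -u -d "@$(( $1 + $2 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

# Boot time of the running system: current time minus uptime.
boot=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))

dmesg_to_wallclock "$boot" 20930306
```

Note the result drifts slightly if the kernel clock was adjusted (e.g. by NTP) since boot, which may explain the 07:58-08:27 spread.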
