Hi everyone, I could use some help getting rid of some bad blocks on a RAID-5 array on a PERC 5/i controller.
Let me start by describing the setup:

- 8x 500 GB SATA HDDs on a PERC 5/i with Dell firmware 5.2.2-0072
- the 8 drives are set up as RAID-5 and show up in RHEL 5.4 as /dev/sdc
- /dev/sdc is formatted with XFS (XFS support from the CentOS repository)
- /dev/sdc is mounted at /data
- for this discussion, let's call the disks 0:0, 0:1, 0:2, 0:3, 1:4, 1:5, 1:6, and 1:7

So, disk 1:6 was all of a sudden marked as "failed" and /dev/sdc became degraded. In case another disk went bad too, we decided to take a backup at that moment; this would give us a copy of the latest data instead of something a day or more old. Our backup methodology is just a simple rsync of /data to an external drive.

During the backup run, we noticed that one 4 GB file did not copy correctly. So we used dd_rescue to make a copy of it, and found that 8 blocks are not readable (blocks 5608072-5608079; the point is they are in a contiguous range, filesystem-wise).

At this point we were glad we had just backed everything up, since we were now concerned that the rebuild of 1:6 might not succeed if there were unreadable sectors somewhere. Just to see what would happen, and since 1:6 still seemed to be spinning, we decided to force it to rebuild without replacing it. To our surprise, the rebuild of 1:6 actually succeeded!?!? We then ran 'omconfig storage vdisk action=checkconsistency controller=0 vdisk=0' and it completed successfully! Does checkconsistency make hard drives remap bad sectors?

Now /dev/sdc was once again in good health, or so we thought, but we were still suspicious of 1:6. At that point we unmounted /data and ran xfs_check on it. It reported "block ?/? type unknown not expected", so we ran xfs_repair on /data. Everything seemed to complete okay and we cleanly mounted /data again. We went back to examine the 4 GB file: another dd_rescue on it gave the exact same results; 8 blocks in the exact same range as before did not read.
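As an aside, a quick sanity check on the units here, under the assumption that dd_rescue's block numbers are 512-byte units (its default hard block size; worth verifying on your version):

```python
# Hedged sanity check: assume dd_rescue's block numbers are 512-byte
# units (its default hard block size). The failing range then fits
# inside the 4 GB file; 4 KiB units would not.

BLOCK = 512
bad = range(5608072, 5608080)        # blocks 5608072-5608079 inclusive

print(len(bad) * BLOCK)              # 4096 bytes, one 4 KiB page/FS block
print(bad.start * BLOCK / 2**30)     # ~2.67 GiB into the file: plausible
print(bad.start * 4096 / 2**30)      # ~21.4 GiB: past the end of a 4 GB file
```

So with 512-byte units the numbers are at least self-consistent: the unreadable range is exactly one 4 KiB filesystem block, sitting about 2.67 GiB into the file.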
When we use rsync to copy this file, we get the following console error messages:

  end_request: I/O error, dev sdc, sector 6012984362
  end_request: I/O error, dev sdc, sector 6012984362
  end_request: I/O error, dev sdc, sector 6012984874
  end_request: I/O error, dev sdc, sector 6012984874

These messages make me think there are bad sectors on one or more of the disks (would you agree?). What I don't get is: we were already having this problem in degraded mode (when 1:6 was in the "failed" state), so how could 1:6 rebuild if there were read errors somewhere other than 1:6?

In the middle of investigating all this, disk 1:6 went into the "failed" state again. When we ran 'omreport storage pdisk controller=0', some of the fields for 1:6 were filled with garbage. This time we figured 1:6 was really toast and decided to replace it with a spare we had. We put in the spare drive and it began to rebuild. Again, we weren't sure it would rebuild successfully, since we think there are read errors somewhere else. But to our surprise, the new 1:6 rebuilt successfully and /dev/sdc was once again Status=Ok, State=Ready.

We went back to investigating the 4 GB file with the bad blocks. We tried another dd_rescue copy, but this time we got 16 bad blocks! The first 8 are exactly as before (5608072-5608079), but there were also blocks 5608584-5608591; the second range of 8 blocks is contiguous but separate from the first 8. Is the problem getting worse?

On the one hand, since the rebuild of 1:6 succeeded twice, this makes us think it is NOT a hard disk issue, but maybe an XFS issue. But after the xfs_repair, xfs_check says /data is in good condition. On the other hand, the "I/O error, dev sdc, sector XXXX" messages make us think it could be a hard disk issue.

Some thoughts? Advice?

-Bond

_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
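P.S. A speculative look at where those sectors would land in the stripe layout. This is illustrative only: it assumes a 64 KiB stripe element size (a common PERC 5/i default) and ignores parity rotation, so the "column" values are hypothetical and the real geometry may well differ.

```python
# Illustrative stripe math only: assumes a 64 KiB stripe element size
# and ignores parity rotation, so the "column" values are hypothetical.

STRIPE_SECTORS = 64 * 1024 // 512        # 128 sectors per 64 KiB element
DATA_DISKS = 7                           # 8-disk RAID-5: 7 data + 1 parity

# The two failing spots are 256 KiB apart, from both vantage points:
print((5608584 - 5608072) * 512)         # dd_rescue blocks: 262144
print((6012984874 - 6012984362) * 512)   # kernel sectors:   262144

def locate(lba):
    element = lba // STRIPE_SECTORS      # which stripe element on the array
    offset = lba % STRIPE_SECTORS        # sector offset within that element
    column = element % DATA_DISKS        # data column, ignoring rotation
    return element, offset, column

for lba in (6012984362, 6012984874):
    print(lba, locate(lba))
```

Both errors sit at the same offset (sector 42) within their stripe elements, but in different data columns. If, and only if, the geometry assumptions hold, that would point at media errors on more than one member disk rather than just 1:6, which would fit the puzzle of 1:6 rebuilding while reads still fail.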
