Hi all,
Apologies if this is a bit off topic for some, but all of my music is
on this device and, well, I have run out of ideas to diagnose the
fault. I'm hoping someone here may have seen something like this in
their travels.
I have an FC7 server running kernel 2.6.23.17-88.fc7 (just updated).
I have an md RAID 5 array of four 300GB volumes.
The problem is that data is coming off this md device inconsistently.
For example, if I run md5sum on, say, 1000 files and then run it again
on the same 1000 files, 20 of them will come back different. If I run
it a third time, 28 will be different, and generally they will not be
the same files that differed in the first run. If a file happens to
differ in both the second and third runs, it will show three different
checksum values.
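For reference, the comparison is roughly this (the music path is just an example):
Code:
--------------------
# checksum the same 1000 files twice and count how many differ
cd /mnt/md1/music                                # example mount point
find . -type f | sort | head -n 1000 > /tmp/files.txt
xargs -d '\n' md5sum < /tmp/files.txt > /tmp/run1.md5
xargs -d '\n' md5sum < /tmp/files.txt > /tmp/run2.md5
diff /tmp/run1.md5 /tmp/run2.md5 | grep -c '^<'  # files whose checksum changed
--------------------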
From a purely abstract viewpoint, I am quite curious as to what is
going on. From a personal viewpoint I am fairly worried, as I
certainly don't have everything on this device backed up, and more to
the point, I have no idea when this issue first started.
Diagnosis
I have run a lot of tests, but I am loath to use any of their repair /
write options, as I would actually be corrupting what I believe to be
consistent data on the array.
For example, running a consistency check on the array
Code:
--------------------
echo check > /sys/block/md1/md/sync_action
--------------------
and then waiting a couple of hours. To check the progress:
Code:
--------------------
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdc1[0] sdf1[3] sde1[2] sdd1[1]
879100416 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
[=====>...............] check = 25.6% (75090008/293033472) finish=128.4min
speed=28286K/sec
--------------------
Once it has finished,
Code:
--------------------
cat /sys/block/md1/md/mismatch_cnt
--------------------
gives back around 10000 blocks with mismatches. Each time I run this I
get a different count. For those who are interested, each block is
128KB (from what I have read), so it is a fair whack of data. This
test compares the actual data on the array against the parity data.
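If that reading of the units is right, a quick back-of-the-envelope sum (assuming mismatch_cnt really is counted in 128KB chunks, which I'm not certain of) puts the suspect data at over a gigabyte:
Code:
--------------------
# ~10000 mismatched blocks x 128KB each, in MB
echo $((10000 * 128 / 1024))   # => 1250, i.e. roughly 1.2GB of suspect data
--------------------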
My theory
Under high load, or past some load threshold, errors are being
introduced into the data coming off the array. I have noticed when
checking larger files that whenever there is a corruption, the reading
process has stuttered while pulling data from the array.
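A crude way to test this is to hammer one large file with repeated reads, dropping the page cache each pass so the data really comes off the array, and see whether the checksum ever changes (the file path is just an example; dropping caches needs root):
Code:
--------------------
# read the same large file 20 times, forcing real array reads each pass
for i in $(seq 1 20); do
    echo 3 > /proc/sys/vm/drop_caches            # flush cached pages (root only)
    md5sum /mnt/md1/music/some-large-file.flac
done | sort | uniq -c                            # more than one line = corruption
--------------------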
The big question is why, and what is causing it?
I'd appreciate any other ideas / angles.
Cheers
Steve
--
mctubster