On a PVFS2 cluster (2 machines, 30 TB each) I'm having data corruption problems. I don't really understand what's happening, because it shows up at random and not on every file. The main symptom is that the md5 checksum of a file sometimes changes from one read to the next. Apparently it gets worse as time goes by: right after I restart the cluster it hardly occurs; after a week, most files come back as complete garbage. It occurs more frequently on very big files (from 4 to 30 GB) than on smaller ones.
(I'm using dd to read from the filesystem because md5sum reads very slowly from the pvfs mount.)

cluster2:/mnt/cluster/BAD# for i in 1 2 3 4 5 6 ; do dd if=ES14429 bs=1M 2>/dev/null | md5sum ; done
ca02ca0b5814bba6d8a9528d9f624c64  -
ca02ca0b5814bba6d8a9528d9f624c64  -
ca02ca0b5814bba6d8a9528d9f624c64  -
98c9d2849cadc9578cfa056fe620a070  -
ca02ca0b5814bba6d8a9528d9f624c64  -
ca02ca0b5814bba6d8a9528d9f624c64  -

As you can see, the file appears correct 4 or 5 times out of 6! It usually cycles through 2 or 3 different checksums, so the errors are somewhat consistent. What is going on? There isn't a single message coming from either the pvfs2 client or server, and nothing in dmesg. I don't understand!

When I run the same script on the other server (cluster1) the problem is much rarer. Actually, right now I keep checksumming this file from cluster1 again and again without any error.

Does anyone have any idea about what may be going on? Can it be a network error? RAM is ECC in these machines, and disks are set up in RAID-6.

-- 
----------------------------------------
Emmanuel Florac | Intellique
----------------------------------------

_______________________________________________
Pvfs2-users mailing list
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
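A possible next step for narrowing this down is to checksum the file in fixed-size chunks instead of as a whole, so that two passes can be compared to see which offsets come back different. The following is only a minimal sketch under stated assumptions: the function name chunk_sums and the 64 MiB default chunk size are made up for illustration, and it relies on bash, GNU stat, dd and md5sum being available on the client.

```shell
#!/bin/bash
# chunk_sums FILE [CHUNK_MB]
# Hypothetical helper (not part of PVFS2): prints "<chunk index> <md5>" for
# each CHUNK_MB-sized piece of FILE, reading through dd as in the loop above.
chunk_sums() {
    local file=$1
    local chunk_mb=${2:-64}            # chunk size in MiB (assumed default)
    local size chunks i sum

    size=$(stat -c %s "$file") || return 1
    # Number of chunks, rounding up so a trailing partial chunk is included.
    chunks=$(( (size + chunk_mb * 1024 * 1024 - 1) / (chunk_mb * 1024 * 1024) ))

    for (( i = 0; i < chunks; i++ )); do
        # With bs=1M, skip and count are both expressed in MiB.
        sum=$(dd if="$file" bs=1M skip=$(( i * chunk_mb )) count="$chunk_mb" 2>/dev/null \
              | md5sum | cut -d' ' -f1)
        printf '%d %s\n' "$i" "$sum"
    done
}
```

Usage would look something like this, with pass1 and pass2 as arbitrary output file names:

    chunk_sums ES14429 > /tmp/pass1
    chunk_sums ES14429 > /tmp/pass2
    diff /tmp/pass1 /tmp/pass2

If the differing chunk indexes repeat from run to run, that points at particular regions of the file rather than at random bit flips in transit, which may help tell a storage problem from a network one.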
