This is rather long - so if you're replying to just one bit, please consider trimming the parts you're not responding to, to make everybody's life a little bit better!
Some time ago I wrote about a data corruption issue. I've still not managed to track it down, but I have two new datapoints (one inspired by a recent thread) and I'm hoping someone will have ideas for how I should move forward. By avoiding heavy disk load (and important tasks/jobs!) on the problem machine I've had no more data corruption. There are no errors/warnings anywhere. A part of me suspects a faulty SSD! I have new disks on order, so I can replace the existing disks soon if that's what it takes to fix this.

First datapoint, inspired by the recent thread. On the server that has no issues:

sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 512 bytes

These are GPT partitioned - a small BIOS boot partition, an EFI partition, and then a big "Linux filesystem" partition that is part of an mdadm RAID:

md0 : active raid1 sda3[3] sdb3[2]

On the server that has performance issues, and where I get occasional data corruption (both reading and writing) under heavy (disk) load:

sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 4096 bytes

I'm wondering if that physical sector size is the issue. All the partitions start on a 4K boundary, but the big partition is not an exact multiple of 4K. Inside the RAID is an LVM PV, so I think everything is 4K-aligned anyway except the filesystems themselves, and the "heavy load" filesystem that triggered the issue uses 4K blocks. But I don't know if something somewhere has "padding" so that the actual data doesn't actually start on a 4K boundary on the disk.

There are a LOT of partitions and filesystems in a complicated layered LVM setup, so it will be easier for me to check with instructions than to try to provide the data for someone else to check - if someone can give me instructions to work out exactly where the data ends up on the disk.
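In case it helps anyone answer, here's a sketch of the arithmetic I think is involved: sum the partition start, the md metadata data offset, and the LVM first-extent offset, all in 512-byte sectors, and check the result against 4K. The three numbers below are placeholders, not my real values - the comments show the commands I believe report each one:

```shell
#!/bin/sh
# Placeholder offsets, all in 512-byte sectors. Read the real values with:
#   parted /dev/sdb unit s print                    # partition start
#   mdadm --examine /dev/sdb3 | grep 'Data Offset'  # md data offset (within partition)
#   pvs -o pv_name,pe_start --units s               # LVM first extent (within md0)
PART_START=2048        # where the partition starts on the disk
MD_DATA_OFFSET=262144  # where md data starts within the partition
PE_START=2048          # where LVM extents start within the md device

# Absolute position on the raw disk where LVM-carried data begins:
DATA_START=$((PART_START + MD_DATA_OFFSET + PE_START))
DATA_BYTES=$((DATA_START * 512))

if [ $((DATA_BYTES % 4096)) -eq 0 ]; then
    echo "data starts at byte $DATA_BYTES: 4K-aligned"
else
    echo "data starts at byte $DATA_BYTES: NOT 4K-aligned"
fi
```

With these placeholder numbers the start works out 4K-aligned; each LV then begins some whole number of extents (typically 4 MiB each) further in, so if pe_start is aligned the LVs should be too.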
(All partitions are formatted with ext3.) The remaining setup is identical. The new disks are the same make and model as sdb in this server - I hope that's not a problem!

The second datapoint: my VMs all use iSCSI to provide their disks. Normally the VM runs on the same server as the iSCSI target, but today I did a kernel upgrade on a pair of VMs (the one on the "problem" machine took about twice as long), then "cross booted" them and purged the old kernel. I actually took timings here.

Booted on the problem machine, but physical disk still on the OK machine:

real    0m35.731s
user    0m5.291s
sys     0m4.677s

Booted on the good machine, but physical disk still on the problem machine:

real    0m57.721s
user    0m5.446s
sys     0m4.783s

I was running these at the same time, which I think rules out CPU issues. (I've done other tests that also suggest CPU/memory isn't the issue; it seems to be disk, cabling, etc.)

The SMART attributes from the problem machine:

sda:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           0
  9 Power_On_Hours          0x0032 096   096   000    Old_age  Always  -           18280
 12 Power_Cycle_Count       0x0032 099   099   000    Old_age  Always  -           54
177 Wear_Leveling_Count     0x0013 087   087   000    Pre-fail Always  -           129
179 Used_Rsvd_Blk_Cnt_Tot   0x0013 100   100   010    Pre-fail Always  -           0
181 Program_Fail_Cnt_Total  0x0032 100   100   010    Old_age  Always  -           0
182 Erase_Fail_Count_Total  0x0032 100   100   010    Old_age  Always  -           0
183 Runtime_Bad_Block       0x0013 100   100   010    Pre-fail Always  -           0
187 Uncorrectable_Error_Cnt 0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 067   049   000    Old_age  Always  -           33
195 ECC_Error_Rate          0x001a 200   200   000    Old_age  Always  -           0
199 CRC_Error_Count         0x003e 100   100   000    Old_age  Always  -           0
235 POR_Recovery_Count      0x0012 099   099   000    Old_age  Always  -           39
241 Total_LBAs_Written      0x0032 099   099   000    Old_age  Always  -           62154466086

sdb:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 100   100   000    Pre-fail Always  -           0
  5 Reallocate_NAND_Blk_Cnt 0x0032 100   100   010    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           18697
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           50
171 Program_Fail_Count      0x0032 100   100   000    Old_age  Always  -           0
172 Erase_Fail_Count        0x0032 100   100   000    Old_age  Always  -           0
173 Ave_Block-Erase_Count   0x0032 067   067   000    Old_age  Always  -           433
174 Unexpect_Power_Loss_Ct  0x0032 100   100   000    Old_age  Always  -           12
180 Unused_Reserve_NAND_Blk 0x0033 000   000   000    Pre-fail Always  -           45
183 SATA_Interfac_Downshift 0x0032 100   100   000    Old_age  Always  -           0
184 Error_Correction_Count  0x0032 100   100   000    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
194 Temperature_Celsius     0x0022 074   052   000    Old_age  Always  -           26 (Min/Max 0/48)
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           0
197 Current_Pending_ECC_Cnt 0x0032 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 100   100   000    Old_age  Always  -           1
202 Percent_Lifetime_Remain 0x0030 067   067   001    Old_age  Offline -           33
206 Write_Error_Rate        0x000e 100   100   000    Old_age  Always  -           0
210 Success_RAIN_Recov_Cnt  0x0032 100   100   000    Old_age  Always  -           0
246 Total_LBAs_Written      0x0032 100   100   000    Old_age  Always  -           63148678276
247 Host_Program_Page_Count 0x0032 100   100   000    Old_age  Always  -           1879223820
248 FTL_Program_Page_Count  0x0032 100   100   000    Old_age  Always  -           1922002147

Does anything leap out at anyone? Anything I should try next? Normally I try to avoid pairing disks bought at the same time from the same brand, but I'll give that a try if it will fix this.

Tim.
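P.S. One test I can run in the meantime, in case it's useful to others chasing similar problems: read the same region of the suspect disk twice and compare checksums. On an idle device, differing sums mean the drive, cable, or controller returned different data for the same sectors. This is only a sketch - the device path and read size in the example are placeholders, and on a real device you'd want to add iflag=direct to the dd commands (GNU dd) so the second read isn't served from the page cache:

```shell
#!/bin/sh
# Re-read stability test: read the same region twice and compare
# checksums. $1 = device (or file) to read, $2 = number of MiB.
check_stable_reads() {
    a=$(dd if="$1" bs=1M count="$2" 2>/dev/null | sha256sum)
    b=$(dd if="$1" bs=1M count="$2" 2>/dev/null | sha256sum)
    if [ "$a" = "$b" ]; then
        echo "stable"
    else
        echo "MISMATCH"
    fi
}

# Example (hypothetical device - run while the disk is otherwise idle):
# check_stable_reads /dev/sdb 256
```

A mismatch here would point at the drive/cabling side rather than the filesystem stack, which is exactly the distinction I'm trying to make.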