On 6/1/26 14:15, Charles Curley wrote:
> Some additional testing. Suspecting a bad hard drive, I ran more extended tests on all four members of the RAID array. One showed problems:
>
> Error 1 [0] occurred at disk power-on lifetime: 6777 hours (282 days + 9 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 01 00 00 00 00 00 00 40 00  Error: UNC 1 sectors at LBA = 0x00000000 = 0
>
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 00 01 00 00 00 00 00 00 40 00     00:08:36.585  READ DMA EXT
>   ec 00 00 00 00 00 00 00 00 00 00 00 00     00:08:31.545  IDENTIFY DEVICE
>   b0 00 da 00 00 00 00 00 c2 4f 00 00 00     00:08:31.542  SMART RETURN STATUS
>   b0 00 d2 00 f1 00 00 00 c2 4f 00 00 00     00:08:31.541  SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE
>   ec 00 00 00 00 00 00 00 00 00 00 00 00     00:08:31.541  IDENTIFY DEVICE
>
> SMART Extended Self-test Log Version: 1 (1 sectors)
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%      6756         -
> # 2  Extended offline    Completed without error       00%      6573         -
> # 3  Extended offline    Completed without error       00%       102         -
> # 4  Short offline       Completed without error       00%        96         -
>
> So I did the obvious: I failed and removed the drive from the array. The problem still showed up, but not as many fails in the same data set.
>
> I have since added the drive back to the array, and am testing the array now.
>
> mdadm --monitor --test --oneshot /dev/md0
>
> I begin to wonder if I have a bad motherboard.

Up until 2019, I was using Debian GNU/Linux on desktop hardware as a file server. When I upgraded to a server motherboard and ECC memory, I started seeing DMA errors. During troubleshooting, I realized that I had been collecting SATA parts since the days of SATA I 1.5 Gbps -- HBAs, cables, racks, and drawers. My file server had a mix of various known and unknown parts, including red SATA cables (red dye can cause copper conductors to oxidize into dust). So, I replaced all of the unknown and obsolete parts with new parts clearly rated and marked for SATA III 6 Gbps. The disk problems went away. When I wanted more HDDs, I bought SAS 6 Gbps HBAs, cables, and HDDs.
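
One way to spot a marginal cable or backplane is to check what speed each SATA link actually negotiated. The following is a minimal sketch, assuming a Linux kernel that exposes libata's ata_link class under /sys/class/ata_link; the function name show_sata_link_speeds and its optional directory argument are my own illustration, not part of any tool:

```shell
#!/bin/sh
# Print the negotiated speed of every SATA link the kernel knows about.
# Assumption: libata exposes /sys/class/ata_link/link*/sata_spd.
# show_sata_link_speeds is a hypothetical helper; the optional first
# argument overrides the sysfs directory.
show_sata_link_speeds() {
    base="${1:-/sys/class/ata_link}"
    for link in "$base"/link*; do
        # Skip the unexpanded glob, or links without a speed file.
        [ -r "$link/sata_spd" ] || continue
        printf '%s: %s\n' "${link##*/}" "$(cat "$link/sata_spd")"
    done
}
```

Calling show_sata_link_speeds with no argument reads the live sysfs tree. A link that reports 1.5 or 3.0 Gbps when both the port and the drive are rated for 6.0 Gbps may be worth a cable swap. Separately, smartctl -A reports attribute 199 (UDMA_CRC_Error_Count), which tends to climb when the cabling is at fault.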
Similarly, most of the memory problems I encountered were caused by incompatibility between the motherboard and the memory module(s). I suggest documenting your motherboard, documenting your memory modules, and doing the homework. Memory manufacturers typically have a search feature on their web site that will produce a list of compatible memory modules given a computer or motherboard make and model. eBay sellers often include the computer/motherboard make/model for pulled memory modules. And, you can always STFW.
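
For the documenting step, dmidecode can usually report what the firmware knows about the board and the installed modules. A rough sketch follows; summarize_memory is a hypothetical helper name, and the field list is simply the set I would write down, so adjust to taste:

```shell
#!/bin/sh
# Reduce `dmidecode -t memory` output to the fields worth recording.
# summarize_memory is a hypothetical helper; it filters dmidecode text
# on stdin, so it also works on a saved copy of the output.
summarize_memory() {
    # Anchor at the start of the line so that "Bank Locator:" and
    # "Type Detail:" are not matched by "Locator:" and "Type:".
    grep -E '^[[:space:]]*(Error Correction Type|Locator|Size|Type|Speed|Manufacturer|Part Number):'
}
```

Run as root, dmidecode -t baseboard prints the motherboard make and model, and dmidecode -t memory | summarize_memory prints one group of lines per slot. The part numbers can then be checked against the memory manufacturer's compatibility list. How complete the output is depends on what the firmware fills in.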
For a server, I prefer and recommend workstation/server motherboards, ECC memory, ext4/UFS for the system disk, and ZFS RAID10 for data.
David

