On Fri, 20 Aug 2021 at 06:13 -0000, Michael van Elst wrote:
> [snip]
> > Yes. It could be the drive itself, but I'd suspect the
> > backplane or cables. The PSU is also a possible candidate.
On Fri, 20 Aug 2021 at 09:31 +0200, Pouya Tafti wrote:
> Thanks. Retrying the replication in another bay now before
> opening up the box.

The replication progressed for a few hours and then came to a
halt without any errors (IO rates just dropped to zero).
zpool(8) history and other access operations (e.g. ls) entered
an unresponsive D (uninterruptible wait) state according to
ps(1), although zpool status kept reporting everything as ONLINE
with no errors. Operations on the other pool, which does not
include the new device, were similarly unresponsive. I was not
able to kill the processes or get a clean shutdown and had to
power-cycle the system.

Looking at the logs, this time the device wasn't detached by the
controller, but smartd(8) logged some read errors throughout the
day. These had also been showing up before the pool became
unresponsive, though. zpool status shows no errors and I did a
successful scrub of both pools (primary and backup) after
reboot, although the fact that zfs doesn't see the errors may
also have to do with the drive being hidden behind cgd(4).

I don't really know what to make of the errors, or of the fact
that zfs suddenly became unresponsive, including on the other
pool that does not contain this device.
# uname -a
NetBSD basil 9.2_STABLE NetBSD 9.2_STABLE (GENERIC) #0: Wed Jul 14 18:05:25 UTC 2021 mkre...@mkrepro.netbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC amd64

# cat /var/log/messages
[snip]
Aug 20 06:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 65 to 79
Aug 20 06:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 72 to 73
[more of the same]
Aug 20 11:34:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29
Aug 20 12:00:00 basil syslogd[822]: restart
Aug 20 12:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
[more of the same]
Aug 20 15:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 79
Aug 20 15:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 7 Seek_Error_Rate changed from 73 to 74
Aug 20 15:34:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 79 to 82
Aug 20 16:00:00 basil syslogd[822]: restart
Aug 20 16:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 84
[more of the same]
Aug 20 18:04:33 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 81 to 82
Aug 20 22:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72
Aug 20 22:04:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28

=> this was when I first noticed IO had stopped after
   transferring a little short of 1TB during the day.
Aug 21 02:34:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 73
Aug 21 02:34:34 basil smartd[1106]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 27
Aug 21 06:15:13 basil syslogd[791]: restart
Aug 21 07:45:36 basil smartd[1092]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84
[more of the same]
Aug 21 08:15:36 basil smartd[1092]: Device: /dev/rsd5d [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 84 to 79
Aug 21 08:15:36 basil smartd[1092]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 71
Aug 21 08:15:36 basil smartd[1092]: Device: /dev/rsd5d [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29
[more of the same]
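
In case it helps anyone reading along, here is a rough sketch of how the smartd lines above could be condensed into a per-attribute summary instead of being read one by one. It is not part of any tool: the function name `summarize` and the regular expression are made up for illustration, and the pattern only assumes the exact line shape seen in this log ("... Attribute: ID NAME changed from X to Y"). Other smartd message types are simply skipped.

```python
import re

# Hypothetical pattern, written against the smartd lines shown in this log only.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+smartd\[\d+\]:\s+"
    r"Device:\s+(?P<dev>\S+).*?SMART\s+(?:Prefailure|Usage)\s+Attribute:\s+"
    r"(?P<id>\d+)\s+(?P<name>\S+)\s+changed from\s+(?P<old>\d+)\s+to\s+(?P<new>\d+)"
)


def summarize(lines):
    """Collect min, max, last value and change count per (device, attribute)."""
    stats = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            # Not an attribute-change line (e.g. "syslogd: restart"); ignore it.
            continue
        key = (m["dev"], m["name"])
        old, new = int(m["old"]), int(m["new"])
        entry = stats.setdefault(
            key, {"min": old, "max": old, "last": old, "changes": 0}
        )
        entry["min"] = min(entry["min"], old, new)
        entry["max"] = max(entry["max"], old, new)
        entry["last"] = new
        entry["changes"] += 1
    return stats
```

Something like `summarize(open("/var/log/messages"))` would then give one entry per device and attribute, which makes it easier to see whether a value such as Raw_Read_Error_Rate is drifting in one direction or just bouncing around within a range.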