Duplicate, please ignore. Apologies for the noise.
On Fri, 20 Aug 2021 at 06:34 +0200, Pouya Tafti wrote:
> After a recent drive failure in my primary zfs pool, I set
> up a secondary pool on a cgd(4) device on a single new sata
> hdd (zfs on gpt on cgd on gpt on a 4TB Seagate Ironwolf
> hdd) to back up the primary.
>
> I initially scrubbed the entire disk without apparent
> incident using a temporary cryptographic device and dd(1)
> as in the cgdconfig(8) man page.
>
> Since then, twice already, in the past two days, the drive
> has failed in the same way and been detached, once on the
> very first zfs(8) create operation, and the second time
> (after a reboot) after having written hundreds of GiBs to
> it with a zfs(8) send/receive pipe. Here are the relevant
> system messages:
>
> # dmesg
> ...
> [ 57131.573806] mpii0: physical device removed from slot 7
> [ 57131.573806] sd7d: error writing fsbn 1816866262 of 1816866262-1816866389 (sd7 bn 1816866262; cn 894127 tn 1 sn 71)
> [ 57131.573806] cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205 (cgd0 bn 1816604078; cn 887013 tn 0 sn 1454)
> [ 57131.573806] sd7d: error reading fsbn 270904 of 270904-270919 (sd7 bn 270904; cn 133 tn 5 sn 13)
> [ 57131.573806] sd7d: error reading fsbn 7814028344 of 7814028344-7814028359 (sd7 bn 7814028344; cn 3845486 tn 6 sn 30)
> [ 57131.573806] sd7d: error reading fsbn 7814028856 of 7814028856-7814028871 (sd7 bn 7814028856; cn 3845486 tn 10 sn 34)
> [ 57131.573806] sd7: autoconfiguration error: cache synchronization failed
> [ 57131.573806] cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687 (cgd0 bn 7813766672; cn 3815315 tn 0 sn 1552)
> [ 57131.573806] cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175 (cgd0 bn 7813766160; cn 3815315 tn 0 sn 1040)
> [ 57131.573806] cgd0d: error reading fsbn 8720 of 8720-8735 (cgd0 bn 8720; cn 4 tn 0 sn 528)
> [ 57131.573806] sd7d: error writing fsbn 1816866646 of 1816866646-1816866773 (sd7 bn 1816866646; cn 894127 tn 4 sn 74)
> [ 57131.573806] cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589 (cgd0 bn 1816604462; cn 887013 tn 0 sn 1838)
> [ 57131.573806] sd7d: error writing fsbn 1816866518 of 1816866518-1816866645 (sd7 bn 1816866518; cn 894127 tn 3 sn 73)
> [ 57131.573806] cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461 (cgd0 bn 1816604334; cn 887013 tn 0 sn 1710)
> [ 57131.593815] sd7: autoconfiguration error: cache synchronization failed
> [ 57131.643840] dk11 at sd7 (backupcgd0) deleted
> [ 57131.643840] dk10 at sd7 (backupcgd0.config) deleted
> [ 57131.643840] sd7: detached
>
> I don't know how to go about diagnosing the issue and would
> appreciate any suggestions. In particular, the hdd is new
> and I wonder if I should return it for a replacement. The
> previous disk in the same bay had also been showing
> read/write errors (the other drive never got detached,
> though).
>
> Apart from the drive, I have also little faith in the
> backplate, cables, SAS controller (which I reflashed), RAM,
> etc., although here it looks to me like the problem could
> be somewhere between the drive and the controller.
>
> Many thanks,
> Pouya
>
> N.B. I'm also a bit confused by how zfs is handling this:
> zpool(8) appears to think the drive is still online, while
> zfs(8) doesn't list any datasets on it:
>
> # zpool status -v puddle
>   pool: puddle
>  state: ONLINE
> status: One or more devices are faulted in response to IO failures.
> action: Make sure the affected devices are connected, then run 'zpool clear'.
>    see: http://illumos.org/msg/ZFS-8000-HC
>   scan: none requested
> config:
>
>         NAME              STATE     READ WRITE CKSUM
>         puddle            ONLINE       0 3.62K     0
>           wedges/backup0  ONLINE       0   213     0
>
> errors: Permanent errors have been detected in the following files:
>
>         puddle/backup.pond/backup:<0x0>
>         puddle/backup.pond/backup:<0x10ecc5>
>
> # zfs list puddle
> cannot open 'puddle': pool I/O is currently suspended
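
A side note for anyone reading the quoted dmesg output: the sd7 and cgd0 error lines appear to describe the same failed requests at two layers of the stack. The rough Python sketch below checks this by extracting the starting block numbers and comparing them. The names `LOG`, `PATTERN` and `block_offsets` are made up for illustration, and pairing the entries by sorted order is an assumption, since the log does not say which cgd0 line belongs to which sd7 line.

```python
import re

# Error lines copied from the quoted dmesg output
# (timestamps and the parenthesised geometry are trimmed).
LOG = """\
sd7d: error writing fsbn 1816866262 of 1816866262-1816866389
cgd0d: error writing fsbn 1816604078 of 1816604078-1816604205
sd7d: error reading fsbn 270904 of 270904-270919
sd7d: error reading fsbn 7814028344 of 7814028344-7814028359
sd7d: error reading fsbn 7814028856 of 7814028856-7814028871
cgd0d: error reading fsbn 7813766672 of 7813766672-7813766687
cgd0d: error reading fsbn 7813766160 of 7813766160-7813766175
cgd0d: error reading fsbn 8720 of 8720-8735
sd7d: error writing fsbn 1816866646 of 1816866646-1816866773
cgd0d: error writing fsbn 1816604462 of 1816604462-1816604589
sd7d: error writing fsbn 1816866518 of 1816866518-1816866645
cgd0d: error writing fsbn 1816604334 of 1816604334-1816604461
"""

# Capture the device name (without the trailing partition letter)
# and the first failing block number of each error line.
PATTERN = re.compile(
    r"^(sd7|cgd0)d: error (?:reading|writing) fsbn (\d+)", re.M
)


def block_offsets(log):
    """Return sd7-minus-cgd0 block differences, pairing by sorted order.

    Pairing by sorted order is a heuristic (an assumption of this
    sketch), not something the kernel messages state.
    """
    blocks = {"sd7": [], "cgd0": []}
    for dev, fsbn in PATTERN.findall(log):
        blocks[dev].append(int(fsbn))
    return [s - c for s, c in zip(sorted(blocks["sd7"]), sorted(blocks["cgd0"]))]


offsets = block_offsets(LOG)
print(offsets)
```

All six pairs differ by the same 262184 blocks. That would be consistent with cgd0 sitting on a wedge that starts that far into sd7, with every cgd0 error being the sd7 failure propagating upward, though this is an inference from the numbers and not something the log itself states.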