https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277992
Bug ID: 277992
Summary: mpr and possible trim issues
Product: Base System
Version: 14.0-STABLE
Hardware: amd64
OS: Any
Status: New
Severity: Affects Some People
Priority: ---
Component: kern
Assignee: [email protected]
Reporter: [email protected]
The thread
https://lists.freebsd.org/archives/freebsd-hardware/2024-March/000094.html has
most of the details.
In summary, a set of WD Blue SA510 SSDs with the latest firmware as of March 2024
will eventually start throwing errors and detach from the controller when I
repeatedly copy and then destroy a ZFS dataset with several million files. It
feels like a TRIM issue, but I am not sure. With the same disks attached to the
onboard SATA controller, I cannot recreate the issue.
If I start with a low-level trim (trim -f /dev/daX), create a raidz1 ZFS pool
from four 1 TB WD disks, import a dataset of about 280 GB (compressed) that holds
more than 20 million files, do a zfs send <original pool> | zfs recv <copy-of-pool>,
then zfs destroy <copy-of-pool>, and repeat that about 4 or 5 times, the drives in
the pool will start throwing errors.
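The cycle described above could be scripted roughly as follows. This is only a sketch under assumptions: the dataset names (srcpool/data, testpool/copy), the snapshot naming, and the DRY_RUN switch are hypothetical placeholders, and the report does not say which zfs send/recv options were actually used.

```sh
#!/bin/sh
# Sketch of the copy/destroy cycle. All names are placeholders (assumptions),
# not the reporter's actual pool layout.
SRC="${SRC:-srcpool/data}"      # hypothetical source dataset
DST="${DST:-testpool/copy}"     # hypothetical copy on the raidz1 test pool
CYCLES="${CYCLES:-5}"           # errors reportedly appear after 4 or 5 cycles
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands, 0 = execute them

run() {
    # Print the command line; execute it only when DRY_RUN=0.
    echo "+ $*"
    [ "$DRY_RUN" = "1" ] || sh -c "$*"
}

repro() {
    i=1
    while [ "$i" -le "$CYCLES" ]; do
        # zfs send needs a snapshot, so take one per cycle.
        run "zfs snapshot -r ${SRC}@cycle${i}"
        run "zfs send -R ${SRC}@cycle${i} | zfs recv -F ${DST}"
        run "zfs destroy -r ${DST}"
        i=$((i + 1))
    done
}
```

Calling repro with the default DRY_RUN=1 only prints the commands, which makes it easy to review them before running the loop against real pools with DRY_RUN=0.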
If I do a hard trim of the disks, I can start from scratch and again get 4 or 5
cycles before the errors appear, which is why it feels like a broken TRIM issue.
I tried with autotrim on and off, and with a manual zfs trim <pool> between the
zfs send | zfs recv tests, to no avail. When the disks are on the mpr controller,
I get errors such as:
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ae 28 00 00 08 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 0c cb 3f 00 00 00 e8 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ad 28 00 01 00 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): READ(10). CDB: 28 00 6d e0 ac 28 00 00 f8 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 40 07 df 88 00 01 00 00
(da6:mpr0:0:16:0): CAM status: CCB request completed with an error
(da6:mpr0:0:16:0): Retrying command, 3 more tries remain
(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 3f 48 72 08 00 01 00 00
(da6:mpr0:0:16:0): CAM status: SCSI Status Error
(da6:mpr0:0:16:0): SCSI status: Check Condition
(da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mpr0:0:16:0): Retrying command (per sense data)
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2036 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 637 loginfo 31110f00
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 42 00 00 01 00 00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1242 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 979 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1243 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2091 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 1612 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2093 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 152 loginfo 31110f00
mpr0: Controller reported scsi ioc terminated tgt 15 SMID 2132 loginfo 31110f00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 43 17 dc 88 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 43 00 00 00 50 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 0c d4 f6 80 00 00 68 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 0c d4 f5 80 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 05 dc 12 28 00 00 f8 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 05 dc 0f b0 00 00 88 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 02 96 7e 80 00 00 10 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): READ(10). CDB: 28 00 6f 5b 8d 68 00 01 00 00
(da5:mpr0:0:15:0): CAM status: CCB request completed with an error
(da5:mpr0:0:15:0): Retrying command, 3 more tries remain
(da5:mpr0:0:15:0): WRITE(10). CDB: 2a 00 41 98 42 00 00 01 00 00
(da5:mpr0:0:15:0): CAM status: SCSI Status Error
(da5:mpr0:0:15:0): SCSI status: Check Condition
(da5:mpr0:0:15:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da5:mpr0:0:15:0): Retrying command (per sense data)
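For anyone reading the log above, the failing requests can be decoded with a couple of small helpers. This is a sketch, not part of the original report: decode_rw10 and decode_loginfo are hypothetical helper names, the CDB layout is the standard 10-byte READ(10)/WRITE(10) format, and the loginfo field split (bus type, originator, code, subcode) is an assumption based on the layout commonly used for LSI/Broadcom MPI SAS log info words.

```sh
#!/bin/sh
# decode_rw10 B0 B1 ... B9
# Takes the ten CDB bytes of a READ(10)/WRITE(10) as separate hex arguments
# and prints the starting LBA and the transfer length in blocks.
decode_rw10() {
    # Bytes 2-5 (arguments 3-6) hold the big-endian LBA,
    # bytes 7-8 (arguments 8-9) hold the transfer length.
    lba=$(( 0x$3$4$5$6 ))
    len=$(( 0x$8$9 ))
    echo "lba=$lba blocks=$len"
}

# decode_loginfo HEX
# Splits a 32-bit loginfo word into its fields.
# Assumed layout: bits 31-28 bus type, 27-24 originator,
# 23-16 code, 15-0 subcode.
decode_loginfo() {
    v=$(( 0x$1 ))
    printf 'type=0x%x origin=0x%x code=0x%02x subcode=0x%04x\n' \
        $(( (v >> 28) & 0xf )) $(( (v >> 24) & 0xf )) \
        $(( (v >> 16) & 0xff )) $(( v & 0xffff ))
}
```

For example, decode_rw10 28 00 6d e0 ae 28 00 00 08 00 prints lba=1843441192 blocks=8 for the first READ(10) in the log, and decode_loginfo 31110f00 prints type=0x3 origin=0x1 code=0x11 subcode=0x0f00. What code 0x11 means is not decoded here and would have to be looked up in the controller documentation.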
The same tests with Samsung disks run without issue, or at least I was not able
to recreate the error with them.
# mprutil show adapter
mpr0 Adapter:
Board Name: INSPUR 3008IT
Board Assembly: INSPUR
Chip Name: LSISAS3008
Chip Revision: ALL
BIOS Revision: 18.00.00.00
Firmware Revision: 16.00.12.00
Integrated RAID: no
SATA NCQ: ENABLED
PCIe Width/Speed: x8 (8.0 GB/sec)
IOC Speed: Full
Temperature: 56 C
I originally ran into this problem with the same series of LSI adapter, but that
card was not in IT mode and used the mrsas driver instead.
When attached to the onboard ATA controller, the disks use DSM_TRIM. When
attached to the mpr controller, they use ATA_TRIM.
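To compare the TRIM method per disk on each controller, something like the following could be used. It is a sketch under assumptions: delete_method_oid is a hypothetical helper, and it assumes the value is exposed through the kern.cam.<driver>.<unit>.delete_method sysctl for both the da and ada drivers, which should be confirmed on the affected release.

```sh
#!/bin/sh
# delete_method_oid DEV
# Maps a CAM disk name such as da6 or ada0 to the sysctl OID that is
# assumed to hold its BIO_DELETE (TRIM) method.
delete_method_oid() {
    drv=$(printf '%s' "$1" | sed 's/[0-9]*$//')    # driver name, e.g. "da"
    unit=$(printf '%s' "$1" | sed 's/^[a-z]*//')   # unit number, e.g. "6"
    echo "kern.cam.${drv}.${unit}.delete_method"
}

# On the affected FreeBSD host (not run here):
#   sysctl "$(delete_method_oid da6)"
#   sysctl "$(delete_method_oid ada0)"
```

Running the two sysctl commands with the same disk on each controller in turn would show whether the method changes between DSM_TRIM and ATA_TRIM as described.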