Bug#983086: [Pkg-zfsonlinux-devel] Bug#983086: zfsutils-linux: TRIM crashes SSD drives

2021-02-22 Thread xavier

Hi Aron,


> I'm particularly interested in the protocol used by these disks; I
> suspect all of them are SATA 2.x/3.0?


Yes, they are SATA 3 disks on SATA 3 motherboards (Supermicro X10DRT-*).
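
For reference, the negotiated SATA revision of each drive can be
confirmed with smartctl from smartmontools (assuming it is installed;
the device name below is just an example), which prints a line like:

# smartctl -i /dev/sda | grep -i 'SATA Version'
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)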


> Prior to SATA 3.1, the TRIM command is non-queued (blocking), and this
> might be the root cause of the crashes in your busy environment. In
> other words, actively trimming SATA 2.x/3.0 disks can be harmful to
> heavy workloads even when the disks are enterprise-grade with ample
> IOPS capacity.


Thanks for your insight; that would explain our problem.


> Thanks for this advice; we'll have a look at how to get something
> landed for bullseye and buster-bpo.


Thank you!

Cheers,
Xavier



Bug#983086: [Pkg-zfsonlinux-devel] Bug#983086: zfsutils-linux: TRIM crashes SSD drives

2021-02-21 Thread Aron Xu
Hi,

On Fri, Feb 19, 2021 at 4:45 PM Xavier  wrote:
>
> Package: zfsutils-linux
> Version: 0.8.6-1~bpo10+1
> Severity: important
>
> Dear Maintainer,
>
> The recently added cron job ("TRIM the first Sunday of every month")
> makes some SSD drives crash.
>
> The problem appears on reasonably busy and otherwise stable servers:
>* with about 100 containers,
>* each on a separate zvol, ext4 mounted with discard option,
>* on a raidz2 of 6 identical drives.
>
> The issue has been observed on these drives:
>* Micron_5100_MTFDDAK960TCB
>* Samsung_SSD_850_EVO_1TB
>* Samsung_SSD_860_EVO_1TB
>

I'm particularly interested in the protocol used by these disks; I
suspect all of them are SATA 2.x/3.0?

Prior to SATA 3.1, the TRIM command is non-queued (blocking), and this
might be the root cause of the crashes in your busy environment. In
other words, actively trimming SATA 2.x/3.0 disks can be harmful to
heavy workloads even when the disks are enterprise-grade with ample
IOPS capacity.
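
Whether a drive advertises TRIM at all, and whether the kernel will
actually issue discards to it, can be checked with standard tools
(device name is an example; hdparm may need to be installed). Output
will look something like:

# hdparm -I /dev/sda | grep -i TRIM
   *    Data Set Management TRIM supported (limit 8 blocks)
   *    Deterministic read ZEROs after TRIM
# lsblk --discard /dev/sda
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda         0      512B       2G         0

A non-zero DISC-GRAN/DISC-MAX means the block layer is willing to send
discards to the device.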

> When affected (it is not always the case), the systems could not
> complete the cancellation of the trim with:
> # zpool trim -c pool
> Testing trim on only one drive, and reducing the rate to as low as 50,
> did not help.
>
> A reset seems to be the only solution, followed by a zpool trim -c after
> reboot.
>
> It would be wise to deactivate that cron job by default, or at least to
> provide a convenient way to do so, such as an option in /etc/default/zfs.
>
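
For anyone reproducing this, the trim controls mentioned above are as
follows (the pool name is an example, and 50M is only a guessed unit
for the "50" in the report):

# zpool trim -r 50M pool    (throttle the trim rate)
# zpool status -t pool      (show per-vdev trim state and progress)
# zpool trim -c pool        (request cancellation; the step that hung)

zpool status -t is useful for confirming whether a cancellation
actually completes.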

Thanks for this advice; we'll have a look at how to get something
landed for bullseye and buster-bpo.
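
In the meantime, a minimal local workaround, assuming the job is the
one shipped in /etc/cron.d/zfsutils-linux (the path and the matched
line may differ between versions), is to comment out the monthly TRIM
entry:

# sed -i '/zfs-linux\/trim/s/^/#/' /etc/cron.d/zfsutils-linux

dpkg will then prompt about the modified conffile on upgrades, which
is one more reason a proper switch in /etc/default/zfs would be
cleaner.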

Regards,
Aron