We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk
(partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are
12.2.4 running CephFS and RBDs.  So in total we have 15 NVMe's per cluster
and 30 NVMe's in total.  They were all built at the same time and were
running firmware version QDV10130.  On this firmware version we early on
had 2 disks failures, a few months later we had 1 more, and then a month
after that (just a few weeks ago) we had 7 disk failures in 1 week.

The failures are such that the disk is no longer visible to the OS.  This
holds true beyond server reboots as well as placing the failed disks into a
new server.  With a firmware upgrade tool we got an error that pretty much
said there's no way to get data back and to RMA the disk.  We upgraded all
of our remaining disks' firmware to QDV101D1 and haven't had any problems
since then.  Most of our failures happened while rebalancing the cluster
after replacing dead disks and we tested rigorously around that use case
after upgrading the firmware.  This firmware version seems to have resolved
whatever the problem was.

We have about 100 more of these scattered among database servers and other
servers that have never had this problem while running the
QDV10130 firmware as well as firmwares between this one and the one we
upgraded to.  Bluestore on Ceph is the only use case we've had so far with
this sort of failure.

Has anyone else come across this issue before?  Our current theory is that
Bluestore is accessing the disk in a way that is triggering a bug in the
older firmware version that isn't triggered by more traditional
filesystems.  We have a scheduled call with Intel to discuss this, but
their preliminary searches into the bugfixes and known problems between
firmware versions didn't indicate the bug that we triggered.  It would be
good to have some more information about what those differences for disk
accessing might be to hopefully get a better answer from them as to what
the problem is.


[1]
https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to