I have another suggestion: check the RAM, just in case, with memtest86 or https://github.com/martinwhitaker/pcmemtest (which is a fork of memtest86+). Ignore the suggestion if you have ECC RAM.
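If you are not sure whether your modules are ECC, a quick way to check from a running Linux system is shown below (a minimal sketch; it assumes dmidecode is installed and run as root):

```
# Report the error-correction capability of the memory array.
# "None" means non-ECC RAM, so a memtest86 / pcmemtest pass is worthwhile;
# "Multi-bit ECC" (or similar) means you can likely skip the test.
dmidecode -t memory | grep -i "error correction type"
```

memtest86/pcmemtest itself is booted from a USB stick outside the OS, so it cannot be scripted from the shell; the check above only tells you whether scheduling such a run is worth the downtime.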
Tue, 22 Feb 2022 at 15:45, Igor Fedotov <igor.fedo...@croit.io>:
>
> Hi Sebastian,
>
> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
> > Hey Igor!
> >
> >> thanks a lot for the new logs - looks like they provide some insight.
> >
> > I'm glad the logs are helpful.
> >
> >> At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
> >
> > Do I understand that correctly: the corruption of the RocksDB (table overwritten by zeros) happens at the first start of the OSD after “*** Immediate shutdown (osd_fast_shutdown=true) ***”? Before the system launches the OSD service, the RocksDB is still fine?
>
> Looks like that. From the logs I can see an unexpected write to a specific extent (LBA 0x630000) which shouldn't occur and at which RocksDB subsequently fails.
>
> >> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like a disk-wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to mind. If you can identify something like that - could you please collect an OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
> >
> > To the best of my knowledge the OSD was hardly doing anything, and I do not see any pattern that would fit your explanation. However, you certainly understand a lot more about this than I do, so I will try to explain everything that could be relevant.
> >
> > The cluster has 3 nodes. Each has a 240GB NVMe M.2 SSD as boot drive, which should not be relevant. Each node has 3 OSDs: one is on a 2TB U.2 NVMe SSD and the other two are on 12TB HDDs.
> >
> > I have configured two CRUSH rules, ‘c3nvme’ and ‘ec4x2hdd’. The ‘c3nvme’ is a replicated rule that uses only OSDs with class ‘nvme’. The second rule is a tricky erasure rule: it selects exactly 2 OSDs on exactly 4 hosts with class ‘hdd’, so it only works for a size of exactly 8. That means a pool using this rule always has only “undersized” placement groups, since the cluster has only 3 nodes. (I did not add the fourth server after the first crash in December, since we want to reproduce the problem.)
> >
> > The pools device_health_metrics, test-pool, fs.metadata-root-pool, fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the CRUSH rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools fs.data-hdd.ec-pool and block-hdd.ec-pool use the CRUSH rule ‘ec4x2hdd’ with k=5, m=3 and a min size of 6.
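For readers who want to set up a similar layout, here is a rough sketch of how rules like the ones described above might be created. The rule and pool names come from the thread, but the original rule definitions are not shown there, so the exact syntax and the placeholder PG counts are assumptions; in particular, the “exactly 2 OSDs on exactly 4 hosts” behaviour needs a hand-written CRUSH rule rather than a plain erasure-code profile.

```
# Replicated rule limited to device class "nvme" (root "default", failure domain "host"):
ceph osd crush rule create-replicated c3nvme default host nvme

# Erasure-code profile with k=5, m=3 on HDD-class OSDs. A plain profile like this
# would place the 8 shards on 8 different hosts; the "2 OSDs on 4 hosts" placement
# described above requires decompiling the CRUSH map and writing a custom rule
# (choose 4 hosts, then chooseleaf 2 OSDs within each host).
ceph osd erasure-code-profile set ec4x2hdd k=5 m=3 crush-device-class=hdd

# Example pool on that profile, with min_size lowered to 6 as in the thread
# (the PG count of 64 is an arbitrary placeholder):
ceph osd pool create fs.data-hdd.ec-pool 64 64 erasure ec4x2hdd
ceph osd pool set fs.data-hdd.ec-pool min_size 6
```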
> >
> > The pool fs.data-nvme.c-pool is not used, and the pool test-pool was used for rados bench a few months ago.
> >
> > The pool fs.metadata-root-pool is used as the metadata pool for CephFS and fs.data-root-pool as the root data pool for the CephFS. The pool fs.data-hdd.ec-pool is an additional data pool for the CephFS and is specified as ceph.dir.layout for some folders of the CephFS. The whole CephFS is mounted by each of the 3 nodes.
> >
> > The pool block-nvme.c-pool hosts two RBD images that are used as boot drives for two VMs. The first VM runs with Ubuntu Desktop and the second with Debian as OS. The pool block-hdd.ec-pool hosts one RBD image (the data part; metadata on block-nvme.c-pool) that is attached to the Debian VM as a second drive formatted with BTRFS. Furthermore, the Debian VM mounts a subdirectory of the CephFS that has the fs.data-hdd.ec-pool set as layout. Both VMs were doing nothing, apart from being booted, in the last couple of days.
> >
> > I try to illustrate the pool usage as a tree:
> > * c3nvme (replicated, size=3, min_size=2)
> >   + device_health_metrics
> >   + test-pool
> >     - rados bench
> >   + fs.metadata-root-pool
> >     - CephFS (metadata)
> >   + fs.data-root-pool
> >     - CephFS (data root)
> >   + fs.data-nvme.c-pool
> >   + block-nvme.c-pool
> >     - RBD (Ubuntu VM, boot disk with ext4)
> >     - RBD (Debian VM, boot disk with ext4)
> > * ec4x2hdd (ec, k=5, m=3, min_size=6)
> >   + fs.data-hdd.ec-pool
> >     - CephFS (layout data)
> >   + block-hdd.ec-pool
> >     - RBD (Debian VM, disk 2 with BTRFS)
> >
> > Last week I wrote about 30TB to the CephFS inside the fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the block-hdd.ec-pool. The 50GB written to BTRFS contain hundreds of thousands of hard links. There was already nearly the same data on the storage and I used rsync to update it. I think something between 1 and 4 TB had changed and was updated by rsync.
> >
> > I think the cluster was totally unused on Friday, but up and running and idling around. Then on Saturday I did a graceful shutdown of all cluster nodes. Around 5 minutes later, when I booted the servers again, OSD.7 crashed. I copied the logs and exported the RocksDB. Then I deleted everything from the HDD and deployed OSD.7 again. When I checked for the first time today at around 12:00, Ceph had already finished backfilling to OSD.7 and the cluster was idle again.
> >
> > I then spent 70 minutes writing 3 small files (one of about 500 bytes and two of about 2KB) to the CephFS on the fs.data-hdd.ec-pool and shutting down the whole cluster. After I booted all 3 nodes again, I overwrote the 3 small files (some bytes had changed) and shut down all 3 nodes. I repeated that procedure 4 times within 70 minutes. Then OSD.7 crashed again.
> >
> > There may have been some more IO caused by mounting the RBD images with the BTRFS volume inside the Debian VM, but I do not know. I can only say that no user was using it. I was only writing the 3 small files 4 times to the CephFS on the fs.data-hdd.ec-pool.
> >
> > Does this make any sense to you?
>
> Well, that's actually not a write pattern I would expect to cause the issue. But thanks a lot anyway. So could you please repeat that procedure of overwriting the 3 small files and restarting the OSD, and collect the logs with debug-bluestore set to 20.
>
> >> with debug-bluestore set to 20
> >
> > Do you mean
> > ```
> > [osd]
> > debug bluestore = 20
> > ```
> > in ceph.conf?
>
> Yes. Or you might use 'ceph config set osd.7 debug-bluestore 20' instead.
>
> > Thanks,
> > Sebastian
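As a side note for anyone following along, here is a minimal sketch of the runtime variant Igor mentions, including how to confirm the value and remove the override afterwards (osd.7 matches the OSD discussed in this thread; the log path assumes a default, package-based install):

```
# Raise BlueStore debug logging on the suspect OSD via the monitor config store
# (equivalent to "debug bluestore = 20" in the [osd] section of ceph.conf):
ceph config set osd.7 debug_bluestore 20
ceph config get osd.7 debug_bluestore    # confirm the setting took effect

# Reproduce the shutdown/restart, collect /var/log/ceph/ceph-osd.7.log, then
# remove the override so the log does not keep growing:
ceph config rm osd.7 debug_bluestore
```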
> >
> >
> >> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedo...@croit.io> wrote:
> >>
> >> Hey Sebastian,
> >>
> >> thanks a lot for the new logs - looks like they provide some insight. At this point I think the root cause is apparently a race between deferred writes replay and some DB maintenance task happening on OSD startup. It seems that deferred write replay updates a block extent which RocksDB/BlueFS are using. Hence the target BlueFS file gets all-zeros content. Evidently that's just a matter of chance whether they use a conflicting physical extent or not, hence the occasional nature of the issue...
> >>
> >> And now I'd like to determine what's wrong with this deferred write replay.
> >>
> >> So first of all I'm curious if you have any particular write patterns that can be culprits? E.g. something like a disk-wiping procedure which writes all-zeros to an object followed by object truncate or removal comes to mind. If you can identify something like that - could you please collect an OSD log for such an operation (followed by OSD restart) with debug-bluestore set to 20?
> >>
> >> Thanks,
> >>
> >> Igor
> >>
> >> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
> >>> Hi Igor,
> >>>
> >>> today (21-02-2022) at 13:49:28.452+0100, I crashed OSD.7 again. And this time I have logs with “debug bluefs = 20” and “debug bdev = 20” for every OSD in the cluster! It was the OSD with the ID 7 again. So the HDD has failed for the third time now! Coincidence? Probably not…
> >>> The important thing seems to be that a shutdown, and not only a restart, of the entire cluster is performed. This time the OSD failed after just 4 shutdowns of all nodes in the cluster within 70 minutes.
> >>>
> >>> I redeployed OSD.7 after the crash from 2 days ago, and I started this new shutdown-and-boot series earlier today, shortly after Ceph had finished writing everything back to OSD.7.
> >>>
> >>> The corrupted RocksDB file (crash) is again only 2KB in size. You can download the RocksDB file with the bad table magic number and the log of OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
> >>> What else do you want?
> >>>
> >>> From the log of OSD.7:
> >>> —————
> >>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
> >>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 4096 ioc 0x55f0b8dbdd18 with 0 aios left
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal Terminated ***
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate shutdown (osd_fast_shutdown=true) ***
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 set uid:gid to 64045:64045 (ceph:ceph)
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 ceph version 16.2.6 (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, pid 1967
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00 0 pidfile_write: ignore empty --pid-file
> >>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00 1 bdev(0x55bd400a0800 /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
> >>> —————
> >>>
> >>> To me this looks like the OSD did nothing for nearly 2 minutes before it received the termination request. Shouldn't that be enough time to flush every imaginable write cache?
> >>>
> >>> I hope this helps you.
> >>>
> >>> Best wishes,
> >>> Sebastian
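The thread does not say which tool was used to export the damaged RocksDB file, but for anyone who wants to pull BlueFS/RocksDB contents off a stopped BlueStore OSD in a similar way, one possible approach (assuming a package-based, systemd-managed OSD) is ceph-bluestore-tool:

```
# Stop the OSD first; the tool needs exclusive access to the device.
systemctl stop ceph-osd@7

# Dump the BlueFS file system (RocksDB SST files, MANIFEST, CURRENT, WAL, ...)
# into a directory for offline inspection:
ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-7 --out-dir /tmp/osd.7-bluefs

# Optionally run a consistency check against the same OSD:
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7
```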
> >>>
> >> --
> >> Igor Fedotov
> >> Ceph Lead Developer
> >>
> >> Looking for help with your Ceph cluster? Contact us at https://croit.io
> >>
> >> croit GmbH, Freseniusstr. 31h, 81247 Munich
> >> CEO: Martin Verges - VAT-ID: DE310638492
> >> Com. register: Amtsgericht Munich HRB 231263
> >> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Alexander E. Patrakov

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io