I have another suggestion: check the RAM, just in case, with memtest86
or https://github.com/martinwhitaker/pcmemtest (which is a fork of
memtest86+). Ignore the suggestion if you have ECC RAM.
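
If taking a node offline for a full memtest86+/pcmemtest pass is inconvenient, a much weaker but quick in-place check can be done on the running system with the userspace memtester tool. A minimal sketch, assuming a Debian-based node with the memtester package available and a few GiB of RAM free:

```
# test 4 GiB of RAM for 3 passes; adjust the size to what is actually free
sudo apt install memtester
sudo memtester 4096M 3
```

This only exercises memory the kernel hands to a single process, so a clean result is far less conclusive than a bootable memtest86+ run.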

On Tue, 22 Feb 2022 at 15:45, Igor Fedotov <igor.fedo...@croit.io> wrote:
>
> Hi Sebastian,
>
> On 2/22/2022 3:01 AM, Sebastian Mazza wrote:
> > Hey Igor!
> >
> >
> >> thanks a lot for the new logs - looks like they provide some insight.
> > I'm glad the logs are helpful.
> >
> >
> >> At this point I think the root cause is apparently a race between
> >> deferred write replay and some DB maintenance task happening on OSD
> >> startup. It seems that the deferred write replay updates a block
> >> extent which RocksDB/BlueFS are using, hence the target BlueFS file
> >> gets all-zeros content. Evidently it's just a matter of chance
> >> whether they use a conflicting physical extent or not, hence the
> >> occasional nature of the issue...
> >
> > Do I understand that correctly: the corruption of the RocksDB (table
> > overwritten by zeros) happens at the first start of the OSD after “***
> > Immediate shutdown (osd_fast_shutdown=true) ***”? Before the system
> > launches the OSD service, the RocksDB is still fine?
> Looks like it. From the logs I can see an unexpected write to a
> specific extent (LBA 0x630000) which shouldn't occur and at which
> RocksDB subsequently fails.
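>
> (Once a debug-20 log of the next reproduction is available, the
> offending write can usually be located by searching for that offset
> directly - a rough sketch, assuming the offset appears verbatim in the
> bdev/bluefs debug lines and the default log location:)
>
> ```
> grep -n '0x630000' /var/log/ceph/ceph-osd.7.log
> ```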
> >
> >
> >> So first of all I'm curious whether you have any particular write
> >> patterns that could be the culprit? E.g. something like a disk wiping
> >> procedure which writes all-zeros to an object, followed by object
> >> truncation or removal, comes to mind. If you can identify something
> >> like that - could you please collect the OSD log for such an
> >> operation (followed by an OSD restart) with debug-bluestore set to 20?
> > To the best of my knowledge the OSD was hardly doing anything, and I
> > do not see any pattern that would fit your explanation. However, you
> > certainly understand a lot more about this than I do, so I will try to
> > explain everything that could be relevant.
> >
> > The cluster has 3 nodes. Each has a 240GB M.2 NVMe SSD as a boot
> > drive, which should not be relevant. Each node has 3 OSDs: one is on a
> > 2TB U.2 NVMe SSD and the other two are on 12TB HDDs.
> >
> > I have configured two crush rules, ‘c3nvme’ and ‘ec4x2hdd’. The
> > ‘c3nvme’ rule is a replicated rule that uses only OSDs with class
> > ‘nvme’. The second rule is a tricky erasure rule: it selects exactly 2
> > OSDs on each of exactly 4 hosts with class ‘hdd’, so it only works for
> > a size of exactly 8. That means a pool that uses this rule always has
> > only “undersized” placement groups, since the cluster has only 3
> > nodes. (I did not add the fourth server after the first crash in
> > December, since we want to reproduce the problem.)
> >
> > The pools device_health_metrics, test-pool, fs.metadata-root-pool,
> > fs.data-root-pool, fs.data-nvme.c-pool, and block-nvme.c-pool use the
> > crush rule ‘c3nvme’ with a size of 3 and a min size of 2. The pools
> > fs.data-hdd.ec-pool and block-hdd.ec-pool use the crush rule
> > ‘ec4x2hdd’ with k=5, m=3 and a min size of 6.
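> >
> > (For completeness, the exact rule and pool definitions can be dumped
> > from any admin node if that helps - a sketch using standard commands:)
> >
> > ```
> > ceph osd crush rule dump c3nvme
> > ceph osd crush rule dump ec4x2hdd
> > ceph osd pool ls detail
> > ```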
> >
> > The pool fs.data-nvme.c-pool is not used, and the pool test-pool was
> > used for rados bench a few months ago.
> >
> > The pool fs.metadata-root-pool is used as metadata pool for cephFS and 
> > fs.data-root-pool as the root data pool for the cephFS. The pool 
> > fs.data-hdd.ec-pool is an additional data pool for the cephFS and is 
> > specified as ceph.dir.layout for some folders of the cephFS. The whole 
> > cephFS is mounted by each of the 3 nodes.
> >
> > The pool block-nvme.c-pool hosts two RBD images that are used as boot
> > drives for two VMs. The first VM runs Ubuntu Desktop and the second
> > runs Debian. The pool block-hdd.ec-pool hosts one RBD image (the data
> > part; the metadata is on block-nvme.c-pool) that is attached to the
> > Debian VM as a second drive formatted with BTRFS. Furthermore, the
> > Debian VM mounts a subdirectory of the cephFS that has the
> > fs.data-hdd.ec-pool set as its layout. Both VMs were doing nothing,
> > apart from being booted, in the last couple of days.
> >
> > I try to illustrate the pool usage as a tree:
> > * c3nvme (replicated, size=3, min_size=2)
> >      + device_health_metrics
> >      + test-pool
> >          - rados bench
> >      + fs.metadata-root-pool
> >          - CephFS (metadata)
> >      + fs.data-root-pool
> >          - CephFS (data root)
> >      + fs.data-nvme.c-pool
> >      + block-nvme.c-pool
> >          - RBD (Ubuntu VM, boot disk with ext4)
> >          - RBD (Debian VM, boot disk with ext4)
> > * ec4x2hdd (ec,  k=5, m=3, min_size=6)
> >      + fs.data-hdd.ec-pool
> >          - CephFS (layout data)
> >      + block-hdd.ec-pool
> >          - RBD (Debian VM, disk 2 with BTRFS)
> >
> >
> > Last week I wrote about 30TB to the CephFS inside the
> > fs.data-hdd.ec-pool and around 50GB to the BTRFS volume on the
> > block-hdd.ec-pool. The 50GB written to BTRFS contains hundreds of
> > thousands of hard links. Nearly the same data was already on the
> > storage and I used rsync to update it. I think something between 1 and
> > 4 TB had changed and was updated by rsync.
> >
> > I think the cluster was totally unused on Friday, but up and running
> > and idling. Then on Saturday I did a graceful shutdown of all cluster
> > nodes. Around 5 minutes later, when I booted the servers again, OSD.7
> > crashed. I copied the logs and exported the RocksDB. Then I deleted
> > everything from the HDD and deployed OSD.7 again. When I checked for
> > the first time today at around 12:00, Ceph had already finished
> > backfilling to OSD.7 and the cluster was idle again.
> >
> > I then spent 70 minutes writing 3 small files (one of about 500 bytes
> > and two of about 2KB) to the cephFS on the fs.data-hdd.ec-pool and
> > shutting down the whole cluster. After I booted all 3 nodes again, I
> > overwrote the 3 small files (a few bytes had changed) and shut down
> > all 3 nodes. I repeated that procedure 4 times within the 70 minutes.
> > Then OSD.7 crashed again.
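> >
> > (For reproducibility, each iteration was essentially the following -
> > a rough sketch with hypothetical file names and mount point, the real
> > paths differ:)
> >
> > ```
> > # on a client with the cephFS mounted, inside a directory whose
> > # ceph.dir.layout points at fs.data-hdd.ec-pool
> > cd /mnt/cephfs/ec-layout-dir            # hypothetical path
> > echo "change $(date)" >> file-500b.txt  # hypothetical file names
> > echo "change $(date)" >> file-2k-a.txt
> > echo "change $(date)" >> file-2k-b.txt
> > sync
> > # then gracefully shut down all 3 nodes and boot them again
> > ```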
> >
> > There may have been some additional IO caused by mounting the RBD
> > image with the BTRFS volume inside the Debian VM, but I do not know. I
> > can only say that no user was using it. Only I was writing the 3 small
> > files 4 times to the cephFS on the fs.data-hdd.ec-pool.
> >
> > Does this make any sense to you?
> Well, that's actually not a write pattern I would expect to cause the
> issue. But thanks a lot anyway. So could you please repeat that
> procedure of overwriting the 3 small files and restarting the OSD, and
> collect the logs with debug-bluestore set to 20?
> >
> >> with debug-bluestore set to 20
> > Do you mean
> > ```
> > [osd]
> >       debug bluestore = 20
> > ```
> > in ceph.conf?
> Yes. Or you might use 'ceph config set osd.7 debug-bluestore 20' instead.
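>
> (A minimal sketch of the centralized-config route, assuming the usual
> underscore spelling of the option name is accepted:)
>
> ```
> ceph config set osd.7 debug_bluestore 20
> ceph config get osd.7 debug_bluestore    # verify the value took effect
> # after the logs have been collected:
> ceph config rm osd.7 debug_bluestore
> ```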
>
> > Thanks,
> > Sebastian
> >
> >
> >
> >
> >> On 21.02.2022, at 21:19, Igor Fedotov <igor.fedo...@croit.io> wrote:
> >>
> >> Hey Sebastian,
> >>
> >> thanks a lot for the new logs - looks like they provide some
> >> insight. At this point I think the root cause is apparently a race
> >> between deferred write replay and some DB maintenance task happening
> >> on OSD startup. It seems that the deferred write replay updates a
> >> block extent which RocksDB/BlueFS are using, hence the target BlueFS
> >> file gets all-zeros content. Evidently it's just a matter of chance
> >> whether they use a conflicting physical extent or not, hence the
> >> occasional nature of the issue...
> >>
> >> And now I'd like to determine what's wrong with this deferred write replay.
> >>
> >> So first of all I'm curious whether you have any particular write
> >> patterns that could be the culprit? E.g. something like a disk wiping
> >> procedure which writes all-zeros to an object, followed by object
> >> truncation or removal, comes to mind. If you can identify something
> >> like that - could you please collect the OSD log for such an
> >> operation (followed by an OSD restart) with debug-bluestore set to 20?
> >>
> >>
> >> Thanks,
> >>
> >> Igor
> >>
> >> On 2/21/2022 5:29 PM, Sebastian Mazza wrote:
> >>> Hi Igor,
> >>>
> >>> today (21-02-2022) at 13:49:28.452+0100, I crashed OSD 7 again. And
> >>> this time I have logs with "debug bluefs = 20" and "debug bdev = 20"
> >>> for every OSD in the cluster! It was the OSD with ID 7 again, so this
> >>> HDD has now failed for the third time! Coincidence? Probably not…
> >>> The important thing seems to be that a shutdown, and not only a
> >>> restart, of the entire cluster is performed; this time the OSD failed
> >>> after just 4 shutdowns of all nodes in the cluster within 70 minutes.
> >>>
> >>> I redeployed OSD.7 after the crash from 2 days ago, and I started
> >>> this new shutdown-and-boot series earlier today, shortly after Ceph
> >>> had finished writing everything back to OSD.7.
> >>>
> >>> The corrupted RocksDB file (crash) is again only 2KB in size.
> >>> You can download the RocksDB file with the bad table magic number and
> >>> the log of OSD.7 under this link: https://we.tl/t-e0NqjpSmaQ
> >>> What else do you need?
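> >>>
> >>> (For anyone who wants to pull similar artifacts from a stopped OSD -
> >>> a sketch, not necessarily the exact commands used here; the output
> >>> directory is arbitrary:)
> >>>
> >>> ```
> >>> ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-7 \
> >>>     --out-dir /tmp/osd7-bluefs
> >>> ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-7
> >>> ```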
> >>>
> >>> From the log of OSD.7:
> >>> —————
> >>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 20 bdev(0x55f088a27400 
> >>> /var/lib/ceph/osd/ceph-7/block) _aio_log_finish 1 0x96d000~1000
> >>> 2022-02-21T13:47:39.945+0100 7f6fa3f91700 10 bdev(0x55f088a27400 
> >>> /var/lib/ceph/osd/ceph-7/block) _aio_thread finished aio 0x55f0b8c7c910 r 
> >>> 4096 ioc 0x55f0b8dbdd18 with 0 aios left
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 received  signal: Terminated 
> >>> from /sbin/init  (PID: 1) UID: 0
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Got signal 
> >>> Terminated ***
> >>> 2022-02-21T13:49:28.452+0100 7f6fa8a34700 -1 osd.7 4711 *** Immediate 
> >>> shutdown (osd_fast_shutdown=true) ***
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 set uid:gid to 64045:64045 
> >>> (ceph:ceph)
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 ceph version 16.2.6 
> >>> (1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process 
> >>> ceph-osd, pid 1967
> >>> 2022-02-21T13:53:40.455+0100 7fc9645f4f00  0 pidfile_write: ignore empty 
> >>> --pid-file
> >>> 2022-02-21T13:53:40.459+0100 7fc9645f4f00  1 bdev(0x55bd400a0800 
> >>> /var/lib/ceph/osd/ceph-7/block) open path /var/lib/ceph/osd/ceph-7/block
> >>> —————
> >>>
> >>> To me this looks like the OSD did nothing for nearly 2 minutes
> >>> before it received the termination request. Shouldn't that be enough
> >>> time to flush every imaginable write cache?
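> >>>
> >>> (Or could the drive's own volatile write cache play a role? Its
> >>> state can be checked with something like the following - the device
> >>> name is a placeholder:)
> >>>
> >>> ```
> >>> hdparm -W /dev/sdX           # show the drive's write-caching setting
> >>> smartctl -g wcache /dev/sdX
> >>> ```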
> >>>
> >>>
> >>> I hope this helps you.
> >>>
> >>>
> >>> Best wishes,
> >>> Sebastian
> >>>
> >> --
> >> Igor Fedotov
> >> Ceph Lead Developer
> >>
> >> Looking for help with your Ceph cluster? Contact us at https://croit.io
> >>
> >> croit GmbH, Freseniusstr. 31h, 81247 Munich
> >> CEO: Martin Verges - VAT-ID: DE310638492
> >> Com. register: Amtsgericht Munich HRB 231263
> >> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>



-- 
Alexander E. Patrakov