Awesome. I made a ticket and pinged the Bluestore guys about it: http://tracker.ceph.com/issues/40557
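For anyone who wants to check whether their own OSDs are carrying an
oversized DB: the BlueFS contents of a stopped OSD can be dumped to a plain
directory for inspection with something like the sketch below. The OSD id
and output directory here are illustrative, so adjust them for your setup.

  # dump the BlueFS (RocksDB) contents of a stopped OSD to a directory
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-2 --out-dir /tmp/osd-2-bluefs
  # the db/ subdirectory holds the RocksDB SST files; this gives a rough DB size
  du -sh /tmp/osd-2-bluefs/db

I assume that's roughly how Tom got the DB size he mentions below.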
On Tue, Jun 25, 2019 at 1:52 AM Thomas Byrne - UKRI STFC <[email protected]> wrote:
>
> I hadn't tried manual compaction, but it did the trick. The db shrunk down
> to 50MB and the OSD booted instantly. Thanks!
>
> I'm confused as to why the OSDs weren't doing this themselves, especially
> as the operation only took a few seconds. But for now I'm happy that this
> is easy to rectify if we run into it again.
>
> I've uploaded the log of a slow boot with debug_bluestore turned up [1],
> and I can provide other logs/files if anyone thinks they could be useful.
>
> Cheers,
> Tom
>
> [1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446
>
> -----Original Message-----
> From: Gregory Farnum <[email protected]>
> Sent: 24 June 2019 17:30
> To: Byrne, Thomas (STFC,RAL,SC) <[email protected]>
> Cc: ceph-users <[email protected]>
> Subject: Re: [ceph-users] OSDs taking a long time to boot due to
> 'clear_temp_objects', even with fresh PGs
>
> On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC
> <[email protected]> wrote:
> >
> > Hi all,
> >
> > Some BlueStore OSDs in our Luminous test cluster have started becoming
> > unresponsive and booting very slowly.
> >
> > These OSDs have been used for stress testing hardware destined for our
> > production cluster, so have had a number of pools on them with many,
> > many objects in the past. All these pools have since been deleted.
> >
> > When booting the OSDs, they spend a few minutes *per PG* in the
> > clear_temp_objects function, even for brand new, empty PGs. The OSD
> > hammers the disk during clear_temp_objects, with a constant ~30MB/s of
> > reads and all available IOPS consumed. The OSD will finish booting and
> > come up fine, but will then start hammering the disk again and fall
> > over at some point later, causing the cluster to gradually fall apart.
> > I'm guessing something is 'not optimal' in the RocksDB.
> >
> > Deleting all pools will stop this behaviour, and OSDs without PGs will
> > reboot quickly and stay up, but creating a pool will cause OSDs that
> > get even a single PG to start exhibiting this behaviour again.
> >
> > These are HDD OSDs, with the WAL and RocksDB on disk. I would guess
> > they are ~1yr old. Upgrading to 12.2.12 did not change this behaviour.
> > A BlueFS export of a problematic OSD's block device reveals a 1.5GB
> > RocksDB (L0 - 63.80 KB, L1 - 62.39 MB, L2 - 116.46 MB, L3 - 1.38 GB),
> > which seems excessive for an empty OSD, but it's also the first time
> > I've looked into this, so it may be normal?
> >
> > Destroying and recreating an OSD resolves the issue for that OSD, which
> > is acceptable for this cluster, but I'm a little concerned a similar
> > thing could happen on a production cluster. Ideally, I would like to
> > understand what has happened before recreating the problematic OSDs.
> >
> > Has anyone got any thoughts on what might have happened, or tips on how
> > to dig further into this?
>
> Have you tried a manual compaction? The only other time I've seen this
> reported was for FileStore-on-ZFS, and it was just very slow at metadata
> scanning for some reason. ("[ceph-users] Hammer to Jewel Upgrade - Extreme
> OSD Boot Time") There has been at least one PR about object listings being
> slow in BlueStore when there are a lot of deleted objects, which would
> match up with your many deleted pools/objects.
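>
> (For reference, with the OSD stopped, an offline compaction should be
> something along the lines of the following. This is off the top of my
> head and assumes the default OSD data path, so double-check it first:
>
>   # compact the BlueStore RocksDB of a stopped OSD; substitute the OSD id
>   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
>
> It only operates on the kv store, so on a mostly-empty OSD it should be
> quick.)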
>
> If you have any debug logs, the BlueStore devs might be interested in them
> to check if the most recent patches will fix it.
> -Greg

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
