I hadn't tried manual compaction, but it did the trick. The DB shrank down to 
50MB and the OSD booted instantly. Thanks!
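
For anyone else who hits this, a rough sketch of the offline compaction (run 
with the OSD stopped; the data path and unit name below assume a default 
deployment):

    systemctl stop ceph-osd@<id>
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
    systemctl start ceph-osd@<id>

I believe newer releases can also trigger the same compaction online via the 
admin socket ('ceph daemon osd.<id> compact'), and the DB size can be checked 
under the bluefs section of 'ceph daemon osd.<id> perf dump'.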

I'm confused as to why the OSDs weren't doing this themselves, especially as 
the operation only took a few seconds. But for now I'm happy that this is easy 
to rectify if we run into it again.

I've uploaded the log of a slow boot with debug_bluestore turned up [1], and I 
can provide other logs/files if anyone thinks they could be useful.
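
(For reference, debug_bluestore is the standard debug option, set either in 
ceph.conf, e.g.:

    [osd]
        debug bluestore = 20/20

or injected into a running OSD with 'ceph daemon osd.<id> config set 
debug_bluestore 20/20'; 20 is the maximum level.)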

Cheers,
Tom
 
[1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446

-----Original Message-----
From: Gregory Farnum <gfar...@redhat.com> 
Sent: 24 June 2019 17:30
To: Byrne, Thomas (STFC,RAL,SC) <tom.by...@stfc.ac.uk>
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] OSDs taking a long time to boot due to 
'clear_temp_objects', even with fresh PGs

On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC <tom.by...@stfc.ac.uk> 
wrote:
>
> Hi all,
>
> Some bluestore OSDs in our Luminous test cluster have started becoming 
> unresponsive and booting very slowly.
>
> These OSDs have been used for stress-testing hardware destined for our 
> production cluster, so they have had a number of pools on them with many, 
> many objects in the past. All these pools have since been deleted.
>
> When booting the OSDs, they spend a few minutes *per PG* in the 
> clear_temp_objects function, even for brand new, empty PGs. The OSD hammers 
> the disk during clear_temp_objects, with a constant ~30MB/s read and all 
> available IOPS consumed. The OSD will finish booting and come up fine, but 
> will then start hammering the disk again and fall over at some point later, 
> causing the cluster to gradually fall apart. I'm guessing something is 'not 
> optimal' in RocksDB.
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
> quickly and stay up, but creating a pool will cause OSDs that get even a 
> single PG to start exhibiting this behaviour again.
>
> These are HDD OSDs, with the WAL and RocksDB on disk. I would guess they are 
> ~1yr old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export 
> of a problematic OSD's block device reveals a 1.5GB RocksDB (L0 - 63.80 KB, 
> L1 - 62.39 MB, L2 - 116.46 MB, L3 - 1.38 GB), which seems excessive for an 
> empty OSD, but this is also the first time I've looked into this, so it may 
> be normal?
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is 
> acceptable for this cluster, but I'm a little concerned a similar thing could 
> happen on a production cluster. Ideally, I would like to try and understand 
> what has happened before recreating the problematic OSDs.
>
> Has anyone got any thoughts on what might have happened, or tips on how to 
> dig further into this?

Have you tried a manual compaction? The only other time I've seen this 
reported was for FileStore-on-ZFS, and it was just very slow at metadata 
scanning for some reason ("[ceph-users] Hammer to Jewel Upgrade - Extreme OSD 
Boot Time"). There has been at least one PR about object listings being slow 
in BlueStore when there are a lot of deleted objects, which would match up 
with your many deleted pools/objects.

If you have any debug logs, the BlueStore devs might be interested in them to 
check whether the most recent patches will fix it.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
