On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC
<[email protected]> wrote:
>
> Hi all,
>
> Some BlueStore OSDs in our Luminous test cluster have started becoming
> unresponsive and booting very slowly.
>
> These OSDs have been used to stress-test hardware destined for our
> production cluster, so have had a number of pools on them with many, many
> objects in the past. All these pools have since been deleted.
>
> When booting, the OSDs spend a few minutes *per PG* in the
> clear_temp_objects function, even for brand new, empty PGs. The OSD
> hammers the disk during clear_temp_objects, with a constant ~30MB/s read
> and all available IOPS consumed. The OSD will finish booting and come up
> fine, but will then start hammering the disk again and fall over at some
> point later, causing the cluster to gradually fall apart. I'm guessing
> something is 'not optimal' in RocksDB.
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
> quickly and stay up, but creating a pool will cause OSDs that get even a 
> single PG to start exhibiting this behaviour again.
>
> These are HDD OSDs, with WAL and RocksDB on disk. I would guess they are ~1yr
> old. Upgrading to 12.2.12 did not change this behaviour. A BlueFS export of a
> problematic OSD's block device reveals a 1.5GB RocksDB (L0: 63.80 KB, L1:
> 62.39 MB, L2: 116.46 MB, L3: 1.38 GB), which seems excessive for an empty
> OSD, but this is the first time I've looked into this, so it may be normal?
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is 
> acceptable for this cluster, but I'm a little concerned a similar thing could 
> happen on a production cluster. Ideally, I would like to try and understand 
> what has happened before recreating the problematic OSDs.
>
> Has anyone got any thoughts on what might have happened, or tips on how to 
> dig further into this?
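
As a quick sanity check on the level sizes quoted above, here is a
minimal sketch that adds them up. It assumes the KB/MB/GB figures are
binary units (factor 1024), which the original mail does not state; the
function names are just illustrative. Under that assumption the levels
total about 1.55 GB, consistent with the "1.5GB" figure, and L3 alone is
roughly 89% of it.

```python
# Per-level RocksDB sizes as quoted in the mail, converted to MB.
# Assumption: KB/MB/GB are binary units (factor 1024).
LEVEL_SIZES_MB = {
    "L0": 63.80 / 1024,   # 63.80 KB
    "L1": 62.39,          # 62.39 MB
    "L2": 116.46,         # 116.46 MB
    "L3": 1.38 * 1024,    # 1.38 GB
}


def total_gb(levels):
    """Total size of all levels, in GB."""
    return sum(levels.values()) / 1024


def share(levels, name):
    """Fraction of the total size held by one level."""
    return levels[name] / sum(levels.values())


if __name__ == "__main__":
    print(f"total: {total_gb(LEVEL_SIZES_MB):.2f} GB")
    print(f"L3 share: {share(LEVEL_SIZES_MB, 'L3'):.0%}")
```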

Have you tried a manual compaction? The only other time I've seen this
reported was for FileStore-on-ZFS, where metadata scanning was just very
slow for some reason ("[ceph-users] Hammer to Jewel Upgrade - Extreme OSD
Boot Time"). There has been at least one PR about object listings being
slow in BlueStore when there are a lot of deleted objects, which would
match up with your many deleted pools/objects.
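
For the manual compaction, one option is an offline compaction with
ceph-kvstore-tool while the OSD is stopped. The sketch below only builds
and prints the command by default. It assumes the conventional data path
/var/lib/ceph/osd/ceph-<id> and that your ceph-kvstore-tool build
supports the bluestore-kv store type, so please check both on your
release first. The OSD id 12 and the helper names are hypothetical.

```python
import shlex
import subprocess


def build_compact_cmd(osd_id, data_root="/var/lib/ceph/osd"):
    """Return the argv for an offline RocksDB compaction of one OSD.

    Assumption: the OSD data directory is <data_root>/ceph-<osd_id>.
    """
    osd_path = f"{data_root}/ceph-{osd_id}"
    return ["ceph-kvstore-tool", "bluestore-kv", osd_path, "compact"]


def compact_osd(osd_id, dry_run=True):
    """Print the compaction command, or run it when dry_run is False.

    The OSD daemon must be stopped before running this for real.
    """
    cmd = build_compact_cmd(osd_id)
    if dry_run:
        print(" ".join(shlex.quote(part) for part in cmd))
        return 0
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    compact_osd(12)  # hypothetical OSD id; dry run only prints the command
```

Comparing the boot time and the time spent in clear_temp_objects before
and after the compaction should show whether it helps.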

If you have any debug logs, the BlueStore devs might be interested in
them to check whether the most recent patches will fix it.
-Greg
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com