Awesome. I made a ticket and pinged the Bluestore guys about it: http://tracker.ceph.com/issues/40557
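For anyone who wants to check whether their own OSDs are carrying an
oversized DB: the BlueFS contents of a stopped OSD can be dumped to a plain
directory for inspection with something like the sketch below. The OSD id
and output directory here are illustrative, so adjust them for your setup.

  # dump the BlueFS (RocksDB) contents of a stopped OSD to a directory
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-2 --out-dir /tmp/osd-2-bluefs
  # the db/ subdirectory holds the RocksDB SST files; this gives a rough DB size
  du -sh /tmp/osd-2-bluefs/db

I assume that's roughly how Tom got the DB size he mentions below.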
On Tue, Jun 25, 2019 at 1:52 AM Thomas Byrne - UKRI STFC <[email protected]> wrote:
>
> I hadn't tried manual compaction, but it did the trick. The db shrunk down
> to 50MB and the OSD booted instantly. Thanks!
>
> I'm confused as to why the OSDs weren't doing this themselves, especially
> as the operation only took a few seconds. But for now I'm happy that this
> is easy to rectify if we run into it again.
>
> I've uploaded the log of a slow boot with debug_bluestore turned up [1],
> and I can provide other logs/files if anyone thinks they could be useful.
>
> Cheers,
> Tom
>
> [1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446
>
> -----Original Message-----
> From: Gregory Farnum <[email protected]>
> Sent: 24 June 2019 17:30
> To: Byrne, Thomas (STFC,RAL,SC) <[email protected]>
> Cc: ceph-users <[email protected]>
> Subject: Re: [ceph-users] OSDs taking a long time to boot due to
> 'clear_temp_objects', even with fresh PGs
>
> On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC
> <[email protected]> wrote:
> >
> > Hi all,
> >
> > Some BlueStore OSDs in our Luminous test cluster have started becoming
> > unresponsive and booting very slowly.
> >
> > These OSDs have been used for stress testing hardware destined for our
> > production cluster, so have had a number of pools on them with many,
> > many objects in the past. All these pools have since been deleted.
> >
> > When booting the OSDs, they spend a few minutes *per PG* in the
> > clear_temp_objects function, even for brand new, empty PGs. The OSD
> > hammers the disk during clear_temp_objects, with a constant ~30MB/s of
> > reads and all available IOPS consumed. The OSD will finish booting and
> > come up fine, but will then start hammering the disk again and fall
> > over at some point later, causing the cluster to gradually fall apart.
> > I'm guessing something is 'not optimal' in the RocksDB.
> >
> > Deleting all pools will stop this behaviour, and OSDs without PGs will
> > reboot quickly and stay up, but creating a pool will cause OSDs that
> > get even a single PG to start exhibiting this behaviour again.
> >
> > These are HDD OSDs, with the WAL and RocksDB on disk. I would guess
> > they are ~1yr old. Upgrading to 12.2.12 did not change this behaviour.
> > A BlueFS export of a problematic OSD's block device reveals a 1.5GB
> > RocksDB (L0 - 63.80 KB, L1 - 62.39 MB, L2 - 116.46 MB, L3 - 1.38 GB),
> > which seems excessive for an empty OSD, but it's also the first time
> > I've looked into this, so it may be normal?
> >
> > Destroying and recreating an OSD resolves the issue for that OSD, which
> > is acceptable for this cluster, but I'm a little concerned a similar
> > thing could happen on a production cluster. Ideally, I would like to
> > understand what has happened before recreating the problematic OSDs.
> >
> > Has anyone got any thoughts on what might have happened, or tips on how
> > to dig further into this?
>
> Have you tried a manual compaction? The only other time I've seen this
> reported was for FileStore-on-ZFS, and it was just very slow at metadata
> scanning for some reason. ("[ceph-users] Hammer to Jewel Upgrade - Extreme
> OSD Boot Time") There has been at least one PR about object listings being
> slow in BlueStore when there are a lot of deleted objects, which would
> match up with your many deleted pools/objects.
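>
> (For reference, with the OSD stopped, an offline compaction should be
> something along the lines of the following. This is off the top of my
> head and assumes the default OSD data path, so double-check it first:
>
>   # compact the BlueStore RocksDB of a stopped OSD; substitute the OSD id
>   ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
>
> It only operates on the kv store, so on a mostly-empty OSD it should be
> quick.)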
>
> If you have any debug logs, the BlueStore devs might be interested in them
> to check if the most recent patches will fix it.
> -Greg

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
