Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-26 Thread Gregory Farnum
Awesome. I made a ticket and pinged the Bluestore guys about it:
http://tracker.ceph.com/issues/40557

On Tue, Jun 25, 2019 at 1:52 AM Thomas Byrne - UKRI STFC wrote:
>
> I hadn't tried manual compaction, but it did the trick. The db shrunk down to 
> 50MB and the OSD booted instantly. Thanks!
>
> I'm confused as to why the OSDs weren't doing this themselves, especially as 
> the operation only took a few seconds. But for now I'm happy that this is 
> easy to rectify if we run into it again.
>
> I've uploaded the log of a slow boot with debug_bluestore turned up [1], and 
> I can provide other logs/files if anyone thinks they could be useful.
>
> Cheers,
> Tom
>
> [1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446
>
> -Original Message-
> From: Gregory Farnum 
> Sent: 24 June 2019 17:30
> To: Byrne, Thomas (STFC,RAL,SC) 
> Cc: ceph-users 
> Subject: Re: [ceph-users] OSDs taking a long time to boot due to 
> 'clear_temp_objects', even with fresh PGs
>
> On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC 
>  wrote:
> >
> > Hi all,
> >
> >
> >
> > Some bluestore OSDs in our Luminous test cluster have started becoming 
> > unresponsive and booting very slowly.
> >
> >
> >
> > These OSDs have been used for stress testing for hardware destined for our 
> > production cluster, so have had a number of pools on them with many, many 
> > objects in the past. All these pools have since been deleted.
> >
> >
> >
> > When booting the OSDs, they spend a few minutes *per PG* in 
> > clear_temp_objects function, even for brand new, empty PGs. The OSD is 
> > hammering the disk during the clear_temp_objects, with a constant ~30MB/s 
> > read and all available IOPS consumed. The OSD will finish booting and come 
> > up fine, but will then start hammering the disk again and fall over at some 
> > point later, causing the cluster to gradually fall apart. I'm guessing 
> > something is 'not optimal' in the rocksDB.
> >
> >
> >
> > Deleting all pools will stop this behaviour and OSDs without PGs will 
> > reboot quickly and stay up, but creating a pool will cause OSDs that get 
> > even a single PG to start exhibiting this behaviour again.
> >
> >
> >
> > These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are 
> > ~1yr old. Upgrading to 12.2.12 did not change this behaviour. A blueFS 
> > export of a problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 
> > 63.80 KB, L1 - 62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems 
> > excessive for an empty OSD, but it's also the first time I've looked into 
> > this so may be normal?
> >
> >
> >
> > Destroying and recreating an OSD resolves the issue for that OSD, which is 
> > acceptable for this cluster, but I'm a little concerned a similar thing 
> > could happen on a production cluster. Ideally, I would like to try and 
> > understand what has happened before recreating the problematic OSDs.
> >
> >
> >
> > Has anyone got any thoughts on what might have happened, or tips on how to 
> > dig further into this?
>
> Have you tried a manual compaction? The only other time I see this being 
> reported was for FileStore-on-ZFS and it was just very slow at metadata 
> scanning for some reason. ("[ceph-users] Hammer to Jewel Upgrade - Extreme 
> OSD Boot Time") There has been at least one PR about object listings being 
> slow in BlueStore when there are a lot of deleted objects, which would match 
> up with your many deleted pools/objects.
>
> If you have any debug logs the BlueStore devs might be interested in them to 
> check if the most recent patches will fix it.
> -Greg


Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-25 Thread Thomas Byrne - UKRI STFC
I hadn't tried manual compaction, but it did the trick. The DB shrank down
to 50 MB and the OSD booted instantly. Thanks!
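
For reference, a minimal sketch of what an offline compaction looks like, in
case anyone else hits this (the OSD id and data path below are placeholders
for our setup, and this assumes your ceph-kvstore-tool build has the
bluestore-kv backend):

  # stop the OSD first; the tool needs exclusive access to the store
  systemctl stop ceph-osd@12
  # compact the OSD's embedded RocksDB in place
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
  systemctl start ceph-osd@12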

I'm confused as to why the OSDs weren't doing this themselves, especially as 
the operation only took a few seconds. But for now I'm happy that this is easy 
to rectify if we run into it again.

I've uploaded the log of a slow boot with debug_bluestore turned up [1], and I 
can provide other logs/files if anyone thinks they could be useful.
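
In case it's useful to anyone else, capturing a boot log like that is roughly
the following (the OSD id and paths are placeholders rather than the exact
values I used):

  # with "debug bluestore = 20" set under [osd] in ceph.conf on that host,
  # restart the OSD to capture the slow boot, then upload the resulting log
  systemctl restart ceph-osd@12
  ceph-post-file /var/log/ceph/ceph-osd.12.log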

Cheers,
Tom
 
[1] ceph-post-file: 1829bf40-cce1-4f65-8b35-384935d11446

-----Original Message-----
From: Gregory Farnum  
Sent: 24 June 2019 17:30
To: Byrne, Thomas (STFC,RAL,SC) 
Cc: ceph-users 
Subject: Re: [ceph-users] OSDs taking a long time to boot due to 
'clear_temp_objects', even with fresh PGs

On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC  
wrote:
>
> Hi all,
>
>
>
> Some bluestore OSDs in our Luminous test cluster have started becoming 
> unresponsive and booting very slowly.
>
>
>
> These OSDs have been used for stress testing for hardware destined for our 
> production cluster, so have had a number of pools on them with many, many 
> objects in the past. All these pools have since been deleted.
>
>
>
> When booting the OSDs, they spend a few minutes *per PG* in 
> clear_temp_objects function, even for brand new, empty PGs. The OSD is 
> hammering the disk during the clear_temp_objects, with a constant ~30MB/s 
> read and all available IOPS consumed. The OSD will finish booting and come up 
> fine, but will then start hammering the disk again and fall over at some 
> point later, causing the cluster to gradually fall apart. I'm guessing 
> something is 'not optimal' in the rocksDB.
>
>
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
> quickly and stay up, but creating a pool will cause OSDs that get even a 
> single PG to start exhibiting this behaviour again.
>
>
>
> These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are ~1yr 
> old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export of a 
> problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 63.80 KB, L1 - 
> 62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems excessive for an empty 
> OSD, but it's also the first time I've looked into this so may be normal?
>
>
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is 
> acceptable for this cluster, but I'm a little concerned a similar thing could 
> happen on a production cluster. Ideally, I would like to try and understand 
> what has happened before recreating the problematic OSDs.
>
>
>
> Has anyone got any thoughts on what might have happened, or tips on how to 
> dig further into this?

Have you tried a manual compaction? The only other time I see this being 
reported was for FileStore-on-ZFS and it was just very slow at metadata 
scanning for some reason. ("[ceph-users] Hammer to Jewel Upgrade - Extreme OSD 
Boot Time") There has been at least one PR about object listings being slow in 
BlueStore when there are a lot of deleted objects, which would match up with 
your many deleted pools/objects.

If you have any debug logs the BlueStore devs might be interested in them to 
check if the most recent patches will fix it.
-Greg


Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-24 Thread Gregory Farnum
On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC wrote:
>
> Hi all,
>
>
>
> Some bluestore OSDs in our Luminous test cluster have started becoming 
> unresponsive and booting very slowly.
>
>
>
> These OSDs have been used for stress testing for hardware destined for our 
> production cluster, so have had a number of pools on them with many, many 
> objects in the past. All these pools have since been deleted.
>
>
>
> When booting the OSDs, they spend a few minutes *per PG* in 
> clear_temp_objects function, even for brand new, empty PGs. The OSD is 
> hammering the disk during the clear_temp_objects, with a constant ~30MB/s 
> read and all available IOPS consumed. The OSD will finish booting and come up 
> fine, but will then start hammering the disk again and fall over at some 
> point later, causing the cluster to gradually fall apart. I'm guessing 
> something is 'not optimal' in the rocksDB.
>
>
>
> Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
> quickly and stay up, but creating a pool will cause OSDs that get even a 
> single PG to start exhibiting this behaviour again.
>
>
>
> These are HDD OSDs, with WAL and rocksDB on disk. I would guess they are ~1yr 
> old. Upgrading to 12.2.12 did not change this behaviour. A blueFS export of a 
> problematic OSD's block device reveals a 1.5GB rocksDB (L0 - 63.80 KB, L1 - 
> 62.39 MB,  L2 - 116.46 MB,  L3 - 1.38 GB), which seems excessive for an empty 
> OSD, but it's also the first time I've looked into this so may be normal?
>
>
>
> Destroying and recreating an OSD resolves the issue for that OSD, which is 
> acceptable for this cluster, but I'm a little concerned a similar thing could 
> happen on a production cluster. Ideally, I would like to try and understand 
> what has happened before recreating the problematic OSDs.
>
>
>
> Has anyone got any thoughts on what might have happened, or tips on how to 
> dig further into this?

Have you tried a manual compaction? The only other time I've seen this
reported was for FileStore-on-ZFS, and it was just very slow at metadata
scanning for some reason ("[ceph-users] Hammer to Jewel Upgrade - Extreme
OSD Boot Time"). There has been at least one PR about object listings being
slow in BlueStore when there are a lot of deleted objects, which would match
up with your many deleted pools/objects.
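
If you want to check whether plain object listings are the slow part without
waiting for a full boot, something like this (run offline with the OSD
stopped; the data path and PG id are placeholders) should exercise roughly
the same listing path:

  # enumerate the objects in one of the nominally-empty PGs; if the DB is
  # full of tombstones this can take surprisingly long
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
      --pgid 1.0 --op list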

If you have any debug logs the BlueStore devs might be interested in
them to check if the most recent patches will fix it.
-Greg


[ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-24 Thread Thomas Byrne - UKRI STFC
Hi all,
Some bluestore OSDs in our Luminous test cluster have started becoming 
unresponsive and booting very slowly.
These OSDs have been used for stress testing of hardware destined for our
production cluster, so they have had a number of pools on them with many,
many objects in the past. All of these pools have since been deleted.
When booting, the OSDs spend a few minutes *per PG* in the clear_temp_objects
function, even for brand new, empty PGs. The OSD hammers the disk throughout
clear_temp_objects, with a constant ~30 MB/s of reads and all available IOPS
consumed. The OSD will finish booting and come up fine, but will then start
hammering the disk again and fall over at some point later, causing the
cluster to gradually fall apart. I'm guessing something is 'not optimal' in
RocksDB.
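
For anyone wanting to see the same pattern, watching the OSD's data device
while it boots shows it clearly enough; a rough sketch, with the device name
as a placeholder:

  # per-device throughput and utilisation at 5-second intervals; the affected
  # OSDs sit at roughly 30 MB/s of reads with %util pinned near 100
  iostat -xm /dev/sdc 5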
Deleting all pools will stop this behaviour and OSDs without PGs will reboot 
quickly and stay up, but creating a pool will cause OSDs that get even a single 
PG to start exhibiting this behaviour again.
These are HDD OSDs, with the WAL and RocksDB on the same disk, and I would
guess they are about a year old. Upgrading to 12.2.12 did not change this
behaviour. A BlueFS export of a problematic OSD's block device reveals a
1.5 GB RocksDB (L0 - 63.80 KB, L1 - 62.39 MB, L2 - 116.46 MB, L3 - 1.38 GB),
which seems excessive for an empty OSD, but this is also the first time I've
looked into this, so it may be normal?
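
For completeness, the export was done with ceph-bluestore-tool's bluefs-export
command; roughly the following, with the paths as placeholders and the OSD
stopped:

  # dump the BlueFS contents (i.e. the RocksDB files) out to a directory so
  # the DB size can be inspected on the filesystem
  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-12 \
      --out-dir /tmp/osd.12-bluefs
  du -sh /tmp/osd.12-bluefs/db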
Destroying and recreating an OSD resolves the issue for that OSD, which is
acceptable for this cluster, but I'm a little concerned that a similar thing
could happen on a production cluster. Ideally, I would like to understand
what has happened before recreating the problematic OSDs.
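
For reference, rebuilding one of these OSDs in place is roughly the following
(the OSD id and device are placeholders, and the exact ceph-volume invocation
may vary between 12.2.x point releases):

  systemctl stop ceph-osd@12
  # take the OSD out and remove it from the CRUSH map, auth and OSD map
  ceph osd out 12
  ceph osd purge 12 --yes-i-really-mean-it
  # wipe the old device and create a fresh BlueStore OSD on it
  ceph-volume lvm zap /dev/sdc --destroy
  ceph-volume lvm create --data /dev/sdc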
Has anyone got any thoughts on what might have happened, or tips on how to dig 
further into this?
Cheers,
Tom

Tom Byrne
Storage System Administrator
Scientific Computing Department
Science and Technology Facilities Council
Rutherford Appleton Laboratory
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com