Re: [ceph-users] Sudden omap growth on some OSDs

george.vasilakakos Tue, 12 Dec 2017 03:17:49 -0800

On 11 Dec 2017, at 18:24, Gregory Farnum 
<gfar...@redhat.com<mailto:gfar...@redhat.com>> wrote:


Hmm, this does all sound odd. Have you tried just restarting the primary OSD 
yet? That frequently resolves transient oddities like this.
If not, I'll go poke at the kraken source and one of the developers more 
familiar with the recovery processes we're seeing here.
-Greg


Hi Greg,

I’ve tried this, no effect. Also, on Friday, we tried removing an OSD (not the 
primary), the OSD that was chosen to replace it had it’s LevelDB grow to 7GiB 
by now. Yesterday it was 5.3.
We’re not seeing any errors logged by the OSDs with the default logging level 
either.

Do you have any comments on the fact that the primary sees the PG’s state as 
being different to what the peers think?
Now, with a new primary I’m seeing the last peer in the set reporting it’s 
‘active+clean’, as is the primary, all other are saying it’s 
‘active+clean+degraded’ (according to PG query output).

This problem is quite weird I think. I copied a LevelDB and dumped a key list; 
the largest in GiB had 66% the number of keys that the average LevelDB has. The 
main difference with the ones that have been around for a while is that they 
have a lot more files that were last touched on the days when the problem 
started but most other LevelDBs have compacted those away and only have about 7 
days old files (as opposed to 3 week old ones that the big ones keep around). 
The big ones do seem to do compactions, they just don’t seem to get rid of that 
stuff.



Thanks,

George


On Fri, Dec 8, 2017 at 7:30 AM 
<george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk>> wrote:

________________________________
From: Gregory Farnum [gfar...@redhat.com<mailto:gfar...@redhat.com>]
Sent: 07 December 2017 21:57
To: Vasilakakos, George (STFC,RAL,SC)
Cc: drakonst...@gmail.com<mailto:drakonst...@gmail.com>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Sudden omap growth on some OSDs



On Thu, Dec 7, 2017 at 4:41 AM 
<george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk><mailto:george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk>>>
 wrote:

________________________________
From: Gregory Farnum 
[gfar...@redhat.com<mailto:gfar...@redhat.com><mailto:gfar...@redhat.com<mailto:gfar...@redhat.com>>]
Sent: 06 December 2017 22:50
To: David Turner
Cc: Vasilakakos, George (STFC,RAL,SC); 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com><mailto:ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Sudden omap growth on some OSDs

On Wed, Dec 6, 2017 at 2:35 PM David Turner 
<drakonst...@gmail.com<mailto:drakonst...@gmail.com><mailto:drakonst...@gmail.com<mailto:drakonst...@gmail.com>><mailto:drakonst...@gmail.com<mailto:drakonst...@gmail.com><mailto:drakonst...@gmail.com<mailto:drakonst...@gmail.com>>>>
 wrote:
I have no proof or anything other than a hunch, but OSDs don't trim omaps 
unless all PGs are healthy.  If this PG is actually not healthy, but the 
cluster doesn't realize it while these 11 involved OSDs do realize that the PG 
is unhealthy... You would see this exact problem.  The OSDs think a PG is 
unhealthy so they aren't trimming their omaps while the cluster doesn't seem to 
be aware of it and everything else is trimming their omaps properly.

I think you're confusing omaps and OSDMaps here. OSDMaps, like omap, are stored 
in leveldb, but they have different trimming rules.


I don't know what to do about it, but I hope it helps get you (or someone else 
on the ML) towards a resolution.

On Wed, Dec 6, 2017 at 1:59 PM 
<george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk><mailto:george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk>><mailto:george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk><mailto:george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk>>>>
 wrote:
Hi ceph-users,

We have a Ceph cluster (running Kraken) that is exhibiting some odd behaviour.
A couple weeks ago, the LevelDBs on some our OSDs started growing large (now at 
around 20G size).

The one thing they have in common is the 11 disks with inflating LevelDBs are 
all in the set for one PG in one of our pools (EC 8+3). This pool started to 
see use around the time the LevelDBs started inflating. Compactions are running 
and they do go down in size a bit but the overall trend is one of rapid growth. 
The other 2000+ OSDs in the cluster have LevelDBs between 650M and 1.2G.
This PG has nothing to separate it from the others in its pool, within 5% of 
average number of objects per PG, no hot-spotting in terms of load, no weird 
states reported by ceph status.

The one odd thing about it is the pg query output mentions it is active+clean, 
but it has a recovery state, which it enters every morning between 9 and 10am, 
where it mentions a "might_have_unfound" situation and having probed all other 
set members. A deep scrub of the PG didn't turn up anything.

You need to be more specific here. What do you mean it "enters into" the 
recovery state every morning?

Here's what PG query showed me yesterday:
    "recovery_state": [
        {
            "name": "Started\/Primary\/Active",
            "enter_time": "2017-12-05 09:48:57.730385",
            "might_have_unfound": [
                {
                    "osd": "79(1)",
                    "status": "already probed"
                },
                {
                    "osd": "337(9)",
                    "status": "already probed"
                },... it goes on to list all peers of this OSD in that PG.

IIRC that's just a normal thing when there's any kind of recovery happening — 
it builds up a set during peering of OSDs that might have data, in case it 
discovers stuff missing.

OK. But this is the only PG mentioning "might_have_unfound" across the two most 
used pools in our cluster and it's the only one that has all of its omap dirs 
at sizes more than 15 times the average for the cluster.



How many PGs are in your 8+3 pool, and are all your OSDs hosting EC pools? What 
are you using the cluster for?

2048 PGs in this pool, also another 2048 PG EC pool (same profile) and two more 
1024 PG EC pools (also same profile). Then a set of RGW auxiliary pools with 
3-way replication.
I'm not 100% sure but I think all of our OSDs should have a few PGs from one of 
the EC pools. Our rules don't make a distinction so it's probabilistic. We're 
using the cluster as an object store, minor RGW use and custom gateways using 
libradosstriper.
It's also worth pointing out that an OSD in that PG was taken out of the 
cluster earlier today and pg query shows the following weirdness:
The primary thinks it's active+clean but, in the peer_info section all peers 
report it is "active+undersized+degraded+remapped+backfilling". It has shown 
this discrepancy before between the primary thinking it's a+c and the rest of 
the set seeing it a+c+degraded.

Again, exactly what output makes you say the primary thinks it's active+clean 
but the others have more complex recovery states?

I've pastebin'ed a query output: https://pastebin.com/PbGSaZyF. The PG state is 
reported "active+clean" in the top section but the peer_info sections have it 
"active+undersized+degraded". I may be reading the query output wrong but this 
seems odd.

In the query output we're showing the following for recovery state:
"recovery_state": [
        {
            "name": "Started\/Primary\/Active",
            "enter_time": "2017-12-07 08:41:57.850220",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "recovery_ops": [],
                    "read_ops": []
                }


The cluster is now starting to manifest slow requests on the OSDs with the 
large LevelDBs, although not in the particular PG.

Well, there have been a few causes of large LevelDBs, but given that you have 
degraded PGs and a bunch of EC pools, my guess is that the PG logs are getting 
extended thanks to the PG states. EC PG logs can be much larger than replicated 
ones, since EC pools need to be able to reverse the IO in those cases. So you 
need to get your PGs clean first and then see if the LevelDB shrinks down or 
not.

To be expected I guess. However, the average LevelDB in this cluster is around 
200MB, all OSDs in this PG are in the multi-gigabyte range with a couple 
pushing the 20GB mark. The numbers on this PG do not otherwise stand out, so 
I'm wondering what makes this PG so special. No PGs on these OSDs are reporting 
any other state than active+clean.

Thanks,

George

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Sudden omap growth on some OSDs

Reply via email to