Hi Greg,

> This does sound weird, but I also notice that in your earlier email you
> seemed to have only ~5k PGs across  ~1400 OSDs, which is a pretty
> low number. You may just have a truly horrible PG balance; can you share
> more details (eg ceph osd df)?


Our distribution is pretty bad: the most-filled disk is approaching the 
nearfull ratio and already holds more than twice as much data as the 
cluster-wide fill ratio. My view is that we need to at least double the PG 
count across the cluster. Here's some data: 
https://pastebin.com/qX0LXxid

However, I think this particular issue is down to compaction problems. The 
oldest SST files in the largest LevelDBs date back to Feb 21 (the oldest files 
in normal-sized LevelDBs are no more than a week old):

# du -sh /var/lib/ceph/osd/ceph-348/current/omap/
66G     /var/lib/ceph/osd/ceph-348/current/omap/
# ll -t /var/lib/ceph/osd/ceph-348/current/omap/ | tail
-rw-r--r--. 1 ceph ceph  2109703 Feb 21 01:07 013472.sst
-rw-r--r--. 1 ceph ceph  2104172 Feb 21 01:07 013470.sst
-rw-r--r--. 1 ceph ceph  2102942 Feb 21 01:07 013468.sst
-rw-r--r--. 1 ceph ceph  2102906 Feb 21 01:04 013446.sst
-rw-r--r--. 1 ceph ceph  2102977 Feb 21 01:04 013444.sst
-rw-r--r--. 1 ceph ceph  2102667 Feb 21 01:04 013442.sst
-rw-r--r--. 1 ceph ceph  2102903 Feb 21 01:04 013440.sst
-rw-r--r--. 1 ceph ceph      172 Jan  6 15:45 LOG
-rw-r--r--. 1 ceph ceph       57 Jan  6 15:45 LOG.old
-rw-r--r--. 1 ceph ceph        0 Jan  6 15:45 LOCK

The corresponding daemon has been running for a while:

# systemctl status ceph-osd@348
● [email protected] - Ceph object storage daemon osd.348
   Loaded: loaded (/usr/lib/systemd/system/[email protected]; enabled; vendor 
preset: disabled)
   Active: active (running) since Mon 2017-03-13 14:23:27 GMT; 2 months 11 days 
ago

I have confirmed this is the case for the three largest LevelDBs.
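The checks above (du for total size, ls -t for SST ages) can be sketched as a 
small helper; `omap_sst_summary` is a hypothetical name, not a Ceph tool, and 
it simply reads the omap directory the same way the shell commands do:

```python
from pathlib import Path

def omap_sst_summary(omap_dir):
    """Summarise a LevelDB omap directory: total size of the *.sst
    files in bytes, and the mtime of the oldest SST (None if there
    are no SSTs). Mirrors `du -s` plus `ls -t | tail` above."""
    ssts = list(Path(omap_dir).glob("*.sst"))
    total = sum(f.stat().st_size for f in ssts)
    oldest = min((f.stat().st_mtime for f in ssts), default=None)
    return total, oldest
```

Pointing this at e.g. /var/lib/ceph/osd/ceph-348/current/omap/ makes it easy 
to rank all OSDs on a host by omap size and oldest-SST age in one pass.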

Given the inflation we observed while the OSD was reporting compaction 
operations, I suspected this might be a compaction issue.

I have performed the following test:

I chose osd.101, which had an average-sized LevelDB, and proceeded to extract 
it and poke around a bit.

- the size of osd.101's omap directory was 25M
- it contained 99627 keys
- when compacted it went down to 15M

By comparison, osd.980's omap directory:

- was 67G in size
- it contained 101773 keys
- when compacted it went down to 44M

Both omaps had similar key and value sizes.
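To put numbers on the comparison (sizes taken from the figures above, reading 
M/G as MiB/GiB), the "bloat factor" of each omap relative to its compacted 
form works out as follows; the dicts and helper are just illustrative:

```python
# Figures from the two omaps described above, sizes in MiB.
osd101 = {"keys": 99627, "size_mib": 25, "compacted_mib": 15}
osd980 = {"keys": 101773, "size_mib": 67 * 1024, "compacted_mib": 44}

def bloat_factor(omap):
    # How much larger the on-disk omap is than its compacted form.
    return omap["size_mib"] / omap["compacted_mib"]
```

So osd.101 carries modest overhead (under 2x), while osd.980 is on the order 
of 1500x its compacted size despite holding a near-identical key count.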

We do not have any options for OSD LevelDB compaction set in ceph.conf, so the 
OSDs compact whenever they see fit. This seems to mostly work. What's 
troubling is that many of these LevelDBs go into a compaction frenzy: the OSD 
spends upwards of an hour compacting, during which time the LevelDB actually 
explodes in size, and it then remains at that size for at least a couple of 
weeks.
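For reference, one knob we have considered but not set is forcing a compaction 
of the omap LevelDB when the OSD mounts it. A minimal ceph.conf fragment would 
look like the below; this assumes the `leveldb_compact_on_mount` option is 
present in our release, which we would want to verify before touching 
production, and it only takes effect on OSD restart:

```ini
[osd]
# Compact the omap LevelDB when the OSD starts (assumption: this
# option exists in our Ceph release; defaults to false).
leveldb_compact_on_mount = true
```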

This seems somewhat similar to http://tracker.ceph.com/issues/13990, which Dan 
van der Ster pointed out, although it's not quite the same behaviour. Is there 
a way we can trigger the OSD to compact, or do it manually, and see what 
happens? How risky is this (this is our production service, after all)?


Thanks,

George
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com