I deleted some old backups and GC is returning some disk space. But the
cluster state is still bad:

2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23
active+remapped+wait_backfill, 1
active+remapped+wait_backfill+backfill_toofull, 2
active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784 GB
used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623 objects
degraded (0.529%)
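(For reference, the used/total figures in that pgmap line work out to
roughly half-full overall; a quick check:)

```shell
# overall utilization from the pgmap line above: 22784 GB used of 46906 GB
awk 'BEGIN { printf "%.1f%%\n", 22784 / 46906 * 100 }'
# → 48.6%
```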

Here's what disk utilization across OSDs looks like:
http://i.imgur.com/RWk9rvW.png

Still, one OSD is super-huge. I don't understand why one PG is toofull if
the biggest OSD shrank from 348gb to 294gb.
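(As I understand it, backfill_toofull is not about the absolute size of an
OSD but about the backfill *target* crossing osd_backfill_full_ratio,
0.85 by default on firefly, on its own disk. A rough sketch with a made-up
400 GB disk, not a number from this cluster:)

```shell
# with the default osd_backfill_full_ratio of 0.85, a hypothetical 400 GB
# OSD disk starts refusing backfill once it holds:
awk 'BEGIN { printf "%.0f GB\n", 400 * 0.85 }'
# → 340 GB
```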

root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
dumped all in format plain
10.f26 1018 0 1811 0 2321324247 3261 3261
active+remapped+wait_backfill+backfill_toofull 2015-01-05 15:06:49.504731
22897'359132 22897:48571 [91,1] 91 [8,40] 8 19248'358872 2015-01-05
11:58:03.062029 18326'358786 2014-12-31 23:43:02.285043
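(Reading that dump line: the two bracketed columns are the up set [91,1]
and the acting set [8,40], i.e. the PG currently lives on osd.8/osd.40 and
is being remapped towards osd.91/osd.1, so the toofull complaint should be
about one of the backfill targets, not the source. A quick way to pull the
sets out of such a line, abbreviated here with "..." for the other columns:)

```shell
# extract the up and acting sets from a "ceph pg dump" line
echo '10.f26 ... backfill_toofull ... [91,1] 91 [8,40] 8 ...' |
  grep -o '\[[0-9,]*\]'
# → [91,1]
# → [8,40]
```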


On 6 January 2015 at 03:40, Christian Balzer <ch...@gol.com> wrote:

> On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:
>
> > Rebalancing is almost finished, but things got even worse:
> > http://i.imgur.com/0HOPZil.png
> >
> Looking at that graph only one OSD really kept growing and growing,
> everything else seems to be a lot denser, less varied than before, as one
> would have expected.
>
> Since I don't think you mentioned it before, what version of Ceph are you
> using and how are your CRUSH tunables set?
>

I'm on 0.80.7 upgraded from 0.80.5. I didn't change CRUSH settings at all.

> > Moreover, one pg is in active+remapped+wait_backfill+backfill_toofull
> > state:
> >
> > 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs: 23
> > active+remapped+wait_backfill, 1
> > active+remapped+wait_backfill+backfill_toofull, 2
> > active+remapped+backfilling, 5805 active+clean, 1
> > active+remapped+backfill_toofull; 11210 GB data, 26174 GB used, 18360
> > GB / 46906 GB avail; 65246/10590590 objects degraded (0.616%)
> >
> > So at 55.8% disk space utilization ceph is full. That doesn't look very
> > good.
> >
> Indeed it doesn't.
>
> At this point you might want to manually lower the weight of that OSD
> (probably have to change the osd_backfill_full_ratio first to let it
> settle).
>

I'm sure that's something ceph should be doing itself, not me.
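(For the record, the 55.8% in the quoted pgmap line is just used over
total capacity:)

```shell
# 26174 GB used of 46906 GB total, from the quoted pgmap line
awk 'BEGIN { printf "%.1f%%\n", 26174 / 46906 * 100 }'
# → 55.8%
```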


> Thanks to Robert for bringing up that blueprint for Hammer, let's
> hope it makes it in and gets backported.
>
> I sure hope somebody from the Ceph team will pipe up, but here's what I
> think is happening:
> You're using radosgw and I suppose many files are so similarly named that
> they wind up clumping on the same PGs (OSDs).
>

Nope, you are wrong here. PGs have roughly the same size; I mentioned that
in my first email. Right now the biggest OSD has 95 PGs and the smallest
one has 59 (I only counted PGs from the biggest pool).
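(For what it's worth, that PG-count spread alone predicts roughly the
size spread we see on disk; a rough check, nothing ceph reports itself:)

```shell
# 95 PGs on the fullest OSD vs 59 on the emptiest: if PGs are roughly
# equal-sized, the expected disk-usage ratio between the two OSDs is
awk 'BEGIN { printf "%.2fx\n", 95 / 59 }'
# → 1.61x
```

which is close to the 213gb/131gb spread in the pg-dump-derived numbers
further down, so the imbalance seems explained by PG placement, not PG
size.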


> Now what I would _think_ could help with that is striping.
>
> However radosgw doesn't support the full striping options as RBD does.
>
> The only thing you can modify is stripe (object) size, which defaults to
> 4MB. And I bet most of your RGW files are less than that in size, meaning
> they wind up on just one PG.
>

Wrong again: I use that cluster for elasticsearch backups and docker
images. That stuff is usually much bigger than 4mb.
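(Quick back-of-the-envelope on why the 4MB default shouldn't matter for
this workload; the 100 MB object size is a made-up example, not a measured
figure:)

```shell
# a 100 MB object written through radosgw gets split into 4 MB rados
# objects, each hashed to its own PG:
awk 'BEGIN { printf "%d chunks\n", 100 / 4 }'
# → 25 chunks
```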

Weird thing: I calculated OSD sizes from "ceph pg dump" and they look
different from actual disk usage. The biggest OSD comes out at 213gb and
the smallest at 131gb. GC isn't finished yet, but that's still very
different from what the cluster currently reports.

# ceph pg dump | grep '^[0-9]\+\.' | awk '{ print $1, $6, $14 }' \
    | sed 's/[][,]/ /g' > pgs.txt
# awk '{ sizes[$3] += $2; sizes[$4] += $2 } END { for (o in sizes) \
    printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 / 1024 }' pgs.txt \
    | sort -n

0 198.18 gb
1 188.74 gb
2 165.94 gb
3 143.28 gb
4 193.37 gb
5 185.87 gb
6 146.46 gb
7 170.67 gb
8 213.93 gb
9 200.22 gb
10 144.05 gb
11 164.44 gb
12 158.27 gb
13 204.96 gb
14 190.04 gb
15 158.48 gb
16 172.86 gb
17 157.05 gb
18 179.82 gb
19 175.86 gb
20 192.63 gb
21 179.82 gb
22 181.30 gb
23 172.97 gb
24 141.21 gb
25 165.63 gb
26 139.87 gb
27 184.18 gb
28 160.75 gb
29 185.88 gb
30 186.13 gb
31 163.38 gb
32 182.92 gb
33 134.82 gb
34 186.56 gb
35 166.91 gb
36 163.49 gb
37 205.59 gb
38 199.26 gb
39 151.43 gb
40 173.23 gb
41 200.54 gb
42 198.07 gb
43 150.48 gb
44 165.54 gb
45 193.87 gb
46 177.05 gb
47 167.97 gb
48 186.68 gb
49 177.68 gb
50 204.94 gb
51 184.52 gb
52 160.11 gb
53 163.33 gb
54 137.28 gb
55 168.97 gb
56 193.08 gb
57 176.87 gb
58 166.36 gb
59 171.98 gb
60 175.50 gb
61 199.39 gb
62 175.31 gb
63 164.54 gb
64 171.26 gb
65 154.86 gb
66 166.39 gb
67 145.15 gb
68 162.55 gb
69 181.13 gb
70 181.18 gb
71 197.67 gb
72 164.79 gb
73 143.85 gb
74 169.17 gb
75 183.67 gb
76 143.16 gb
77 171.91 gb
78 167.75 gb
79 158.36 gb
80 198.83 gb
81 158.26 gb
82 182.52 gb
83 204.65 gb
84 179.78 gb
85 170.02 gb
86 185.70 gb
87 138.91 gb
88 190.66 gb
89 209.43 gb
90 193.54 gb
91 185.00 gb
92 170.31 gb
93 140.11 gb
94 161.69 gb
95 194.53 gb
96 184.35 gb
97 158.74 gb
98 184.39 gb
99 174.83 gb
100 183.30 gb
101 179.82 gb
102 160.84 gb
103 163.29 gb
104 131.92 gb
105 158.09 gb
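(A quick spread summary over that list; sketch only, fed three sample rows
here instead of the full pgs.txt-derived output above:)

```shell
# min/max/average and spread ratio over "osd size" pairs
printf '8 213.93\n104 131.92\n0 198.18\n' |
  awk '{ if (min == "" || $2 < min) min = $2
         if ($2 > max) max = $2
         sum += $2; n++ }
       END { printf "min=%.2f max=%.2f avg=%.2f spread=%.2fx\n",
             min, max, sum / n, max / min }'
# → min=131.92 max=213.93 avg=181.34 spread=1.62x
```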



> Again, would love to hear something from the devs on this one.
>
> Christian
> > On 5 January 2015 at 15:39, ivan babrou <ibob...@gmail.com> wrote:
> >
> > >
> > >
> > > On 5 January 2015 at 14:20, Christian Balzer <ch...@gol.com> wrote:
> > >
> > >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
> > >>
> > >> > Hi!
> > >> >
> > >> > I have a cluster with 106 osds and disk usage is varying from 166gb
> > >> > to 316gb. Disk usage is highly correlated to number of pg per osd
> > >> > (no surprise here). Is there a reason for ceph to allocate more pg
> > >> > on some nodes?
> > >> >
> > >> In essence what Wido said, you're a bit low on PGs.
> > >>
> > >> Also given your current utilization, pool 14 is totally oversized with
> > >> 1024 PGs. You might want to re-create it with a smaller size and
> > >> double pool 0 to 512 PGs and 10 to 4096.
> > >> I assume you did raise the PGPs as well when changing the PGs, right?
> > >>
> > >
> > > Yep, pg = pgp for all pools. Pool 14 is just for testing purposes, it
> > > might get large eventually.
> > >
> > > I followed your advice in doubling pools 0 and 10. It is rebalancing at
> > > 30% degraded now, but so far big osds become bigger and small become
> > > smaller: http://i.imgur.com/hJcX9Us.png. I hope that trend would
> > > change before rebalancing is complete.
> > >
> > >
> > >> And yeah, CEPH isn't particularly good at balancing stuff by itself, but
> > >> with sufficient PGs you ought to get the variance below/around 30%.
> > >>
> > >
> > > Is this going to change in the future releases?
> > >
> > >
> > >> Christian
> > >>
> > >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest
> > >> > are 87, 33 and 55 (170gb each). The biggest pool has 2048 pgs,
> > >> > pools with very little data have only 8 pgs. PG size in the
> > >> > biggest pool is ~6gb (5.1..6.3 actually).
> > >> >
> > >> > Lack of balanced disk usage prevents me from using all the disk
> > >> > space. When the biggest osd is full, the cluster does not accept
> > >> > writes anymore.
> > >> >
> > >> > Here's gist with info about my cluster:
> > >> > https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
> > >> >
> > >>
> > >>
> > >> --
> > >> Christian Balzer        Network/Systems Engineer
> > >> ch...@gol.com           Global OnLine Japan/Fusion Communications
> > >> http://www.gol.com/
> > >>
> > >
> > >
> > >
> > > --
> > > Regards, Ian Babrou
> > > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
> > >
> >
> >
> >
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>



-- 
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
