Restarting the OSDs fixed the PGs that were stuck: http://i.imgur.com/qd5vuzV.png

Still, OSD disk usage varies a lot, 150..250gb. Shall I double PGs again?

On 6 January 2015 at 17:12, ivan babrou <[email protected]> wrote:

> I deleted some old backups and GC is returning some disk space. But
> cluster state is still bad:
>
> 2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23
> active+remapped+wait_backfill, 1
> active+remapped+wait_backfill+backfill_toofull, 2
> active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784 GB
> used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623 objects
> degraded (0.529%)
>
> Here's what disk utilization across OSDs looks like:
> http://i.imgur.com/RWk9rvW.png
>
> Still, one OSD is super-huge. I don't understand why one PG is toofull if
> the biggest OSD shrank from 348gb to 294gb.
>
> root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
> dumped all in format plain
> 10.f26 1018 0 1811 0 2321324247 3261 3261
> active+remapped+wait_backfill+backfill_toofull 2015-01-05 15:06:49.504731
> 22897'359132 22897:48571 [91,1] 91 [8,40] 8 19248'358872 2015-01-05
> 11:58:03.062029 18326'358786 2014-12-31 23:43:02.285043
>
>
> On 6 January 2015 at 03:40, Christian Balzer <[email protected]> wrote:
>
>> On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:
>>
>> > Rebalancing is almost finished, but things got even worse:
>> > http://i.imgur.com/0HOPZil.png
>> >
>> Looking at that graph, only one OSD really kept growing and growing;
>> everything else seems a lot denser and less varied than before, as one
>> would have expected.
>>
>> Since I don't think you mentioned it before, what version of Ceph are you
>> using and how are your CRUSH tunables set?
>>
>
> I'm on 0.80.7 upgraded from 0.80.5. I didn't change CRUSH settings at all.
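
To double-check the tunables question, they can be read out of the decompiled
CRUSH map; a minimal sketch (the file paths are arbitrary examples):

```shell
# Grab the compiled CRUSH map from the monitors and decompile it;
# the "tunable ..." lines near the top show any non-default settings.
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
grep '^tunable' /tmp/crushmap.txt
```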
>
>> > Moreover, one pg is in active+remapped+wait_backfill+backfill_toofull
>> > state:
>> >
>> > 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs: 23
>> > active+remapped+wait_backfill, 1
>> > active+remapped+wait_backfill+backfill_toofull, 2
>> > active+remapped+backfilling, 5805 active+clean, 1
>> > active+remapped+backfill_toofull; 11210 GB data, 26174 GB used, 18360
>> > GB / 46906 GB avail; 65246/10590590 objects degraded (0.616%)
>> >
>> > So at 55.8% disk space utilization ceph is full. That doesn't look very
>> > good.
>> >
>> Indeed it doesn't.
>>
>> At this point you might want to manually lower the weight of that OSD
>> (you'll probably have to change the osd_backfill_full_ratio first to let
>> it settle).
>>
>
> I'm sure that's what ceph should do, not me.
>
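
For completeness, that suggestion would look roughly like the following. This
is a sketch under assumptions: osd.8 comes from the acting set in the pg dump
above, and 0.92/0.85 are example values, not recommendations.

```shell
# Assumption: osd.8 is the overfull OSD; the numbers are examples only.
# Raise the backfill-full threshold so the stuck backfill can proceed...
ceph tell 'osd.*' injectargs '--osd_backfill_full_ratio 0.92'
# ...then lower the reweight of the overfull OSD so PGs migrate off it.
ceph osd reweight 8 0.85
```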
>
>> Thanks to Robert for bringing up that blueprint for Hammer; let's hope
>> it makes it in and gets backported.
>>
>> I sure hope somebody from the Ceph team will pipe up, but here's what I
>> think is happening:
>> You're using radosgw and I suppose many files are so similarly named that
>> they wind up clumping on the same PGs (OSDs).
>>
>
> Nope, you are wrong here. PGs are roughly the same size; I mentioned that
> in my first email. Now the biggest osd has 95 PGs and the smallest one has
> 59 (I only counted PGs from the biggest pool).
>
>
>> Now what I would _think_ could help with that is striping.
>>
>> However radosgw doesn't support the full striping options as RBD does.
>>
>> The only thing you can modify is stripe (object) size, which defaults to
>> 4MB. And I bet most of your RGW files are smaller than that, meaning
>> they wind up on just one PG.
>>
>
> Wrong again: I use that cluster for elasticsearch backups and docker
> images. That stuff is usually much bigger than 4mb.
>
> Weird thing: I calculated OSD sizes from "ceph pg dump" and they look
> different from actual disk usage. The biggest OSD should be 213gb and the
> smallest 131gb. GC isn't finished yet, but that's still very different
> from what I currently see.
>
> # ceph pg dump | grep '^[0-9]\+\.' | awk '{ print $1, $6, $14 }' \
>     | sed 's/[][,]/ /g' > pgs.txt
> # cat pgs.txt \
>     | awk '{ sizes[$3] += $2; sizes[$4] += $2 } END { for (o in sizes) printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 / 1024 }' \
>     | sort -n
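>
> As a sanity check on those numbers, the min/max spread in the per-OSD
> totals can be computed from the same pgs.txt. A small sketch, assuming the
> file has the columns produced above (pgid, bytes, then the PG's two OSDs):

```shell
# Sum each PG's bytes into both of its OSDs (columns 3 and 4, as in the
# pipeline above) and report the smallest and biggest OSD plus the spread.
osd_spread() {
  awk '{ sizes[$3] += $2; sizes[$4] += $2 }
       END {
         min = -1; max = 0
         for (o in sizes) {
           if (min < 0 || sizes[o] < min) min = sizes[o]
           if (sizes[o] > max) max = sizes[o]
         }
         printf "min %.2f gb, max %.2f gb, spread %.1f%%\n",
                min / 1024 / 1024 / 1024, max / 1024 / 1024 / 1024,
                100 * (max - min) / min
       }' "$1"
}
```

On the figures quoted above (min 131gb, max 213gb) this would report a spread
of roughly 62%.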
>
> 0 198.18 gb
> 1 188.74 gb
> 2 165.94 gb
> 3 143.28 gb
> 4 193.37 gb
> 5 185.87 gb
> 6 146.46 gb
> 7 170.67 gb
> 8 213.93 gb
> 9 200.22 gb
> 10 144.05 gb
> 11 164.44 gb
> 12 158.27 gb
> 13 204.96 gb
> 14 190.04 gb
> 15 158.48 gb
> 16 172.86 gb
> 17 157.05 gb
> 18 179.82 gb
> 19 175.86 gb
> 20 192.63 gb
> 21 179.82 gb
> 22 181.30 gb
> 23 172.97 gb
> 24 141.21 gb
> 25 165.63 gb
> 26 139.87 gb
> 27 184.18 gb
> 28 160.75 gb
> 29 185.88 gb
> 30 186.13 gb
> 31 163.38 gb
> 32 182.92 gb
> 33 134.82 gb
> 34 186.56 gb
> 35 166.91 gb
> 36 163.49 gb
> 37 205.59 gb
> 38 199.26 gb
> 39 151.43 gb
> 40 173.23 gb
> 41 200.54 gb
> 42 198.07 gb
> 43 150.48 gb
> 44 165.54 gb
> 45 193.87 gb
> 46 177.05 gb
> 47 167.97 gb
> 48 186.68 gb
> 49 177.68 gb
> 50 204.94 gb
> 51 184.52 gb
> 52 160.11 gb
> 53 163.33 gb
> 54 137.28 gb
> 55 168.97 gb
> 56 193.08 gb
> 57 176.87 gb
> 58 166.36 gb
> 59 171.98 gb
> 60 175.50 gb
> 61 199.39 gb
> 62 175.31 gb
> 63 164.54 gb
> 64 171.26 gb
> 65 154.86 gb
> 66 166.39 gb
> 67 145.15 gb
> 68 162.55 gb
> 69 181.13 gb
> 70 181.18 gb
> 71 197.67 gb
> 72 164.79 gb
> 73 143.85 gb
> 74 169.17 gb
> 75 183.67 gb
> 76 143.16 gb
> 77 171.91 gb
> 78 167.75 gb
> 79 158.36 gb
> 80 198.83 gb
> 81 158.26 gb
> 82 182.52 gb
> 83 204.65 gb
> 84 179.78 gb
> 85 170.02 gb
> 86 185.70 gb
> 87 138.91 gb
> 88 190.66 gb
> 89 209.43 gb
> 90 193.54 gb
> 91 185.00 gb
> 92 170.31 gb
> 93 140.11 gb
> 94 161.69 gb
> 95 194.53 gb
> 96 184.35 gb
> 97 158.74 gb
> 98 184.39 gb
> 99 174.83 gb
> 100 183.30 gb
> 101 179.82 gb
> 102 160.84 gb
> 103 163.29 gb
> 104 131.92 gb
> 105 158.09 gb
>
>
>
>> Again, would love to hear something from the devs on this one.
>>
>> Christian
>> > On 5 January 2015 at 15:39, ivan babrou <[email protected]> wrote:
>> >
>> > >
>> > >
>> > > On 5 January 2015 at 14:20, Christian Balzer <[email protected]> wrote:
>> > >
>> > >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
>> > >>
>> > >> > Hi!
>> > >> >
>> > >> > I have a cluster with 106 osds and disk usage varies from 166gb
>> > >> > to 316gb. Disk usage is highly correlated with the number of PGs
>> > >> > per osd (no surprise here). Is there a reason for ceph to allocate
>> > >> > more PGs on some nodes?
>> > >> >
>> > >> In essence what Wido said, you're a bit low on PGs.
>> > >>
>> > >> Also, given your current utilization, pool 14 is totally oversized
>> > >> with 1024 PGs. You might want to re-create it with a smaller size
>> > >> and double pool 0 to 512 PGs and pool 10 to 4096.
>> > >> I assume you did raise the PGPs as well when changing the PGs, right?
>> > >>
>> > >
>> > > Yep, pg = pgp for all pools. Pool 14 is just for testing purposes; it
>> > > might get large eventually.
>> > >
>> > > I followed your advice in doubling pools 0 and 10. It is rebalancing
>> > > at 30% degraded now, but so far big osds become bigger and small
>> > > become smaller: http://i.imgur.com/hJcX9Us.png. I hope that trend
>> > > changes before rebalancing is complete.
>> > >
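
For reference, the doubling above is done per pool: pg_num has to be raised
first, and data only starts rebalancing once pgp_num is raised to match. A
sketch with a hypothetical pool name:

```shell
# ".rgw.buckets" is a hypothetical pool name here; substitute your own.
ceph osd pool set .rgw.buckets pg_num 4096
ceph osd pool set .rgw.buckets pgp_num 4096
```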
>> > >
>> > >> And yeah, Ceph isn't particularly good at balancing stuff by itself,
>> > >> but with sufficient PGs you ought to get the variance below/around
>> > >> 30%.
>> > >>
>> > >
>> > > Is this going to change in the future releases?
>> > >
>> > >
>> > >> Christian
>> > >>
>> > >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest
>> > >> > are 87, 33 and 55 (170gb each). The biggest pool has 2048 pgs;
>> > >> > pools with very little data have only 8 pgs. PG size in the
>> > >> > biggest pool is ~6gb (5.1..6.3 actually).
>> > >> >
>> > >> > Lack of balanced disk usage prevents me from using all the disk
>> > >> > space. When the biggest osd is full, the cluster does not accept
>> > >> > writes anymore.
>> > >> >
>> > >> > Here's gist with info about my cluster:
>> > >> > https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
>> > >> >
>> > >>
>> > >>
>> > >> --
>> > >> Christian Balzer        Network/Systems Engineer
>> > >> [email protected]           Global OnLine Japan/Fusion Communications
>> > >> http://www.gol.com/
>> > >>
>> > >
>> > >
>> > >
>> > > --
>> > > Regards, Ian Babrou
>> > > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
>> > >
>> >
>> >
>> >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> [email protected]           Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
>>
>
>
>
> --
> Regards, Ian Babrou
> http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
>



-- 
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
