Restarting OSD fixed PGs that were stuck: http://i.imgur.com/qd5vuzV.png
Still OSD disk usage is very different, 150..250gb. Shall I double PGs
again?

On 6 January 2015 at 17:12, ivan babrou <[email protected]> wrote:

> I deleted some old backups and GC is returning some disk space back. But
> cluster state is still bad:
>
> 2015-01-06 13:35:54.102493 mon.0 [INF] pgmap v4017947: 5832 pgs: 23
> active+remapped+wait_backfill, 1
> active+remapped+wait_backfill+backfill_toofull, 2
> active+remapped+backfilling, 5806 active+clean; 9453 GB data, 22784 GB
> used, 21750 GB / 46906 GB avail; 0 B/s wr, 78 op/s; 47275/8940623 objects
> degraded (0.529%)
>
> Here's how disk utilization across OSDs looks:
> http://i.imgur.com/RWk9rvW.png
>
> Still one OSD is super-huge. I don't understand why one PG is toofull if
> the biggest OSD moved from 348gb to 294gb.
>
> root@51f2dde75901:~# ceph pg dump | grep '^[0-9]\+\.' | fgrep full
> dumped all in format plain
> 10.f26 1018 0 1811 0 2321324247 3261 3261
> active+remapped+wait_backfill+backfill_toofull 2015-01-05 15:06:49.504731
> 22897'359132 22897:48571 [91,1] 91 [8,40] 8 19248'358872 2015-01-05
> 11:58:03.062029 18326'358786 2014-12-31 23:43:02.285043
>
>
> On 6 January 2015 at 03:40, Christian Balzer <[email protected]> wrote:
>
>> On Mon, 5 Jan 2015 23:41:17 +0400 ivan babrou wrote:
>>
>> > Rebalancing is almost finished, but things got even worse:
>> > http://i.imgur.com/0HOPZil.png
>> >
>> Looking at that graph, only one OSD really kept growing and growing;
>> everything else seems to be a lot denser, less varied than before, as one
>> would have expected.
>>
>> Since I don't think you mentioned it before, what version of Ceph are you
>> using and how are your CRUSH tunables set?
>>
>
> I'm on 0.80.7, upgraded from 0.80.5. I didn't change CRUSH settings at all.
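As a sanity check on "shall I double PGs again?": the commonly cited rule of thumb from the Ceph placement-group docs is total PGs ~= (OSDs * 100) / replicas, rounded up to the next power of two. Here is a sketch of that arithmetic; the 106 OSDs are from this thread, while the replica count of 2 is an assumption based on the two-OSD acting sets shown above:

```shell
# Rule-of-thumb PG count (from the Ceph placement-group docs):
# total PGs across pools ~= (OSDs * 100) / replicas, rounded up
# to the next power of two.
osds=106        # from this thread
replicas=2      # assumed from the two-OSD acting sets above
target=$(( osds * 100 / replicas ))

# Round up to the next power of two.
pow2=1
while [ "$pow2" -lt "$target" ]; do pow2=$(( pow2 * 2 )); done

echo "target=$target next_pow2=$pow2"  # -> target=5300 next_pow2=8192
```

With 5832 PGs in the cluster now, one more doubling of the biggest pool would land in roughly that range, so it doesn't look unreasonable on paper.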
>> > Moreover, one pg is in active+remapped+wait_backfill+backfill_toofull
>> > state:
>> >
>> > 2015-01-05 19:39:31.995665 mon.0 [INF] pgmap v3979616: 5832 pgs: 23
>> > active+remapped+wait_backfill, 1
>> > active+remapped+wait_backfill+backfill_toofull, 2
>> > active+remapped+backfilling, 5805 active+clean, 1
>> > active+remapped+backfill_toofull; 11210 GB data, 26174 GB used, 18360
>> > GB / 46906 GB avail; 65246/10590590 objects degraded (0.616%)
>> >
>> > So at 55.8% disk space utilization ceph is full. That doesn't look very
>> > good.
>> >
>> Indeed it doesn't.
>>
>> At this point you might want to manually lower the weight of that OSD
>> (you probably have to change the osd_backfill_full_ratio first to let it
>> settle).
>>
>
> I'm sure that's what ceph should do, not me.
>
>
>> Thanks to Robert for bringing up that blueprint for Hammer; let's hope
>> it makes it in and gets backported.
>>
>> I sure hope somebody from the Ceph team will pipe up, but here's what I
>> think is happening:
>> you're using radosgw and I suppose many files are so similarly named
>> that they wind up clumping on the same PGs (OSDs).
>>
>
> Nope, you are wrong here. PGs have roughly the same size, I mentioned that
> in my first email. Now the biggest OSD has 95 PGs and the smallest one has
> 59 (I only counted PGs from the biggest pool).
>
>
>> Now what I would _think_ could help with that is striping.
>>
>> However, radosgw doesn't support the full striping options that RBD does.
>>
>> The only thing you can modify is stripe (object) size, which defaults to
>> 4MB. And I bet most of your RGW files are less than that in size, meaning
>> they wind up on just one PG.
>>
>
> Wrong again: I use that cluster for elasticsearch backups and docker
> images. That stuff is usually much bigger than 4mb.
>
> Weird thing: I calculated OSD sizes from "ceph pg dump" and they look
> different from what really happens. The biggest OSD is 213gb and the
> smallest is 131gb.
> GC isn't finished yet, but that seems very different from what currently
> happens.
>
> # ceph pg dump | grep '^[0-9]\+\.' | awk '{ print $1, $6, $14 }' | sed
> 's/[][,]/ /g' > pgs.txt
> # cat pgs.txt | awk '{ sizes[$3] += $2; sizes[$4] += $2; } END { for (o in
> sizes) { printf "%d %.2f gb\n", o, sizes[o] / 1024 / 1024 / 1024; } }' |
> sort -n
>
> 0 198.18 gb
> 1 188.74 gb
> 2 165.94 gb
> 3 143.28 gb
> 4 193.37 gb
> 5 185.87 gb
> 6 146.46 gb
> 7 170.67 gb
> 8 213.93 gb
> 9 200.22 gb
> 10 144.05 gb
> 11 164.44 gb
> 12 158.27 gb
> 13 204.96 gb
> 14 190.04 gb
> 15 158.48 gb
> 16 172.86 gb
> 17 157.05 gb
> 18 179.82 gb
> 19 175.86 gb
> 20 192.63 gb
> 21 179.82 gb
> 22 181.30 gb
> 23 172.97 gb
> 24 141.21 gb
> 25 165.63 gb
> 26 139.87 gb
> 27 184.18 gb
> 28 160.75 gb
> 29 185.88 gb
> 30 186.13 gb
> 31 163.38 gb
> 32 182.92 gb
> 33 134.82 gb
> 34 186.56 gb
> 35 166.91 gb
> 36 163.49 gb
> 37 205.59 gb
> 38 199.26 gb
> 39 151.43 gb
> 40 173.23 gb
> 41 200.54 gb
> 42 198.07 gb
> 43 150.48 gb
> 44 165.54 gb
> 45 193.87 gb
> 46 177.05 gb
> 47 167.97 gb
> 48 186.68 gb
> 49 177.68 gb
> 50 204.94 gb
> 51 184.52 gb
> 52 160.11 gb
> 53 163.33 gb
> 54 137.28 gb
> 55 168.97 gb
> 56 193.08 gb
> 57 176.87 gb
> 58 166.36 gb
> 59 171.98 gb
> 60 175.50 gb
> 61 199.39 gb
> 62 175.31 gb
> 63 164.54 gb
> 64 171.26 gb
> 65 154.86 gb
> 66 166.39 gb
> 67 145.15 gb
> 68 162.55 gb
> 69 181.13 gb
> 70 181.18 gb
> 71 197.67 gb
> 72 164.79 gb
> 73 143.85 gb
> 74 169.17 gb
> 75 183.67 gb
> 76 143.16 gb
> 77 171.91 gb
> 78 167.75 gb
> 79 158.36 gb
> 80 198.83 gb
> 81 158.26 gb
> 82 182.52 gb
> 83 204.65 gb
> 84 179.78 gb
> 85 170.02 gb
> 86 185.70 gb
> 87 138.91 gb
> 88 190.66 gb
> 89 209.43 gb
> 90 193.54 gb
> 91 185.00 gb
> 92 170.31 gb
> 93 140.11 gb
> 94 161.69 gb
> 95 194.53 gb
> 96 184.35 gb
> 97 158.74 gb
> 98 184.39 gb
> 99 174.83 gb
> 100 183.30 gb
> 101 179.82 gb
> 102 160.84 gb
> 103 163.29 gb
> 104 131.92 gb
> 105 158.09 gb
>
>
>> Again, would love to hear something from the devs on this one.
>>
>> Christian
>>
>> > On 5 January 2015 at 15:39, ivan babrou <[email protected]> wrote:
>> >
>> > >
>> > >
>> > > On 5 January 2015 at 14:20, Christian Balzer <[email protected]> wrote:
>> > >
>> > >> On Mon, 5 Jan 2015 14:04:28 +0400 ivan babrou wrote:
>> > >>
>> > >> > Hi!
>> > >> >
>> > >> > I have a cluster with 106 osds and disk usage is varying from 166gb
>> > >> > to 316gb. Disk usage is highly correlated to the number of PGs per
>> > >> > OSD (no surprise here). Is there a reason for ceph to allocate more
>> > >> > PGs on some nodes?
>> > >> >
>> > >> In essence what Wido said: you're a bit low on PGs.
>> > >>
>> > >> Also, given your current utilization, pool 14 is totally oversized
>> > >> with 1024 PGs. You might want to re-create it with a smaller size and
>> > >> double pool 0 to 512 PGs and 10 to 4096.
>> > >> I assume you did raise the PGPs as well when changing the PGs, right?
>> > >>
>> > >
>> > > Yep, pg = pgp for all pools. Pool 14 is just for testing purposes; it
>> > > might get large eventually.
>> > >
>> > > I followed your advice in doubling pools 0 and 10. It is rebalancing
>> > > at 30% degraded now, but so far big OSDs become bigger and small
>> > > become smaller: http://i.imgur.com/hJcX9Us.png. I hope that trend will
>> > > change before rebalancing is complete.
>> > >
>> > >
>> > >> And yeah, CEPH isn't particularly good at balancing stuff by itself,
>> > >> but with sufficient PGs you ought to get the variance below/around
>> > >> 30%.
>> > >>
>> > >
>> > > Is this going to change in future releases?
>> > >
>> > >
>> > >> Christian
>> > >>
>> > >> > The biggest osds are 30, 42 and 69 (300gb+ each) and the smallest
>> > >> > are 87, 33 and 55 (170gb each). The biggest pool has 2048 pgs, and
>> > >> > pools with very little data have only 8 pgs. PG size in the biggest
>> > >> > pool is ~6gb (5.1..6.3 actually).
>> > >> >
>> > >> > Lack of balanced disk usage prevents me from using all the disk
>> > >> > space. When the biggest osd is full, the cluster does not accept
>> > >> > writes anymore.
>> > >> >
>> > >> > Here's a gist with info about my cluster:
>> > >> > https://gist.github.com/bobrik/fb8ad1d7c38de0ff35ae
>> > >> >
>> > >>
>> > >>
>> > >> --
>> > >> Christian Balzer        Network/Systems Engineer
>> > >> [email protected]   Global OnLine Japan/Fusion Communications
>> > >> http://www.gol.com/
>> > >>
>> > >
>> > >
>> > >
>> > > --
>> > > Regards, Ian Babrou
>> > > http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
>> > >
>> >
>> >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> [email protected]   Global OnLine Japan/Fusion Communications
>> http://www.gol.com/
>>
>
>
>
> --
> Regards, Ian Babrou
> http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
>

--
Regards, Ian Babrou
http://bobrik.name http://twitter.com/ibobrik skype:i.babrou
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
