Hello,
On Tue, 09 Sep 2014 01:25:17 -0400 JR wrote:
> Greetings
>
> After running for a couple of hours, my attempt to re-balance a near full
> disk has stopped with a stuck unclean error:
>
Which is exactly what I warned you about below and what you should have
also taken away from fully reading the "Uneven OSD usage" thread.
This should also hammer home my previous point about your current
cluster size/utilization. Even with a better (don't expect perfect) data
distribution, the loss of one node might well leave you with a full OSD
again.
> root@osd45:~# ceph -s
> cluster c8122868-27af-11e4-b570-52540004010f
> health HEALTH_WARN 6 pgs backfilling; 6 pgs stuck unclean; recovery
> 13086/1158268 degraded (1.130%)
> monmap e1: 3 mons at
> {osd42=10.7.7.142:6789/0,osd43=10.7.7.143:6789/0,osd45=10.7.7.145:6789/0},
> election epoch 80, quorum 0,1,2 osd42,osd43,osd45
> osdmap e723: 8 osds: 8 up, 8 in
> pgmap v543113: 640 pgs: 634 active+clean, 6
> active+remapped+backfilling; 2222 GB data, 2239 GB used, 1295 GB / 3535
> GB avail; 8268B/s wr, 0op/s; 13086/1158268 degraded (1.130%)
> mdsmap e63: 1/1/1 up {0=osd42=up:active}, 3 up:standby
>
> From what I've read in the past the way forward here is to increase the
> full ratio setting so it can finish the recovery.
Or add more OSDs, at least temporarily. See:
http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
Read that and apply that knowledge to your cluster; I personally wouldn't
deploy it in this state.
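If you do go the ratio route, something along these lines (dumpling-era
syntax from memory, double-check it against the page above) should let
the backfill finish; just remember to put the defaults back afterwards:

ceph pg set_nearfull_ratio 0.90
ceph pg set_full_ratio 0.97
# wait for the backfilling PGs to go active+clean, then restore the defaults
ceph pg set_full_ratio 0.95
ceph pg set_nearfull_ratio 0.85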
Once the recovery is finished I'd proceed cautiously, see below.
>
> The sequence of events today that led to this were:
>
> # starting state: pg_num/pgp_num == 64
> ceph osd pool set rbd pg_num 128
> ceph osd pool set rbd pgp_num 128
> # there was a warning thrown up (which I've lost) and which left pgp_num
> == 64
> # nothing happens since pgp_num was inadvertently not raised
> ceph osd reweight-by-utilization
> # data moves from one osd on a host to another osd on same host
> ceph osd reweight 7 1
> # data moves back to roughly what it had been
Never mind the lack of PGs to play with; manually lowering the weight
of the fullest OSD (in small steps) at this time might have given you at
least a more level playing field.
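To make "small steps" concrete, with osd.7 as the example since it was
your fullest one, I mean something like:

ceph osd reweight 7 0.95
# let the resulting data movement finish, check df and "ceph -s",
# then take another small step only if it is still too full
ceph osd reweight 7 0.90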
> ceph osd pool set volumes pg_num 192
> ceph osd pool set volumes pgp_num 192
> # data moves successfully
This would have been the time to check what actually happened and if
things improved or not (just adding PGs/PGPs might not be enough) and
again to manually reweight overly full OSDs.
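The df loop you already posted plus the tree output is all the checking
I mean here, e.g.:

for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep 'Filesystem|osd/ceph'; done
ceph osd tree
ceph -s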
> ceph osd pool set rbd pg_num 192
> ceph osd pool set rbd pgp_num 192
> # data stuck
>
Baby steps. As in, applying the rise to 128 PGPs first.
But I guess you would have run into the full OSD either way w/o
reweighting things between steps.
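In command terms, the safer order would have been something like:

# first bring pgp_num up to the pg_num you had already raised
ceph osd pool set rbd pgp_num 128
# wait until everything is active+clean again, then take the next step
ceph osd pool set rbd pg_num 192
ceph osd pool set rbd pgp_num 192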
> googling (nowadays known as research) reveals that these might be
> helpful:
>
> - ceph osd crush tunables optimal
Yes, this might help.
Not sure if that works with dumpling, but as I already mentioned, dumpling
doesn't support "chooseleaf_vary_r". Or hashpspool.
And while the data movement caused by this will probably result in a
better balanced cluster (again, with too few PGs it will still do
poorly), in the process of getting there it might still run into a full
OSD scenario.
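If you want to see what the cluster currently uses before flipping that
switch, decompiling the crush map (from memory, so verify) lists any
non-default tunables near the top:

ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
head /tmp/crushmap.txt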
> - setting crush weights to 1
>
Dunno about that one. My crush weights were 1 when I deployed things
manually the first time, and the size of the OSD for the 2nd manual
deployment; ceph-deploy also uses the OSD size in TB.
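For clarity, since the two get mixed up easily: the 0.43 values in your
tree are crush weights (the OSD size in TB, which is what ceph-deploy
set), changed with "ceph osd crush reweight", while the 0.7007 you had
on osd.7 is the 0-1 override set by "ceph osd reweight" and
reweight-by-utilization. Roughly:

# crush weight, in TB (the 0.43 column in "ceph osd tree")
ceph osd crush reweight osd.7 0.43
# temporary 0-1 override (the rightmost column, the 0.7007 in your tree)
ceph osd reweight 7 1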
Christian
> I resist doing anything for now in the hopes that someone has something
> coherent to say (Christian? ;-)
>
> Thanks
> JR
>
>
> On 9/8/2014 10:37 PM, JR wrote:
> > Hi Christian,
> >
> > Ha ...
> >
> > root@osd45:~# ceph osd pool get rbd pg_num
> > pg_num: 128
> > root@osd45:~# ceph osd pool get rbd pgp_num
> > pgp_num: 64
> >
> > That's the explanation! I did run the command but it spit out a
> > warning (which I thought was harmless); I should have checked more
> > carefully.
> >
> > I now have the expected data movement.
> >
> > Thanks a lot!
> > JR
> >
> > On 9/8/2014 10:04 PM, Christian Balzer wrote:
> >>
> >> Hello,
> >>
> >> On Mon, 08 Sep 2014 18:30:07 -0400 JR wrote:
> >>
> >>> Hi Christian, all,
> >>>
> >>> Having researched this a bit more, it seemed that just doing
> >>>
> >>> ceph osd pool set rbd pg_num 128
> >>> ceph osd pool set rbd pgp_num 128
> >>>
> >>> might be the answer. Alas, it was not. After running the above the
> >>> cluster just sat there.
> >>>
> >> Really now? No data movement, no health warnings during that in the
> >> logs, no other error in the logs or when issuing that command?
> >> Is it really at 128 now, verified with "ceph osd pool get rbd pg_num"?
> >>
> >> You really want to get this addressed as per the previous reply before
> >> doing anything further. Because with just 64 PGs (as in only 8 per
> >> OSD!) massive imbalances are a given.
> >>
> >>> Finally, reading some more, I ran:
> >>>
> >>> ceph osd reweight-by-utilization
> >>>
> >> Reading can be dangerous. ^o^
> >>
> >> I didn't mention this, as it never worked for me in any predictable
> >> way and with a desirable outcome, especially in situations like yours.
> >>
> >>> This accomplished moving the utilization of the first drive on the
> >>> affected node to the 2nd drive! E.g.:
> >>>
> >>> -------
> >>> BEFORE RUNNING:
> >>> -------
> >>> Filesystem Use%
> >>> /dev/sdc1 57%
> >>> /dev/sdb1 65%
> >>> Filesystem Use%
> >>> /dev/sdc1 90%
> >>> /dev/sdb1 75%
> >>> Filesystem Use%
> >>> /dev/sdb1 52%
> >>> /dev/sdc1 52%
> >>> Filesystem Use%
> >>> /dev/sdc1 54%
> >>> /dev/sdb1 63%
> >>>
> >>> -------
> >>> AFTER RUNNING:
> >>> -------
> >>> Filesystem Use%
> >>> /dev/sdc1 57%
> >>> /dev/sdb1 65%
> >>> Filesystem Use%
> >>> /dev/sdc1 70% ** these two swapped (roughly) **
> >>> /dev/sdb1 92% ** ^^^^^ ^^^ ^^^^^^^ **
> >>> Filesystem Use%
> >>> /dev/sdb1 52%
> >>> /dev/sdc1 52%
> >>> Filesystem Use%
> >>> /dev/sdc1 54%
> >>> /dev/sdb1 63%
> >>>
> >>> root@osd45:~# ceph osd tree
> >>> # id weight type name up/down reweight
> >>> -1 3.44 root default
> >>> -2 0.86 host osd45
> >>> 0 0.43 osd.0 up 1
> >>> 4 0.43 osd.4 up 1
> >>> -3 0.86 host osd42
> >>> 1 0.43 osd.1 up 1
> >>> 5 0.43 osd.5 up 1
> >>> -4 0.86 host osd44
> >>> 2 0.43 osd.2 up 1
> >>> 6 0.43 osd.6 up 1
> >>> -5 0.86 host osd43
> >>> 3 0.43 osd.3 up 1
> >>> 7 0.43 osd.7 up 0.7007
> >>>
> >>> So this isn't the answer either.
> >>>
> >> It might have been, if it had more PGs to distribute things along, see
> >> above. But even then with the default dumpling tunables it might not
> >> be much better.
> >>
> >>> Could someone please chime in with an explanation/suggestion?
> >>>
> >>> I suspect it might make sense to use 'ceph osd reweight osd.7 1'
> >>> and then run some form of 'ceph osd crush ...'?
> >>>
> >> No need to crush anything; reweight it to 1 after adding PGs/PGPs and,
> >> after all that data movement has finished, slowly dial down any still
> >> overly utilized OSD.
> >>
> >> Also per the "Uneven OSD usage" thread, you might run into a "full"
> >> situation during data re-distribution. Increase PGs in small (64)
> >> increments.
> >>
> >>> Of course, I've read a number of things which suggest that the two
> >>> things I've done should have fixed my problem.
> >>>
> >>> Is it (gasp!) possible that this, as Christian suggests, is a
> >>> dumpling issue and, were I running on firefly, it would be
> >>> sufficient?
> >>>
> >> Running Firefly with all the tunables and probably hashpspool.
> >> Most of the tunables, with the exception of "chooseleaf_vary_r", are
> >> available on dumpling; hashpspool isn't, AFAIK.
> >> See http://ceph.com/docs/master/rados/operations/crush-map/#tunables
> >>
> >> Christian
> >>>
> >>> Thanks much
> >>> JR
> >>> On 9/8/2014 1:50 PM, JR wrote:
> >>>> Hi Christian,
> >>>>
> >>>> I have 448 PGs and 448 PGPs (according to ceph -s).
> >>>>
> >>>> This seems borne out by:
> >>>>
> >>>> root@osd45:~# rados lspools
> >>>> data
> >>>> metadata
> >>>> rbd
> >>>> volumes
> >>>> images
> >>>> root@osd45:~# for i in $(rados lspools); do echo "$i pg($(ceph osd
> >>>> pool get $i pg_num), pgp$(ceph osd pool get $i pg_num)"; done
> >>>> data pg(pg_num: 64, pgppg_num: 64
> >>>> metadata pg(pg_num: 64, pgppg_num: 64
> >>>> rbd pg(pg_num: 64, pgppg_num: 64
> >>>> volumes pg(pg_num: 128, pgppg_num: 128
> >>>> images pg(pg_num: 128, pgppg_num: 128
> >>>>
> >>>> According to the formula discussed in 'Uneven OSD usage,'
> >>>>
> >>>> "The formula is actually OSDs * 100 / replication
> >>>>
> >>>> in my case:
> >>>>
> >>>> 8*100/2=400
> >>>>
> >>>> So I'm erring on the large side?
> >>>>
> >>>> Or does this formula apply on a per-pool basis? Of my 5 pools I'm
> >>>> using 3:
> >>>>
> >>>> root@osd45:~# rados df|cut -c1-45
> >>>> pool name category KB
> >>>> data - 0
> >>>> images - 0
> >>>> metadata - 10
> >>>> rbd - 568489533
> >>>> volumes - 594078601
> >>>> total used 2326235048 285923
> >>>> total avail 1380814968
> >>>> total space 3707050016
> >>>>
> >>>> So should I up the number of PGs for the rbd and volumes pools?
> >>>>
> >>>> I'll continue looking at docs, but for now I'll send this off.
> >>>>
> >>>> Thanks very much, Christian.
> >>>>
> >>>> ps. This cluster is self-contained and all nodes in it are
> >>>> completely loaded (i.e., I can't add any more nodes nor disks).
> >>>> It's also not an option at the moment to upgrade to firefly (can't
> >>>> make a big change before sending it out the door).
> >>>>
> >>>>
> >>>>
> >>>> On 9/8/2014 12:09 PM, Christian Balzer wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> On Mon, 08 Sep 2014 11:42:59 -0400 JR wrote:
> >>>>>
> >>>>>> Greetings all,
> >>>>>>
> >>>>>> I have a small ceph cluster (4 nodes, 2 osds per node) which
> >>>>>> recently started showing:
> >>>>>>
> >>>>>> root@ocd45:~# ceph health
> >>>>>> HEALTH_WARN 1 near full osd(s)
> >>>>>>
> >>>>>> admin@node4:~$ for i in 2 3 4 5; do sudo ssh osd4$i df -h |egrep
> >>>>>> 'Filesystem|osd/ceph'; done
> >>>>>> Filesystem Size Used Avail Use% Mounted on
> >>>>>> /dev/sdc1 442G 249G 194G 57% /var/lib/ceph/osd/ceph-5
> >>>>>> /dev/sdb1 442G 287G 156G 65% /var/lib/ceph/osd/ceph-1
> >>>>>> Filesystem Size Used Avail Use% Mounted on
> >>>>>> /dev/sdc1 442G 396G 47G 90% /var/lib/ceph/osd/ceph-7
> >>>>>> /dev/sdb1 442G 316G 127G 72% /var/lib/ceph/osd/ceph-3
> >>>>>> Filesystem Size Used Avail Use% Mounted on
> >>>>>> /dev/sdb1 442G 229G 214G 52% /var/lib/ceph/osd/ceph-2
> >>>>>> /dev/sdc1 442G 229G 214G 52% /var/lib/ceph/osd/ceph-6
> >>>>>> Filesystem Size Used Avail Use% Mounted on
> >>>>>> /dev/sdc1 442G 238G 205G 54% /var/lib/ceph/osd/ceph-4
> >>>>>> /dev/sdb1 442G 278G 165G 63% /var/lib/ceph/osd/ceph-0
> >>>>>>
> >>>>>>
> >>>>> See the very recent "Uneven OSD usage" for a discussion about this.
> >>>>> What are your PG/PGP values?
> >>>>>
> >>>>>> This cluster has been running for weeks, under significant load,
> >>>>>> and has been 100% stable. Unfortunately we have to ship it out of
> >>>>>> the building to another part of our business (where we will have
> >>>>>> little access to it).
> >>>>>>
> >>>>>> Based on what I've read about 'ceph osd reweight' I'm a bit
> >>>>>> hesitant to just run it (I don't want to do anything that impacts
> >>>>>> this cluster's stability).
> >>>>>>
> >>>>>> Is there another, better way to equalize the distribution of the
> >>>>>> data on the OSD partitions?
> >>>>>>
> >>>>>> I'm running dumpling.
> >>>>>>
> >>>>> As per the thread and my experience, Firefly would solve this. If
> >>>>> you can upgrade during a weekend or whenever there is little to no
> >>>>> access, do it.
> >>>>>
> >>>>> Another option (of course any and all of these will result in data
> >>>>> movement, so pick an appropriate time) would be to use "ceph osd
> >>>>> reweight" to lower the weight of osd.7 in particular.
> >>>>>
> >>>>> Lastly, given the utilization of your cluster, you really ought to
> >>>>> deploy more OSDs and/or more nodes; if a node were to go down you'd
> >>>>> easily get into a "real" near full or full situation.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Christian
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>
--
Christian Balzer Network/Systems Engineer
[email protected] Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com