Thanks a lot Robert. I have actually already tried the following:
a) Set one OSD to out (6% of data misplaced, Ceph recovered fine), stopped the OSD, removed the OSD from the crush map (again 36% of data misplaced !!!) - then inserted the OSD back into the crush map, and those 36% of misplaced objects disappeared, of course - I've undone the crush remove... so damage undone - the OSD is just "out" and the cluster is healthy again.

b) Set norecover and nobackfill, and then:
- Removed one OSD from crush (the running OSD, not the one from point a) - only 18% of data misplaced !!! (no recovery was happening though, because of norecover, nobackfill)
- Removed another OSD from the same node - a total of only 20% of objects misplaced (with 2 OSDs on the same node removed from the crush map)
- So these 2 OSDs were still running, UP and IN, and I just removed them from the crush map, per the advice to avoid calculating the CRUSH map twice - from: http://image.slidesharecdn.com/scalingcephatcern-140311134847-phpapp01/95/scaling-ceph-at-cern-ceph-day-frankfurt-19-638.jpg?cb=1394564547
- And I added these 2 OSDs back to the crush map; this was just a test...

So the algorithm is very funny in some respects, but it's all pseudo-random stuff so I kind of understand... I will share my findings during the rest of the OSD demotion, after I demote them...

Thanks for your detailed inputs !
Andrija

On 5 March 2015 at 22:51, Robert LeBlanc <rob...@leblancnet.us> wrote:

> Setting an OSD out will start the rebalance with the degraded object
> count. The OSD is still alive and can participate in the relocation of the
> objects. This is preferable so that you don't happen to drop below
> min_size because a disk fails during the rebalance, at which point I/O
> stops on the cluster.
>
> Because CRUSH is an algorithm, anything that changes its input will cause
> a change in the output (location). When you set an OSD out or it fails,
> that changes the CRUSH input, but the host and the weight of the host are
> still in effect. When you remove the host or change the weight of the host
> (by removing a single OSD), it makes a change to the algorithm's input
> which will also cause some changes in how it computes the locations.
>
> Disclaimer - I have not tried this
>
> It may be possible to minimize the data movement by doing the following:
>
>    1. Set norecover and nobackfill on the cluster
>    2. Set the OSDs to be removed to "out"
>    3. Adjust the weight of the hosts in the CRUSH map (if removing all
>    OSDs for a host, set it to zero)
>    4. If you have new OSDs to add, add them into the cluster now
>    5. Once all OSD changes have been entered, unset norecover and
>    nobackfill
>    6. This will migrate the data off the old OSDs and onto the new OSDs
>    in one swoop.
>    7. Once the data migration is complete, set norecover and nobackfill
>    on the cluster again.
>    8. Remove the old OSDs
>    9. Unset norecover and nobackfill
>
> The theory is that by setting the host weights to 0, removing the
> OSDs/hosts later should minimize the data movement afterwards, because the
> algorithm should have already dropped them as candidates for placement.
>
> If this works right, then you basically queue up a bunch of small changes,
> do one data movement, always keep all copies of your objects online, and
> minimize the impact of the data movement by leveraging both your old and
> new hardware at the same time.
>
> If you try this, please report back on your experience. I might try it
> in my lab, but I'm really busy at the moment so I don't know if I'll get
> to it real soon.
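As a rough command-line sketch of the queued-change procedure above (not something Robert spelled out in the thread): the OSD ids osd.0/osd.1 and the assumption that one whole host is being drained are placeholders.

    # 1. pause data movement while the changes are queued up
    ceph osd set norecover
    ceph osd set nobackfill
    # 2. mark the OSDs that will be removed as out
    ceph osd out 0
    ceph osd out 1
    # 3. take the old host's weight out of the CRUSH calculation by
    #    reweighting each of its OSDs to 0 (the host bucket weight follows)
    ceph osd crush reweight osd.0 0
    ceph osd crush reweight osd.1 0
    # 4. add/create any new OSDs now with your usual tooling, then
    # 5./6. let the single large data movement run
    ceph osd unset nobackfill
    ceph osd unset norecover
    # 7.-9. once migration completes, pause again, drop the old OSDs, unpause
    ceph osd set norecover && ceph osd set nobackfill
    ceph osd crush remove osd.0 && ceph auth del osd.0 && ceph osd rm 0
    ceph osd crush remove osd.1 && ceph auth del osd.1 && ceph osd rm 1
    ceph osd unset nobackfill && ceph osd unset norecover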
>
> On Thu, Mar 5, 2015 at 12:53 PM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>
>> Hi Robert,
>>
>> it seems I have not listened well to your advice - I set the osd to out
>> instead of stopping it - and now, instead of some ~3% of degraded objects,
>> there is 0.000% degraded and around 6% misplaced - and rebalancing is
>> happening again, but this is a small percentage..
>>
>> Do you know if later, when I remove this OSD from the crush map, no more
>> data will be rebalanced (as per the official Ceph documentation) - since the
>> already misplaced objects are getting distributed away to all the other nodes ?
>>
>> (after "service ceph stop osd.0" there was 2.45% degraded data - but no
>> backfilling was happening for some reason... it just stayed degraded... which
>> is the reason why I started the OSD back up and then set it to out...)
>>
>> Thanks
>>
>> On 4 March 2015 at 17:54, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>
>>> Hi Robert,
>>>
>>> I already have this stuff set. Ceph is 0.87.0 now...
>>>
>>> Thanks, will schedule this for the weekend; with the 10G network and 36 OSDs
>>> it should move the data in less than 8h, per my last experience which was
>>> around 8h, but some 1G OSDs were included...
>>>
>>> Thx!
>>>
>>> On 4 March 2015 at 17:49, Robert LeBlanc <rob...@leblancnet.us> wrote:
>>>
>>>> You will most likely have a very high relocation percentage. Backfills
>>>> are always more impactful on smaller clusters, but "osd max backfills"
>>>> should be what you need to help reduce the impact. The default is 10;
>>>> you will want to use 1.
>>>>
>>>> I didn't catch which version of Ceph you are running, but I think
>>>> there was some priority work done in firefly to help make backfills
>>>> lower priority. I think it has gotten better in later versions.
>>>>
>>>> On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>> > Thank you Robert - I'm wondering, when I do remove a total of 7 OSDs
>>>> > from the crush map, whether that will cause more than 37% of the data
>>>> > to be moved (80% or whatever).
>>>> >
>>>> > I'm also wondering if the throttling that I applied is fine or not - I
>>>> > will introduce the osd_recovery_delay_start of 10 sec as Irek said.
>>>> >
>>>> > I'm just wondering how much the performance impact will be, because:
>>>> > - when stopping an OSD, the impact while backfilling was fine, more or
>>>> > less - I can live with this
>>>> > - when I removed an OSD from the crush map - for the first 1h or so the
>>>> > impact was tremendous, and later on during the recovery process the
>>>> > impact was much less, but still noticeable...
>>>> >
>>>> > Thanks for the tip of course !
>>>> > Andrija
>>>> >
>>>> > On 3 March 2015 at 18:34, Robert LeBlanc <rob...@leblancnet.us> wrote:
>>>> >>
>>>> >> I would be inclined to shut down both OSDs in a node and let the
>>>> >> cluster recover. Once it is recovered, shut down the next two, let it
>>>> >> recover. Repeat until all the OSDs are taken out of the cluster. Then
>>>> >> I would set nobackfill and norecover. Then remove the hosts/disks from
>>>> >> the CRUSH map, then unset nobackfill and norecover.
>>>> >>
>>>> >> That should give you a few small changes (when you shut down OSDs) and
>>>> >> then one big one to get everything into its final place. If you are
>>>> >> still adding new nodes, you can add them in while nobackfill and
>>>> >> norecover are set, so that the one big relocate fills the new drives too.
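As a sketch of the throttling discussed above (the values are only illustrative), both settings can be injected into the running OSDs without a restart:

    # limit concurrent backfills per OSD (the default is 10)
    ceph tell osd.* injectargs '--osd_max_backfills 1'
    # wait 10 seconds after peering before recovery starts, as Irek suggests below
    ceph tell osd.* injectargs '--osd_recovery_delay_start 10'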
>>>> >>
>>>> >> On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>> >> > Thx Irek. The number of replicas is 3.
>>>> >> >
>>>> >> > I have 3 servers with 2 OSDs each on a 1G switch (1 OSD already
>>>> >> > decommissioned), which is further connected to a new 10G switch/network
>>>> >> > with 3 servers on it with 12 OSDs each.
>>>> >> > I'm decommissioning the 3 old nodes on the 1G network...
>>>> >> >
>>>> >> > So you suggest removing the whole node with its 2 OSDs manually from
>>>> >> > the crush map? To my knowledge, Ceph never places 2 replicas on 1 node;
>>>> >> > all 3 replicas were originally distributed over all 3 nodes. So it
>>>> >> > should anyway be safe to remove 2 OSDs at once together with the node
>>>> >> > itself... since the replica count is 3... ?
>>>> >> >
>>>> >> > Thx again for your time
>>>> >> >
>>>> >> > On Mar 3, 2015 1:35 PM, "Irek Fasikhov" <malm...@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Since you only have three nodes in the cluster,
>>>> >> >> I recommend you add the new nodes to the cluster first, and then
>>>> >> >> delete the old ones.
>>>> >> >>
>>>> >> >> 2015-03-03 15:28 GMT+03:00 Irek Fasikhov <malm...@gmail.com>:
>>>> >> >>>
>>>> >> >>> What is your replication count?
>>>> >> >>>
>>>> >> >>> 2015-03-03 15:14 GMT+03:00 Andrija Panic <andrija.pa...@gmail.com>:
>>>> >> >>>>
>>>> >> >>>> Hi Irek,
>>>> >> >>>>
>>>> >> >>>> yes, stopping the OSD (or setting it to OUT) resulted in only 3% of
>>>> >> >>>> data degraded and moved/recovered.
>>>> >> >>>> When I afterwards removed it from the crush map with "ceph osd crush
>>>> >> >>>> rm id", that's when the stuff with 37% happened.
>>>> >> >>>>
>>>> >> >>>> And thanks Irek for the help - could you kindly just let me know the
>>>> >> >>>> preferred steps when removing a whole node?
>>>> >> >>>> Do you mean I first stop all the OSDs again, or just remove each OSD
>>>> >> >>>> from the crush map, or perhaps just decompile the crush map, delete
>>>> >> >>>> the node completely, compile it back in, and let it heal/recover
>>>> >> >>>> (a command sketch for this follows below) ?
>>>> >> >>>>
>>>> >> >>>> Do you think this would result in less data being misplaced and moved
>>>> >> >>>> around ?
>>>> >> >>>>
>>>> >> >>>> Sorry for bugging you, I really appreciate your help.
>>>> >> >>>>
>>>> >> >>>> Thanks
>>>> >> >>>>
>>>> >> >>>> On 3 March 2015 at 12:58, Irek Fasikhov <malm...@gmail.com> wrote:
>>>> >> >>>>>
>>>> >> >>>>> A large percentage comes from the rebuild of the cluster map (but a
>>>> >> >>>>> low percentage of degradation). If you had not run "ceph osd crush
>>>> >> >>>>> rm id", the percentage would be low.
>>>> >> >>>>> In your case, the correct option is to remove the entire node,
>>>> >> >>>>> rather than each disk individually.
>>>> >> >>>>>
>>>> >> >>>>> 2015-03-03 14:27 GMT+03:00 Andrija Panic <andrija.pa...@gmail.com>:
>>>> >> >>>>>>
>>>> >> >>>>>> Another question - I mentioned here 37% of objects being moved
>>>> >> >>>>>> around - these are MISPLACED objects (degraded objects were 0.001%),
>>>> >> >>>>>> after I removed 1 OSD from the crush map (out of 44 OSDs or so).
>>>> >> >>>>>>
>>>> >> >>>>>> Can anybody confirm this is normal behaviour - and are there any
>>>> >> >>>>>> workarounds ?
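Regarding the decompile/edit/recompile question above, a rough sketch of that workflow; the file names and the host bucket name "oldnode1" are placeholders:

    # export and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt: delete the host bucket (e.g. "oldnode1") and the
    # line that references it under the root/default bucket
    crushtool -c crushmap.txt -o crushmap-new.bin
    ceph osd setcrushmap -i crushmap-new.bin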
>>>> >> >>>>>>
>>>> >> >>>>>> I understand this is because of the object placement algorithm of
>>>> >> >>>>>> Ceph, but still, 37% of objects misplaced just by removing 1 OSD
>>>> >> >>>>>> out of 44 from the crush map makes me wonder why the percentage is
>>>> >> >>>>>> this large ?
>>>> >> >>>>>>
>>>> >> >>>>>> It seems not good to me, and I have to remove another 7 OSDs (we
>>>> >> >>>>>> are demoting some old hardware nodes). This means I can potentially
>>>> >> >>>>>> end up with 7 x the same number of misplaced objects...?
>>>> >> >>>>>>
>>>> >> >>>>>> Any thoughts ?
>>>> >> >>>>>>
>>>> >> >>>>>> Thanks
>>>> >> >>>>>>
>>>> >> >>>>>> On 3 March 2015 at 12:14, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>> >> >>>>>>>
>>>> >> >>>>>>> Thanks Irek.
>>>> >> >>>>>>>
>>>> >> >>>>>>> Does this mean that after peering for each PG there will be a
>>>> >> >>>>>>> delay of 10 sec, meaning that every once in a while I will have
>>>> >> >>>>>>> 10 sec of the cluster NOT being stressed/overloaded, then the
>>>> >> >>>>>>> recovery takes place for that PG, then for another 10 sec the
>>>> >> >>>>>>> cluster is fine, and then it is stressed again ?
>>>> >> >>>>>>>
>>>> >> >>>>>>> I'm trying to understand the process before actually doing stuff
>>>> >> >>>>>>> (the config reference is there on ceph.com, but I don't fully
>>>> >> >>>>>>> understand the process)
>>>> >> >>>>>>>
>>>> >> >>>>>>> Thanks,
>>>> >> >>>>>>> Andrija
>>>> >> >>>>>>>
>>>> >> >>>>>>> On 3 March 2015 at 11:32, Irek Fasikhov <malm...@gmail.com> wrote:
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> Hi.
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> Use the value "osd_recovery_delay_start". Example:
>>>> >> >>>>>>>> [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start
>>>> >> >>>>>>>>   "osd_recovery_delay_start": "10"
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> 2015-03-03 13:13 GMT+03:00 Andrija Panic <andrija.pa...@gmail.com>:
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> Hi Guys,
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> Yesterday I removed 1 OSD from the cluster (out of 42 OSDs), and
>>>> >> >>>>>>>>> it caused over 37% of the data to rebalance - let's say this is
>>>> >> >>>>>>>>> fine (this is when I removed it from the crush map).
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> I'm wondering - I had previously set some throttling mechanisms,
>>>> >> >>>>>>>>> but during the first 1h of rebalancing my recovery rate was going
>>>> >> >>>>>>>>> up to 1500 MB/s - and the VMs were completely unusable - and then
>>>> >> >>>>>>>>> for the last 4h of the recovery this rate went down to, say,
>>>> >> >>>>>>>>> 100-200 MB/s, and during this time VM performance was still
>>>> >> >>>>>>>>> pretty impacted, but at least I could work more or less.
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> So my question is: is this behaviour expected, and is the
>>>> >> >>>>>>>>> throttling here working as expected? During the first 1h almost
>>>> >> >>>>>>>>> no throttling seemed to be applied, judging by the recovery rate
>>>> >> >>>>>>>>> of 1500 MB/s and the impact on the VMs.
>>>> >> >>>>>>>>> And the last 4h seemed pretty fine (although there was still a
>>>> >> >>>>>>>>> lot of impact in general).
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> I changed this throttling on the fly with:
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> ceph tell osd.* injectargs '--osd_recovery_max_active 1'
>>>> >> >>>>>>>>> ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
>>>> >> >>>>>>>>> ceph tell osd.* injectargs '--osd_max_backfills 1'
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> My journals are on SSDs (12 OSDs per server, of which 6 journals
>>>> >> >>>>>>>>> are on one SSD and 6 journals on another SSD) - I have 3 of these
>>>> >> >>>>>>>>> hosts.
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> Any thoughts are welcome.
>>>> >> >>>>>>>>> --
>>>> >> >>>>>>>>> Andrija Panić
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> _______________________________________________
>>>> >> >>>>>>>>> ceph-users mailing list
>>>> >> >>>>>>>>> ceph-users@lists.ceph.com
>>>> >> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> --
>>>> >> >>>>>>>> Best regards, Irek Fasikhov
>>>> >> >>>>>>>> Mob.: +79229045757

--
Andrija Panić
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com