Thanks a lot Robert. I have actually already tried the following:
a) Set one OSD to out (6% of data misplaced, Ceph recovered fine), stopped the OSD, removed the OSD from the crush map (again 36% of data misplaced !!!) - then inserted the OSD back into the crush map, and those 36% of misplaced objects disappeared, of course - I've undone the crush remove... so damage undone - the OSD is just "out" and the cluster is healthy again.

b) Set norecover and nobackfill, and then:
- Removed one OSD from crush (the running OSD, not the one from point a) - only 18% of data misplaced !!! (no recovery was happening though, because of norecover, nobackfill)
- Removed another OSD from the same node - a total of only 20% of objects misplaced (with 2 OSDs on the same node removed from the crush map)
- So these 2 OSDs were still running, UP and IN, and I just removed them from the crush map, per the advice to avoid calculating the CRUSH map twice - from: http://image.slidesharecdn.com/scalingcephatcern-140311134847-phpapp01/95/scaling-ceph-at-cern-ceph-day-frankfurt-19-638.jpg?cb=1394564547
- And I added these 2 OSDs back to the crush map; this was just a test...

So the algorithm is very funny in some respects, but it's all pseudo-random stuff so I kind of understand... I will share my findings during the rest of the OSD demotion, after I demote them...

Thanks for your detailed inputs !
Andrija

On 5 March 2015 at 22:51, Robert LeBlanc <rob...@leblancnet.us> wrote:

> Setting an OSD out will start the rebalance with the degraded object
> count. The OSD is still alive and can participate in the relocation of the
> objects. This is preferable so that you don't happen to drop below
> min_size because a disk fails during the rebalance, at which point I/O
> stops on the cluster.
>
> Because CRUSH is an algorithm, anything that changes its input will cause
> a change in the output (location). When you set an OSD out or it fails,
> that changes the CRUSH input, but the host and the weight of the host are
> still in effect. When you remove the host or change the weight of the host
> (by removing a single OSD), it makes a change to the algorithm's input
> which will also cause some changes in how it computes the locations.
>
> Disclaimer - I have not tried this
>
> It may be possible to minimize the data movement by doing the following:
>
>    1. Set norecover and nobackfill on the cluster
>    2. Set the OSDs to be removed to "out"
>    3. Adjust the weight of the hosts in the CRUSH map (if removing all
>    OSDs for a host, set it to zero)
>    4. If you have new OSDs to add, add them into the cluster now
>    5. Once all OSD changes have been entered, unset norecover and
>    nobackfill
>    6. This will migrate the data off the old OSDs and onto the new OSDs
>    in one swoop.
>    7. Once the data migration is complete, set norecover and nobackfill
>    on the cluster again.
>    8. Remove the old OSDs
>    9. Unset norecover and nobackfill
>
> The theory is that by setting the host weights to 0, removing the
> OSDs/hosts later should minimize the data movement afterwards, because the
> algorithm should have already dropped them as candidates for placement.
>
> If this works right, then you basically queue up a bunch of small changes,
> do one data movement, always keep all copies of your objects online, and
> minimize the impact of the data movement by leveraging both your old and
> new hardware at the same time.
>
> If you try this, please report back on your experience. I might try it
> in my lab, but I'm really busy at the moment so I don't know if I'll get
> to it real soon.
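As a rough command-line sketch of the queued-change procedure above (not something Robert spelled out in the thread): the OSD ids osd.0/osd.1 and the assumption that one whole host is being drained are placeholders.

    # 1. pause data movement while the changes are queued up
    ceph osd set norecover
    ceph osd set nobackfill
    # 2. mark the OSDs that will be removed as out
    ceph osd out 0
    ceph osd out 1
    # 3. take the old host's weight out of the CRUSH calculation by
    #    reweighting each of its OSDs to 0 (the host bucket weight follows)
    ceph osd crush reweight osd.0 0
    ceph osd crush reweight osd.1 0
    # 4. add/create any new OSDs now with your usual tooling, then
    # 5./6. let the single large data movement run
    ceph osd unset nobackfill
    ceph osd unset norecover
    # 7.-9. once migration completes, pause again, drop the old OSDs, unpause
    ceph osd set norecover && ceph osd set nobackfill
    ceph osd crush remove osd.0 && ceph auth del osd.0 && ceph osd rm 0
    ceph osd crush remove osd.1 && ceph auth del osd.1 && ceph osd rm 1
    ceph osd unset nobackfill && ceph osd unset norecover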
>
> On Thu, Mar 5, 2015 at 12:53 PM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>
>> Hi Robert,
>>
>> it seems I have not listened well to your advice - I set the osd to out
>> instead of stopping it - and now, instead of some ~3% of degraded objects,
>> there is 0.000% degraded and around 6% misplaced - and rebalancing is
>> happening again, but this is a small percentage..
>>
>> Do you know if later, when I remove this OSD from the crush map, no more
>> data will be rebalanced (as per the official Ceph documentation) - since the
>> already misplaced objects are getting distributed away to all the other nodes ?
>>
>> (after "service ceph stop osd.0" there was 2.45% degraded data - but no
>> backfilling was happening for some reason... it just stayed degraded... which
>> is the reason why I started the OSD back up and then set it to out...)
>>
>> Thanks
>>
>> On 4 March 2015 at 17:54, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>
>>> Hi Robert,
>>>
>>> I already have this stuff set. Ceph is 0.87.0 now...
>>>
>>> Thanks, will schedule this for the weekend; with the 10G network and 36 OSDs
>>> it should move the data in less than 8h, per my last experience which was
>>> around 8h, but some 1G OSDs were included...
>>>
>>> Thx!
>>>
>>> On 4 March 2015 at 17:49, Robert LeBlanc <rob...@leblancnet.us> wrote:
>>>
>>>> You will most likely have a very high relocation percentage. Backfills
>>>> are always more impactful on smaller clusters, but "osd max backfills"
>>>> should be what you need to help reduce the impact. The default is 10;
>>>> you will want to use 1.
>>>>
>>>> I didn't catch which version of Ceph you are running, but I think
>>>> there was some priority work done in firefly to help make backfills
>>>> lower priority. I think it has gotten better in later versions.
>>>>
>>>> On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>> > Thank you Robert - I'm wondering, when I do remove a total of 7 OSDs
>>>> > from the crush map, whether that will cause more than 37% of the data
>>>> > to be moved (80% or whatever).
>>>> >
>>>> > I'm also wondering if the throttling that I applied is fine or not - I
>>>> > will introduce the osd_recovery_delay_start of 10 sec as Irek said.
>>>> >
>>>> > I'm just wondering how much the performance impact will be, because:
>>>> > - when stopping an OSD, the impact while backfilling was fine, more or
>>>> > less - I can live with this
>>>> > - when I removed an OSD from the crush map - for the first 1h or so the
>>>> > impact was tremendous, and later on during the recovery process the
>>>> > impact was much less, but still noticeable...
>>>> >
>>>> > Thanks for the tip of course !
>>>> > Andrija
>>>> >
>>>> > On 3 March 2015 at 18:34, Robert LeBlanc <rob...@leblancnet.us> wrote:
>>>> >>
>>>> >> I would be inclined to shut down both OSDs in a node and let the
>>>> >> cluster recover. Once it is recovered, shut down the next two, let it
>>>> >> recover. Repeat until all the OSDs are taken out of the cluster. Then
>>>> >> I would set nobackfill and norecover. Then remove the hosts/disks from
>>>> >> the CRUSH map, then unset nobackfill and norecover.
>>>> >>
>>>> >> That should give you a few small changes (when you shut down OSDs) and
>>>> >> then one big one to get everything into its final place. If you are
>>>> >> still adding new nodes, you can add them in while nobackfill and
>>>> >> norecover are set, so that the one big relocate fills the new drives too.
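As a sketch of the throttling discussed above (the values are only illustrative), both settings can be injected into the running OSDs without a restart:

    # limit concurrent backfills per OSD (the default is 10)
    ceph tell osd.* injectargs '--osd_max_backfills 1'
    # wait 10 seconds after peering before recovery starts, as Irek suggests below
    ceph tell osd.* injectargs '--osd_recovery_delay_start 10'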
>>>> >>
>>>> >> On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>> >> > Thx Irek. The number of replicas is 3.
>>>> >> >
>>>> >> > I have 3 servers with 2 OSDs each on a 1G switch (1 OSD already
>>>> >> > decommissioned), which is further connected to a new 10G switch/network
>>>> >> > with 3 servers on it with 12 OSDs each.
>>>> >> > I'm decommissioning the 3 old nodes on the 1G network...
>>>> >> >
>>>> >> > So you suggest removing the whole node with its 2 OSDs manually from
>>>> >> > the crush map? To my knowledge, Ceph never places 2 replicas on 1 node;
>>>> >> > all 3 replicas were originally distributed over all 3 nodes. So it
>>>> >> > should anyway be safe to remove 2 OSDs at once together with the node
>>>> >> > itself... since the replica count is 3... ?
>>>> >> >
>>>> >> > Thx again for your time
>>>> >> >
>>>> >> > On Mar 3, 2015 1:35 PM, "Irek Fasikhov" <malm...@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Since you only have three nodes in the cluster,
>>>> >> >> I recommend you add the new nodes to the cluster first, and then
>>>> >> >> delete the old ones.
>>>> >> >>
>>>> >> >> 2015-03-03 15:28 GMT+03:00 Irek Fasikhov <malm...@gmail.com>:
>>>> >> >>>
>>>> >> >>> What is your replication count?
>>>> >> >>>
>>>> >> >>> 2015-03-03 15:14 GMT+03:00 Andrija Panic <andrija.pa...@gmail.com>:
>>>> >> >>>>
>>>> >> >>>> Hi Irek,
>>>> >> >>>>
>>>> >> >>>> yes, stopping the OSD (or setting it to OUT) resulted in only 3% of
>>>> >> >>>> data degraded and moved/recovered.
>>>> >> >>>> When I afterwards removed it from the crush map with "ceph osd crush
>>>> >> >>>> rm id", that's when the stuff with 37% happened.
>>>> >> >>>>
>>>> >> >>>> And thanks Irek for the help - could you kindly just let me know the
>>>> >> >>>> preferred steps when removing a whole node?
>>>> >> >>>> Do you mean I first stop all the OSDs again, or just remove each OSD
>>>> >> >>>> from the crush map, or perhaps just decompile the crush map, delete
>>>> >> >>>> the node completely, compile it back in, and let it heal/recover
>>>> >> >>>> (a command sketch for this follows below) ?
>>>> >> >>>>
>>>> >> >>>> Do you think this would result in less data being misplaced and moved
>>>> >> >>>> around ?
>>>> >> >>>>
>>>> >> >>>> Sorry for bugging you, I really appreciate your help.
>>>> >> >>>>
>>>> >> >>>> Thanks
>>>> >> >>>>
>>>> >> >>>> On 3 March 2015 at 12:58, Irek Fasikhov <malm...@gmail.com> wrote:
>>>> >> >>>>>
>>>> >> >>>>> A large percentage comes from the rebuild of the cluster map (but a
>>>> >> >>>>> low percentage of degradation). If you had not run "ceph osd crush
>>>> >> >>>>> rm id", the percentage would be low.
>>>> >> >>>>> In your case, the correct option is to remove the entire node,
>>>> >> >>>>> rather than each disk individually.
>>>> >> >>>>>
>>>> >> >>>>> 2015-03-03 14:27 GMT+03:00 Andrija Panic <andrija.pa...@gmail.com>:
>>>> >> >>>>>>
>>>> >> >>>>>> Another question - I mentioned here 37% of objects being moved
>>>> >> >>>>>> around - these are MISPLACED objects (degraded objects were 0.001%),
>>>> >> >>>>>> after I removed 1 OSD from the crush map (out of 44 OSDs or so).
>>>> >> >>>>>>
>>>> >> >>>>>> Can anybody confirm this is normal behaviour - and are there any
>>>> >> >>>>>> workarounds ?
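Regarding the decompile/edit/recompile question above, a rough sketch of that workflow; the file names and the host bucket name "oldnode1" are placeholders:

    # export and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt: delete the host bucket (e.g. "oldnode1") and the
    # line that references it under the root/default bucket
    crushtool -c crushmap.txt -o crushmap-new.bin
    ceph osd setcrushmap -i crushmap-new.bin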
>>>> >> >>>>>>
>>>> >> >>>>>> I understand this is because of the object placement algorithm of
>>>> >> >>>>>> Ceph, but still, 37% of objects misplaced just by removing 1 OSD
>>>> >> >>>>>> out of 44 from the crush map makes me wonder why the percentage is
>>>> >> >>>>>> this large ?
>>>> >> >>>>>>
>>>> >> >>>>>> It seems not good to me, and I have to remove another 7 OSDs (we
>>>> >> >>>>>> are demoting some old hardware nodes). This means I can potentially
>>>> >> >>>>>> end up with 7 x the same number of misplaced objects...?
>>>> >> >>>>>>
>>>> >> >>>>>> Any thoughts ?
>>>> >> >>>>>>
>>>> >> >>>>>> Thanks
>>>> >> >>>>>>
>>>> >> >>>>>> On 3 March 2015 at 12:14, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>> >> >>>>>>>
>>>> >> >>>>>>> Thanks Irek.
>>>> >> >>>>>>>
>>>> >> >>>>>>> Does this mean that after peering for each PG there will be a
>>>> >> >>>>>>> delay of 10 sec, meaning that every once in a while I will have
>>>> >> >>>>>>> 10 sec of the cluster NOT being stressed/overloaded, then the
>>>> >> >>>>>>> recovery takes place for that PG, then for another 10 sec the
>>>> >> >>>>>>> cluster is fine, and then it is stressed again ?
>>>> >> >>>>>>>
>>>> >> >>>>>>> I'm trying to understand the process before actually doing stuff
>>>> >> >>>>>>> (the config reference is there on ceph.com, but I don't fully
>>>> >> >>>>>>> understand the process)
>>>> >> >>>>>>>
>>>> >> >>>>>>> Thanks,
>>>> >> >>>>>>> Andrija
>>>> >> >>>>>>>
>>>> >> >>>>>>> On 3 March 2015 at 11:32, Irek Fasikhov <malm...@gmail.com> wrote:
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> Hi.
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> Use the value "osd_recovery_delay_start". Example:
>>>> >> >>>>>>>> [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start
>>>> >> >>>>>>>>   "osd_recovery_delay_start": "10"
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> 2015-03-03 13:13 GMT+03:00 Andrija Panic <andrija.pa...@gmail.com>:
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> Hi Guys,
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> Yesterday I removed 1 OSD from the cluster (out of 42 OSDs), and
>>>> >> >>>>>>>>> it caused over 37% of the data to rebalance - let's say this is
>>>> >> >>>>>>>>> fine (this is when I removed it from the crush map).
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> I'm wondering - I had previously set some throttling mechanisms,
>>>> >> >>>>>>>>> but during the first 1h of rebalancing my recovery rate was going
>>>> >> >>>>>>>>> up to 1500 MB/s - and the VMs were completely unusable - and then
>>>> >> >>>>>>>>> for the last 4h of the recovery this rate went down to, say,
>>>> >> >>>>>>>>> 100-200 MB/s, and during this time VM performance was still
>>>> >> >>>>>>>>> pretty impacted, but at least I could work more or less.
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> So my question is: is this behaviour expected, and is the
>>>> >> >>>>>>>>> throttling here working as expected? During the first 1h almost
>>>> >> >>>>>>>>> no throttling seemed to be applied, judging by the recovery rate
>>>> >> >>>>>>>>> of 1500 MB/s and the impact on the VMs.
>>>> >> >>>>>>>>> And the last 4h seemed pretty fine (although there was still a
>>>> >> >>>>>>>>> lot of impact in general).
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> I changed this throttling on the fly with:
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> ceph tell osd.* injectargs '--osd_recovery_max_active 1'
>>>> >> >>>>>>>>> ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
>>>> >> >>>>>>>>> ceph tell osd.* injectargs '--osd_max_backfills 1'
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> My journals are on SSDs (12 OSDs per server, of which 6 journals
>>>> >> >>>>>>>>> are on one SSD and 6 journals on another SSD) - I have 3 of these
>>>> >> >>>>>>>>> hosts.
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> Any thoughts are welcome.
>>>> >> >>>>>>>>> --
>>>> >> >>>>>>>>> Andrija Panić
>>>> >> >>>>>>>>>
>>>> >> >>>>>>>>> _______________________________________________
>>>> >> >>>>>>>>> ceph-users mailing list
>>>> >> >>>>>>>>> ceph-users@lists.ceph.com
>>>> >> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >> >>>>>>>>
>>>> >> >>>>>>>> --
>>>> >> >>>>>>>> Best regards, Irek Fasikhov
>>>> >> >>>>>>>> Mob.: +79229045757

--
Andrija Panić
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com