osd_recovery_delay_start is the delay, in seconds, between recovery iterations (osd_recovery_max_active).
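For illustration - a minimal sketch, assuming you want the 10-second value shown further down in this thread - the setting can be changed at runtime the same way as the other recovery options mentioned below, or persisted in ceph.conf:

    # runtime, on all OSDs
    ceph tell osd.* injectargs '--osd_recovery_delay_start 10'

    # persistent, in ceph.conf on the OSD hosts
    [osd]
    osd recovery delay start = 10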
It is described here:
https://github.com/ceph/ceph/search?utf8=%E2%9C%93&q=osd_recovery_delay_start

2015-03-03 14:27 GMT+03:00 Andrija Panic <andrija.pa...@gmail.com>:

> Another question - I mentioned here 37% of objects being moved around -
> these are MISPLACED objects (degraded objects were 0.001%), after I
> removed 1 OSD from the CRUSH map (out of 44 OSDs or so).
>
> Can anybody confirm this is normal behaviour - and are there any
> workarounds?
>
> I understand this is because of CEPH's object placement algorithm, but
> 37% of objects misplaced just by removing 1 OSD out of 44 from the CRUSH
> map makes me wonder why the percentage is so large.
>
> It seems not good to me, and I have to remove another 7 OSDs (we are
> demoting some old hardware nodes). This means I could potentially see
> 7 x the same number of misplaced objects...?
>
> Any thoughts?
>
> Thanks
>
> On 3 March 2015 at 12:14, Andrija Panic <andrija.pa...@gmail.com> wrote:
>
>> Thanks Irek.
>>
>> Does this mean that after peering for each PG there will be a delay of
>> 10 sec - meaning that every once in a while I will have 10 sec of the
>> cluster NOT being stressed/overloaded, then recovery takes place for
>> that PG, then the cluster is fine for another 10 sec, and then stressed
>> again?
>>
>> I'm trying to understand the process before actually doing stuff (the
>> config reference is there on ceph.com, but I don't fully understand the
>> process).
>>
>> Thanks,
>> Andrija
>>
>> On 3 March 2015 at 11:32, Irek Fasikhov <malm...@gmail.com> wrote:
>>
>>> Hi.
>>>
>>> Use the value "osd_recovery_delay_start".
>>> Example:
>>> [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show | grep osd_recovery_delay_start
>>>   "osd_recovery_delay_start": "10"
>>>
>>> 2015-03-03 13:13 GMT+03:00 Andrija Panic <andrija.pa...@gmail.com>:
>>>
>>>> Hi guys,
>>>>
>>>> Yesterday I removed 1 OSD from the cluster (out of 42 OSDs), and it
>>>> caused over 37% of the data to rebalance - let's say this is fine
>>>> (this happened when I removed it from the CRUSH map).
>>>>
>>>> I'm wondering - I had previously set some throttling mechanisms, but
>>>> during the first hour of rebalancing my recovery rate went up to
>>>> 1500 MB/s and the VMs were completely unusable; over the last 4 hours
>>>> of the recovery the rate dropped to, say, 100-200 MB/s, and during
>>>> that time VM performance was still quite impacted, but at least I
>>>> could work more or less.
>>>>
>>>> So my question: is this behaviour expected, and is throttling working
>>>> as expected here? During the first hour almost no throttling seemed
>>>> to be applied, judging by the 1500 MB/s recovery rate and the impact
>>>> on the VMs, while the last 4 hours seemed pretty fine (although there
>>>> was still a lot of impact in general).
>>>>
>>>> I changed these throttling settings on the fly with:
>>>>
>>>> ceph tell osd.* injectargs '--osd_recovery_max_active 1'
>>>> ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
>>>> ceph tell osd.* injectargs '--osd_max_backfills 1'
>>>>
>>>> My journals are on SSDs (12 OSDs per server, with 6 journals on one
>>>> SSD and 6 journals on another) - I have 3 of these hosts.
>>>>
>>>> Any thoughts are welcome.
>>>> --
>>>> Andrija Panić
>>>
>>> --
>>> Best regards, Фасихов Ирек Нургаязович
>>> Mob.: +79229045757
>>
>> --
>> Andrija Panić
>
> --
> Andrija Panić

--
Best regards, Фасихов Ирек Нургаязович
Mob.: +79229045757
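A quick way to confirm that values injected with 'ceph tell osd.* injectargs' actually took effect is the same admin-socket query shown earlier in the thread - a minimal sketch, reusing the example socket path from above (substitute your own OSD id):

    # show the current throttling values on one OSD
    ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show \
        | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority|osd_recovery_delay_start'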
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com